[python] MongoDB

Documentation:

http://api.mongodb.com/python/current/tutorial.html

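Installation:

Download the installer directly from the official site. On a Mac, installing through brew downloads too slowly, so I plan to install manually.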

Usage:

Start the server:

mongod # start the server with the default configuration
mongod --dbpath <db path> # specify the path to the database files

Connect to the server:

mongo # connect with the default configuration
mongo [options] [db address] [file names (ending in .js)]

GUI client:

https://www.robomongo.org/

shell:

> help
    db.help()                    help on db methods
    db.mycoll.help()             help on collection methods
    sh.help()                    sharding helpers
    rs.help()                    replica set helpers
    help admin                   administrative help
    help connect                 connecting to a db help
    help keys                    key shortcuts
    help misc                    misc things to know
    help mr                      mapreduce

    show dbs                     show database names
    show collections             show collections in current database
    show users                   show users in current database
    show profile                 show most recent system.profile entries with time >= 1ms
    show logs                    show the accessible logger names
    show log [name]              prints out the last segment of log in memory, 'global' is default
    use <db_name>                set current database
    db.foo.find()                list objects in collection foo
    db.foo.find( { a : 1 } )     list objects in foo where a == 1
    it                           result of the last line evaluated; use to further iterate
    DBQuery.shellBatchSize = x   set default number of items to display on shell
    exit                         quit the mongo shell

more help...

> db.help()
DB methods:
    db.adminCommand(nameOrDocument) - switches to 'admin' db, and runs command [just calls db.runCommand(...)]
    db.aggregate([pipeline], {options}) - performs a collectionless aggregation on this database; returns a cursor
    db.auth(username, password)
    db.cloneDatabase(fromhost)
    db.commandHelp(name) returns the help for the command
    db.copyDatabase(fromdb, todb, fromhost)
    db.createCollection(name, {size: ..., capped: ..., max: ...})
    db.createView(name, viewOn, [{$operator: {...}}, ...], {viewOptions})
    db.createUser(userDocument)
    db.currentOp() displays currently executing operations in the db
    db.dropDatabase()
    db.eval() - deprecated
    db.fsyncLock() flush data to disk and lock server for backups
    db.fsyncUnlock() unlocks server following a db.fsyncLock()
    db.getCollection(cname) same as db['cname'] or db.cname
    db.getCollectionInfos([filter]) - returns a list that contains the names and options of the db's collections
    db.getCollectionNames()
    db.getLastError() - just returns the err msg string
    db.getLastErrorObj() - return full status object
    db.getLogComponents()
    db.getMongo() get the server connection object
    db.getMongo().setSlaveOk() allow queries on a replication slave server
    db.getName()
    db.getPrevError()
    db.getProfilingLevel() - deprecated
    db.getProfilingStatus() - returns if profiling is on and slow threshold
    db.getReplicationInfo()
    db.getSiblingDB(name) get the db at the same server as this one
    db.getWriteConcern() - returns the write concern used for any operations on this db, inherited from server object if set
    db.hostInfo() get details about the server's host
    db.isMaster() check replica primary status
    db.killOp(opid) kills the current operation in the db
    db.listCommands() lists all the db commands
    db.loadServerScripts() loads all the scripts in db.system.js
    db.logout()
    db.printCollectionStats()
    db.printReplicationInfo()
    db.printShardingStatus()
    db.printSlaveReplicationInfo()
    db.dropUser(username)
    db.repairDatabase()
    db.resetError()
    db.runCommand(cmdObj) run a database command.  if cmdObj is a string, turns it into {cmdObj: 1}
    db.serverStatus()
    db.setLogLevel(level,<component>)
    db.setProfilingLevel(level,slowms) 0=off 1=slow 2=all
    db.setWriteConcern(<write concern doc>) - sets the write concern for writes to the db
    db.unsetWriteConcern(<write concern doc>) - unsets the write concern for writes to the db
    db.setVerboseShell(flag) display extra information in shell output
    db.shutdownServer()
    db.stats()
    db.version() current version of the server
>

DB methods

> db.mycoll.help()
DBCollection help
    db.mycoll.find().help() - show DBCursor help
    db.mycoll.bulkWrite( operations, <optional params> ) - bulk execute write operations, optional parameters are: w, wtimeout, j
    db.mycoll.count( query = {}, <optional params> ) - count the number of documents that matches the query, optional parameters are: limit, skip, hint, maxTimeMS
    db.mycoll.copyTo(newColl) - duplicates collection by copying all documents to newColl; no indexes are copied.
    db.mycoll.convertToCapped(maxBytes) - calls {convertToCapped:'mycoll', size:maxBytes}} command
    db.mycoll.createIndex(keypattern[,options])
    db.mycoll.createIndexes([keypatterns], <options>)
    db.mycoll.dataSize()
    db.mycoll.deleteOne( filter, <optional params> ) - delete first matching document, optional parameters are: w, wtimeout, j
    db.mycoll.deleteMany( filter, <optional params> ) - delete all matching documents, optional parameters are: w, wtimeout, j
    db.mycoll.distinct( key, query, <optional params> ) - e.g. db.mycoll.distinct( 'x' ), optional parameters are: maxTimeMS
    db.mycoll.drop() drop the collection
    db.mycoll.dropIndex(index) - e.g. db.mycoll.dropIndex( "indexName" ) or db.mycoll.dropIndex( { "indexKey" : 1 } )
    db.mycoll.dropIndexes()
    db.mycoll.ensureIndex(keypattern[,options]) - DEPRECATED, use createIndex() instead
    db.mycoll.explain().help() - show explain help
    db.mycoll.reIndex()
    db.mycoll.find([query],[fields]) - query is an optional query filter. fields is optional set of fields to return.
                                                  e.g. db.mycoll.find( {x:77} , {name:1, x:1} )
    db.mycoll.find(...).count()
    db.mycoll.find(...).limit(n)
    db.mycoll.find(...).skip(n)
    db.mycoll.find(...).sort(...)
    db.mycoll.findOne([query], [fields], [options], [readConcern])
    db.mycoll.findOneAndDelete( filter, <optional params> ) - delete first matching document, optional parameters are: projection, sort, maxTimeMS
    db.mycoll.findOneAndReplace( filter, replacement, <optional params> ) - replace first matching document, optional parameters are: projection, sort, maxTimeMS, upsert, returnNewDocument
    db.mycoll.findOneAndUpdate( filter, update, <optional params> ) - update first matching document, optional parameters are: projection, sort, maxTimeMS, upsert, returnNewDocument
    db.mycoll.getDB() get DB object associated with collection
    db.mycoll.getPlanCache() get query plan cache associated with collection
    db.mycoll.getIndexes()
    db.mycoll.group( { key : ..., initial: ..., reduce : ...[, cond: ...] } )
    db.mycoll.insert(obj)
    db.mycoll.insertOne( obj, <optional params> ) - insert a document, optional parameters are: w, wtimeout, j
    db.mycoll.insertMany( [objects], <optional params> ) - insert multiple documents, optional parameters are: w, wtimeout, j
    db.mycoll.mapReduce( mapFunction , reduceFunction , <optional params> )
    db.mycoll.aggregate( [pipeline], <optional params> ) - performs an aggregation on a collection; returns a cursor
    db.mycoll.remove(query)
    db.mycoll.replaceOne( filter, replacement, <optional params> ) - replace the first matching document, optional parameters are: upsert, w, wtimeout, j
    db.mycoll.renameCollection( newName , <dropTarget> ) renames the collection.
    db.mycoll.runCommand( name , <options> ) runs a db command with the given name where the first param is the collection name
    db.mycoll.save(obj)
    db.mycoll.stats({scale: N, indexDetails: true/false, indexDetailsKey: <index key>, indexDetailsName: <index name>})
    db.mycoll.storageSize() - includes free space allocated to this collection
    db.mycoll.totalIndexSize() - size in bytes of all the indexes
    db.mycoll.totalSize() - storage allocated for all data and indexes
    db.mycoll.update( query, object[, upsert_bool, multi_bool] ) - instead of two flags, you can pass an object with fields: upsert, multi
    db.mycoll.updateOne( filter, update, <optional params> ) - update the first matching document, optional parameters are: upsert, w, wtimeout, j
    db.mycoll.updateMany( filter, update, <optional params> ) - update all matching documents, optional parameters are: upsert, w, wtimeout, j
    db.mycoll.validate( <full> ) - SLOW
    db.mycoll.getShardVersion() - only for use with sharding
    db.mycoll.getShardDistribution() - prints statistics about data distribution in the cluster
    db.mycoll.getSplitKeysForChunks( <maxChunkSize> ) - calculates split points over all chunks and returns splitter function
    db.mycoll.getWriteConcern() - returns the write concern used for any operations on this collection, inherited from server/db if set
    db.mycoll.setWriteConcern( <write concern doc> ) - sets the write concern for writes to the collection
    db.mycoll.unsetWriteConcern( <write concern doc> ) - unsets the write concern for writes to the collection
    db.mycoll.latencyStats() - display operation latency histograms for this collection
>

Collection methods

> sh.help()
    sh.addShard( host )                       server:port OR setname/server:port
    sh.addShardToZone(shard,zone)             adds the shard to the zone
    sh.updateZoneKeyRange(fullName,min,max,zone)      assigns the specified range of the given collection to a zone
    sh.disableBalancing(coll)                 disable balancing on one collection
    sh.enableBalancing(coll)                  re-enable balancing on one collection
    sh.enableSharding(dbname)                 enables sharding on the database dbname
    sh.getBalancerState()                     returns whether the balancer is enabled
    sh.isBalancerRunning()                    return true if the balancer has work in progress on any mongos
    sh.moveChunk(fullName,find,to)            move the chunk where 'find' is to 'to' (name of shard)
    sh.removeShardFromZone(shard,zone)      removes the shard from zone
    sh.removeRangeFromZone(fullName,min,max)   removes the range of the given collection from any zone
    sh.shardCollection(fullName,key,unique,options)   shards the collection
    sh.splitAt(fullName,middle)               splits the chunk that middle is in at middle
    sh.splitFind(fullName,find)               splits the chunk that find is in at the median
    sh.startBalancer()                        starts the balancer so chunks are balanced automatically
    sh.status()                               prints a general overview of the cluster
    sh.stopBalancer()                         stops the balancer so chunks are not balanced automatically
    sh.disableAutoSplit()                   disable autoSplit on one collection
    sh.enableAutoSplit()                    re-enable autoSplit on one collection
    sh.getShouldAutoSplit()                 returns whether autosplit is enabled
>

sharding helpers

> rs.help()
    rs.status()                                { replSetGetStatus : 1 } checks repl set status
    rs.initiate()                              { replSetInitiate : null } initiates set with default settings
    rs.initiate(cfg)                           { replSetInitiate : cfg } initiates set with configuration cfg
    rs.conf()                                  get the current configuration object from local.system.replset
    rs.reconfig(cfg)                           updates the configuration of a running replica set with cfg (disconnects)
    rs.add(hostportstr)                        add a new member to the set with default attributes (disconnects)
    rs.add(membercfgobj)                       add a new member to the set with extra attributes (disconnects)
    rs.addArb(hostportstr)                     add a new member which is arbiterOnly:true (disconnects)
    rs.stepDown([stepdownSecs, catchUpSecs])   step down as primary (disconnects)
    rs.syncFrom(hostportstr)                   make a secondary sync from the given member
    rs.freeze(secs)                            make a node ineligible to become primary for the time specified
    rs.remove(hostportstr)                     remove a host from the replica set (disconnects)
    rs.slaveOk()                               allow queries on secondary nodes

    rs.printReplicationInfo()                  check oplog size and time range
    rs.printSlaveReplicationInfo()             check replica set members and replication lag
    db.isMaster()                              check who is primary

    reconfiguration helpers disconnect from the database so the shell will display
    an error, even if the command succeeds.
>

replica set helpers

> help admin
    ls([path])                      list files
    pwd()                           returns current directory
    listFiles([path])               returns file list
    hostname()                      returns name of this host
    cat(fname)                      returns contents of text file as a string
    removeFile(f)                   delete a file or directory
    load(jsfilename)                load and execute a .js file
    run(program[, args...])         spawn a program and wait for its completion
    runProgram(program[, args...])  same as run(), above
    sleep(m)                        sleep m milliseconds
    getMemInfo()                    diagnostic
>

administrative help

> help connect

Normally one specifies the server on the mongo shell command line.  Run mongo --help to see those options.
Additional connections may be opened:

    var x = new Mongo('host[:port]');
    var mydb = x.getDB('mydb');
  or
    var mydb = connect('host[:port]/mydb');

Note: the REPL prompt only auto-reports getLastError() for the shell command line connection.

>

connect db help

> help keys
Tab completion and command history is available at the command prompt.

Some emacs keystrokes are available too:
  Ctrl-A start of line
  Ctrl-E end of line
  Ctrl-K del to end of line

Multi-line commands
You can enter a multi line javascript expression.  If parens, braces, etc. are not closed, you will see a new line
beginning with '...' characters.  Type the rest of your expression.  Press Ctrl-C to abort the data entry if you
get stuck.

>

shortcut keys

> help misc
    b = new BinData(subtype,base64str)  create a BSON BinData value
    b.subtype()                         the BinData subtype (0..255)
    b.length()                          length of the BinData data in bytes
    b.hex()                             the data as a hex encoded string
    b.base64()                          the data as a base 64 encoded string
    b.toString()

    b = HexData(subtype,hexstr)         create a BSON BinData value from a hex string
    b = UUID(hexstr)                    create a BSON BinData value of UUID subtype
    b = MD5(hexstr)                     create a BSON BinData value of MD5 subtype
    "hexstr"                            string, sequence of hex characters (no 0x prefix)

    o = new ObjectId()                  create a new ObjectId
    o.getTimestamp()                    return timestamp derived from first 32 bits of the OID
    o.isObjectId
    o.toString()
    o.equals(otherid)

    d = ISODate()                       like Date() but behaves more intuitively when used
    d = ISODate('YYYY-MM-DD hh:mm:ss')    without an explicit "new " prefix on construction
>

misc
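
The help text says o.getTimestamp() is derived from the first 32 bits of the ObjectId. A minimal Python sketch of that decoding, using only the standard library, might look like the following. The function name objectid_timestamp and the sample id are illustrative assumptions, not part of the shell or of pymongo.

```python
from datetime import datetime, timezone


def objectid_timestamp(oid_hex):
    """Return the creation time encoded in a 24-character ObjectId hex string."""
    if len(oid_hex) != 24:
        raise ValueError("an ObjectId is 12 bytes, i.e. 24 hex characters")
    # The first 4 bytes (8 hex characters) hold a big-endian Unix timestamp in seconds.
    seconds = int(oid_hex[:8], 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# Sample id chosen for illustration only.
created = objectid_timestamp("507f1f77bcf86cd799439011")
print(created.isoformat())
```

With pymongo installed, bson.ObjectId exposes the same information through its generation_time attribute, so this helper is only meant to show where the value comes from.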

> help mr

See also http://dochub.mongodb.org/core/mapreduce

function mapf() {
  // 'this' holds current document to inspect
  emit(key, value);
}

function reducef(key,value_array) {
  return reduced_value;
}

db.mycollection.mapReduce(mapf, reducef[, options])

options
{[query : <query filter object>]
 [, sort : <sort the query.  useful for optimization>]
 [, limit : <number of objects to return from collection>]
 [, out : <output-collection name>]
 [, keeptemp: <true|false>]
 [, finalize : <finalizefunction>]
 [, scope : <object where fields go into javascript global scope >]
 [, verbose : true]}

>
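
To make the mapf/reducef contract above concrete, here is a rough in-memory imitation in Python. It is not how the server executes mapReduce: the map_reduce function, the generator-based stand-in for emit(), and the sample orders data are all assumptions made for illustration.

```python
from collections import defaultdict


def map_reduce(documents, mapf, reducef):
    """Group emitted (key, value) pairs by key, then reduce each group."""
    groups = defaultdict(list)
    for doc in documents:
        # mapf yields (key, value) pairs; in the shell it would instead
        # call emit(key, value) with `this` bound to the document.
        for key, value in mapf(doc):
            groups[key].append(value)
    # Simplification: reducef is applied once per key over all its values.
    return {key: reducef(key, values) for key, values in groups.items()}


# Hypothetical sample data.
orders = [
    {"cust_id": "A", "amount": 5},
    {"cust_id": "B", "amount": 7},
    {"cust_id": "A", "amount": 10},
]


def mapf(doc):
    yield doc["cust_id"], doc["amount"]


def reducef(key, value_array):
    return sum(value_array)


totals = map_reduce(orders, mapf, reducef)
print(totals)
```

The sketch only mirrors the shape of the two callbacks; options such as query, sort, out and finalize are left out.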

Python driver

pip install pymongo
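
Once the driver is installed, connecting from Python could be sketched roughly as below. build_mongo_uri and connect are hypothetical helper names, and the host, port and database values are assumptions matching the defaults used elsewhere in these notes. connect needs pymongo and a running mongod, so it is defined here but not called.

```python
from urllib.parse import quote_plus


def build_mongo_uri(host="localhost", port=27017, db=None, user=None, password=None):
    """Assemble a mongodb:// URI, percent-escaping any credentials."""
    credentials = ""
    if user is not None:
        credentials = quote_plus(user)
        if password is not None:
            credentials += ":" + quote_plus(password)
        credentials += "@"
    path = "/" + db if db else ""
    return "mongodb://{0}{1}:{2}{3}".format(credentials, host, port, path)


def connect(uri, timeout_ms=2000):
    """Open a client and ping the server; requires pymongo and a running mongod."""
    import pymongo  # imported lazily so the URI helper works without the driver

    client = pymongo.MongoClient(uri, serverSelectionTimeoutMS=timeout_ms)
    client.admin.command("ping")  # raises if no server answers within the timeout
    return client


uri = build_mongo_uri(db="stackoverflow")
print(uri)
```

Escaping the user name and password matters because characters such as @ or : would otherwise be read as URI delimiters.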

Scrapy:

settings.py

# ITEM_PIPELINES is a dict mapping each pipeline to its order (0-1000);
# the list form is deprecated
ITEM_PIPELINES = {'stack.pipelines.MongoDBPipeline': 300}

MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "stackoverflow"
MONGODB_COLLECTION = "questions"

pipelines.py

import logging

import pymongo
from scrapy.exceptions import DropItem
from scrapy.utils.project import get_project_settings

# scrapy.conf and scrapy.log were deprecated and later removed from Scrapy,
# so load the settings explicitly and use the standard logging module
settings = get_project_settings()
logger = logging.getLogger(__name__)


class MongoDBPipeline(object):

    def __init__(self):
        connection = pymongo.MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        # Iterating over an item yields field names, so check the values
        for field in item:
            if not item[field]:
                raise DropItem("Missing {0}!".format(field))
        # insert() is deprecated in PyMongo 3 and removed in PyMongo 4
        self.collection.insert_one(dict(item))
        logger.debug("Question added to MongoDB database!")
        return item
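
The empty-field check in process_item can be tried out without Scrapy or MongoDB. The sketch below uses a plain dict in place of a Scrapy item, which is an assumption made for testability, and find_empty_fields is a hypothetical helper name.

```python
def find_empty_fields(item):
    """Return the names of fields whose values are empty or missing-like."""
    # Iterating over a mapping yields keys, so the values must be looked up
    # explicitly; `not value` also treats 0 and False as empty.
    return [field for field, value in item.items() if not value]


# Hypothetical scraped items.
complete = {"title": "How to ...", "url": "http://example.com/q/1"}
partial = {"title": "", "url": "http://example.com/q/2"}

print(find_empty_fields(complete))
print(find_empty_fields(partial))
```

A pipeline could raise DropItem whenever this list is non-empty; if 0 or False are legitimate values for a field, the check would need to compare against None or "" instead.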

From the official Scrapy documentation (https://doc.scrapy.org/en/latest/topics/item-pipeline.html#write-items-to-mongodb):

pipelines.py

import pymongo

class MongoPipeline(object):

    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(dict(item))
        return item
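
This pipeline reads MONGO_URI and MONGO_DATABASE from the crawler settings, so a matching settings.py might look something like the fragment below. The module path myproject.pipelines, the order value 300 and the URI are assumptions for a local setup, not values taken from the notes above.

```python
# settings.py (sketch; adjust the module path to your own project)
ITEM_PIPELINES = {
    "myproject.pipelines.MongoPipeline": 300,  # hypothetical project name
}

# Read by MongoPipeline.from_crawler()
MONGO_URI = "mongodb://localhost:27017"  # assumes a local mongod on the default port
MONGO_DATABASE = "items"  # same as the fallback used in from_crawler()
```

If MONGO_DATABASE is omitted, from_crawler falls back to 'items', but MONGO_URI has no fallback in the code above and should be set explicitly.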