Getting Started with MongoDB

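Preface:

    Lately I have been talking with classmates a lot about non-relational databases, and today I happened to read robbin's take on the field, which got me excited.

    The rise of Web 2.0 exposed the weaknesses of relational databases and drove the development of non-relational ones.

    Web applications emphasize heavy read/write load, massive data storage, and horizontal scaling. As robbin says, the strengths of relational databases (transactional consistency and multi-table queries) are of little use to web applications.

    To handle heavy read/write load, sacrifice consistency: operate in memory and flush asynchronously to the file system;

    To handle massive data, write your own file system;

    To handle horizontal scaling, make cluster nodes pluggable.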

 

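Each non-relational database has its own strengths: MongoDB can support massive data volumes, TC/TT offers very good highly concurrent read/write performance, and Cassandra is suited to clusters (it is more like a set of network services than a database).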

 

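Personally, I think massive data combined with high concurrency for users inevitably means clustering. So even though MongoDB supports massive storage well, I do not know how well it handles clustering, which is why Cassandra is worth studying.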

   

About one of the NoSQL products: MongoDB

 

MongoDB—Kyle Banker—10gen

MongoDB is a JSON document-oriented database. Documents are stored in the database as BSON (binary JSON). BSON is efficient, fast, and richer in types than JSON (e.g. regex support). Documents are grouped in collections, which are analogous to relational tables but are schema-free.

GridFS is a specification for storing large binary files like images and videos in MongoDB. Every document has a 4MB limit. GridFS splits large files into chunks small enough to fit under that limit and stores them in one collection, with a separate collection for file metadata. MusicNation.com stores all music and video alongside the application data in MongoDB (about 1TB).
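
To make the chunking idea concrete, here is a minimal sketch in plain JavaScript (Node.js, no MongoDB connection needed). The function name splitIntoChunks, the field names, and the tiny chunk size are assumptions made for illustration; they are not taken from the GridFS spec or from any driver's API.

```javascript
// Hypothetical sketch: split a file's bytes into fixed-size chunks plus one
// metadata record, mimicking the "chunks collection + metadata collection" layout.
function splitIntoChunks(filename, data, chunkSize) {
  const chunks = [];
  for (let offset = 0, n = 0; offset < data.length; offset += chunkSize, n++) {
    // Each chunk remembers which file it belongs to and its position (n).
    chunks.push({ filename: filename, n: n, data: data.subarray(offset, offset + chunkSize) });
  }
  const metadata = {
    filename: filename,
    length: data.length,
    chunkSize: chunkSize,
    chunkCount: chunks.length
  };
  return { metadata: metadata, chunks: chunks };
}

// 10 bytes stand in for a large file; a 4-byte chunk size keeps the output readable.
const file = Buffer.from("0123456789");
const result = splitIntoChunks("song.mp3", file, 4);
console.log(result.metadata.chunkCount);                          // 3
console.log(result.chunks.map(function (c) { return c.data.toString(); })); // [ '0123', '4567', '89' ]
```

Reassembling the file is then a matter of reading the chunks back in order of n and concatenating their data.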

MongoDB has its own wire protocol with socket drivers for several languages. The drivers serialize the data to BSON before transfer.

Replication is used for failover and redundancy. Most commonly a master-slave setup is used. It's also possible to set up a replica pair architecture.

MongoDB provides a custom query language which should be as powerful as SQL. MongoDB understands the internal structures of its documents which enables dynamic queries. Map/reduce functions are also supported in the query language.

BusinessInsider.com has been using MongoDB for two years with 12M page views/month. They like the simplification of the data model. Posts, for instance, have embedded comments. They also store real-time analytics in MongoDB, which enables fast inserts and easier data analysis with dynamic queries. They use a single MongoDB database server, 3 Apache web servers, and Memcached caching only on the front page.

TweetCongress.org uses MongoDB and likes that the code defines the schema, so the schema can be version controlled. They use a single master with snapshots on a 64-bit EC2 instance.

SourceForge.net had a large redesign this summer where they moved to MongoDB. Their goal was to store the front pages, project pages, and download pages in a single document. It’s deployed with one master and 5-6 read-only slaves (obviously scaled for reads and reliability).

 

 

Download

The easiest (and recommended) way to install MongoDB is to use the pre-built binaries.

32-bit binaries

Download and extract the 32-bit .zip. The "Production" build is recommended.

64-bit binaries

Download and extract the 64-bit .zip.

Note: 64-bit is recommended, although you must have a 64-bit version of Windows to run that version.

Unzip

Unzip the downloaded binary package to the location of your choice. You may want to rename mongo-xxxxxxx to just "mongo" for convenience.

Create a data directory

By default MongoDB will store data in C:\data\db, but it won't automatically create that folder, so we do so here:

C:\> mkdir \data
C:\> mkdir \data\db

Or you can do this from the Windows Explorer, of course.

Run and connect to the server

The important binaries for a first run are:

  • mongod.exe - the database server
  • mongo.exe - the administrative shell

To run the database, click mongod.exe in Explorer, or run it from a CMD window.

C:\> cd \my_mongo_dir\bin
C:\my_mongo_dir\bin > mongod
Getting A Database Connection

Let's now try manipulating the database with the database shell. (We could perform similar operations from any programming language using an appropriate driver. The shell is convenient for interactive and administrative use.)

Start the MongoDB JavaScript shell with:

C:\my_mongo_dir\bin > mongo

Connect to a database server running locally on the default port:

mongodb://localhost

Connect and log in to the admin database as user "fred" with password "foobar":

mongodb://fred:foobar@localhost

Connect and log in to the "baz" database as user "fred" with password "foobar":

mongodb://fred:foobar@localhost/baz

Connect to a replica pair, with one server on example1.com and another server on example2.com:

mongodb://example1.com:27017,example2.com:27017

Connect to a replica set with three servers running on localhost (on ports 27017, 27018, and 27019):

mongodb://localhost,localhost:27018,localhost:27019
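
The connection strings above all follow the same shape: optional credentials, one or more host[:port] entries separated by commas, and an optional database name. As a rough illustration, the sketch below pulls such a string apart in plain JavaScript. The function parseMongoUri is a hypothetical helper written for this post, not the parser used by any MongoDB driver, and it ignores details such as query options and escaping.

```javascript
// Hypothetical helper: break a mongodb:// connection string into its parts.
// Illustrative only; real drivers handle many more cases.
function parseMongoUri(uri) {
  // Groups: 1 = user, 2 = password, 3 = comma-separated host list, 4 = database
  const match = /^mongodb:\/\/(?:([^:@\/]+):([^@\/]*)@)?([^\/]+)(?:\/(.*))?$/.exec(uri);
  if (!match) {
    throw new Error("not a mongodb:// URI: " + uri);
  }
  const hosts = match[3].split(",").map(function (entry) {
    const parts = entry.split(":");
    // 27017 is the default port when none is given.
    return { host: parts[0], port: parts[1] ? Number(parts[1]) : 27017 };
  });
  return {
    username: match[1] || null,
    password: match[2] || null,
    hosts: hosts,
    database: match[4] || null // null here means "no database named in the URI"
  };
}

console.log(parseMongoUri("mongodb://fred:foobar@localhost/baz"));
console.log(parseMongoUri("mongodb://localhost,localhost:27018,localhost:27019").hosts);
```

The second call returns three host entries on ports 27017, 27018, and 27019, matching the replica set example.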

 

 

 

"connecting to:" tells you the name of the database the shell is using. To switch databases, type:

> use mydb
switched to db mydb

To see a list of handy commands, type help.

Tip for Developers with Experience in Other Databases
You may notice, in the examples below, that we never create a database or collection. MongoDB does not require that you do so. As soon as you insert something, MongoDB creates the underlying collection and database. If you query a collection that does not exist, MongoDB treats it as an empty collection.

Switching to a database with the use command won't immediately create the database - the database is created lazily the first time data is inserted. This means that if you use a database for the first time it won't show up in the list provided by `show dbs` until data is inserted.

Each MongoDB server can support multiple databases. Each database is independent, and the data for each database is stored separately, for security and ease of management.

 

 

Using a Large Number of Collections 

 

Generally, having a large number of collections has no significant performance penalty, and results in very good performance.

Limits

By default MongoDB has a limit of approximately 24,000 namespaces per database.  Each collection counts as a namespace, as does each index.  Thus if every collection had one index, we can create up to 12,000 collections.  Use the --nssize parameter to set a higher limit.
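
The arithmetic behind the 12,000 figure can be written out as a tiny JavaScript function. This is only a back-of-the-envelope sketch: the name maxCollections is made up for this post, and 24,000 is the approximate default quoted above, not an exact constant.

```javascript
// Each collection consumes one namespace, plus one namespace per index on it.
function maxCollections(namespaceLimit, indexesPerCollection) {
  return Math.floor(namespaceLimit / (1 + indexesPerCollection));
}

console.log(maxCollections(24000, 1)); // 12000 (one index per collection)
console.log(maxCollections(24000, 3)); // 6000  (three indexes per collection)
```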

Be aware that there is a certain minimum overhead per collection -- a few KB.  Further, any index will require at least 8KB of data space as the b-tree page size is 8KB.

--nssize

If more collections are required, run mongod with the --nssize parameter specified.  This will make the <database>.ns file larger and support more collections.  Note that --nssize sets the size used for newly created .ns files -- if you have an existing database and wish to resize, after running the db with --nssize, run the db.repairDatabase() command from the shell to adjust the size.

Maximum .ns file size is 2GB.

 

 

MongoDB (BSON) Data Types

Mongo uses special data types in addition to the basic JSON types of string, integer, boolean, double, null, array, and object. These types include date, object id, binary data, regular expression, and code. Each driver implements these types in language-specific ways; see your driver's documentation for details.

 

 

GridFS


GridFS is a specification for storing large files in MongoDB. All of the officially supported drivers implement the GridFS spec.

 

 

JSON

For example, the following "document" can be stored in MongoDB:

{ author: 'joe',
  created : new Date('03/28/2009'),
  title : 'Yet another blog post',
  text : 'Here is the text...',
  tags : [ 'example', 'joe' ],
  comments : [ { author: 'jim', comment: 'I disagree' },
              { author: 'nancy', comment: 'Good post' }
  ]
}

This document is a blog post, so we can store in a "posts" collection using the shell:

> doc = { author : 'joe', created : new Date('03/28/2009'), ... }
> db.posts.insert(doc);

MongoDB understands the internals of BSON objects -- not only can it store them, it can query on internal fields and index keys based upon them.  For example the query

> db.posts.find( { "comments.author" : "jim" } )

is possible and means "find any blog post where at least one comment subobject has author == 'jim'".
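
To see what "at least one comment subobject" means in practice, here is a small in-memory sketch of dot-notation matching in plain JavaScript. The helpers valuesAtPath and matchesPath are hypothetical names invented for this illustration; this is not how MongoDB is implemented, only a model of the matching rule described above.

```javascript
// Hypothetical sketch: collect every value reachable along a dotted path,
// descending into arrays of subobjects on the way.
function valuesAtPath(doc, path) {
  let current = [doc];
  for (const key of path.split(".")) {
    const next = [];
    for (const value of current) {
      const items = Array.isArray(value) ? value : [value];
      for (const item of items) {
        if (item !== null && typeof item === "object" && key in item) {
          next.push(item[key]);
        }
      }
    }
    current = next;
  }
  return current;
}

// A document matches if any reachable value (or array element) equals expected.
function matchesPath(doc, path, expected) {
  return valuesAtPath(doc, path).some(function (v) {
    return Array.isArray(v) ? v.includes(expected) : v === expected;
  });
}

const post = {
  author: "joe",
  tags: ["example", "joe"],
  comments: [
    { author: "jim", comment: "I disagree" },
    { author: "nancy", comment: "Good post" }
  ]
};

console.log(matchesPath(post, "comments.author", "jim"));   // true
console.log(matchesPath(post, "comments.author", "alice")); // false
console.log(matchesPath(post, "tags", "example"));          // true
```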

 

  

> j = { name : "mongo" };
{"name" : "mongo"}
> t = { x : 3 };
{ "x" : 3  }
> db.things.save(j);
> db.things.save(t);
> db.things.find();
{ "_id" : ObjectId("4c2209f9f3924d31102bd84a"), "name" : "mongo" }
{ "_id" : ObjectId("4c2209fef3924d31102bd84b"), "x" : 3 }
>

 

 

 

Let's add some more records to this collection:

> for (var i = 1; i <= 20; i++) db.things.save({x : 4, j : i});
> db.things.find();
{ "_id" : ObjectId("4c2209f9f3924d31102bd84a"), "name" : "mongo" }
{ "_id" : ObjectId("4c2209fef3924d31102bd84b"), "x" : 3 }
{ "_id" : ObjectId("4c220a42f3924d31102bd856"), "x" : 4, "j" : 1 }
{ "_id" : ObjectId("4c220a42f3924d31102bd857"), "x" : 4, "j" : 2 }
{ "_id" : ObjectId("4c220a42f3924d31102bd858"), "x" : 4, "j" : 3 }
{ "_id" : ObjectId("4c220a42f3924d31102bd859"), "x" : 4, "j" : 4 }
{ "_id" : ObjectId("4c220a42f3924d31102bd85a"), "x" : 4, "j" : 5 }
{ "_id" : ObjectId("4c220a42f3924d31102bd85b"), "x" : 4, "j" : 6 }
{ "_id" : ObjectId("4c220a42f3924d31102bd85c"), "x" : 4, "j" : 7 }
{ "_id" : ObjectId("4c220a42f3924d31102bd85d"), "x" : 4, "j" : 8 }
{ "_id" : ObjectId("4c220a42f3924d31102bd85e"), "x" : 4, "j" : 9 }
{ "_id" : ObjectId("4c220a42f3924d31102bd85f"), "x" : 4, "j" : 10 }
{ "_id" : ObjectId("4c220a42f3924d31102bd860"), "x" : 4, "j" : 11 }
{ "_id" : ObjectId("4c220a42f3924d31102bd861"), "x" : 4, "j" : 12 }
{ "_id" : ObjectId("4c220a42f3924d31102bd862"), "x" : 4, "j" : 13 }
{ "_id" : ObjectId("4c220a42f3924d31102bd863"), "x" : 4, "j" : 14 }
{ "_id" : ObjectId("4c220a42f3924d31102bd864"), "x" : 4, "j" : 15 }
{ "_id" : ObjectId("4c220a42f3924d31102bd865"), "x" : 4, "j" : 16 }
{ "_id" : ObjectId("4c220a42f3924d31102bd866"), "x" : 4, "j" : 17 }
{ "_id" : ObjectId("4c220a42f3924d31102bd867"), "x" : 4, "j" : 18 }
has more

Querying


One of MongoDB's best capabilities is its support for dynamic (ad hoc) queries. Systems that support dynamic queries don't require any special indexing to find data; users can find data using any criteria. For relational databases, dynamic queries are the norm. If you're moving to MongoDB from a relational database, you'll find that many SQL queries translate easily to MongoDB's document-based query language.

Suppose we want to return every document in the users collection. Our query would look like this:
  db.users.find({})

In this case, our selector is an empty document, which matches every document in the collection. Here's a more selective example:

  db.users.find({'last_name': 'Smith'})

Here our selector will match every document where the last_name attribute is 'Smith.'

Field Selection

In addition to the query expression, MongoDB queries can take some additional arguments. For example, it's possible to request only certain fields be returned. If we just wanted the social security numbers of users with the last name of 'Smith,' then from the shell we could issue this query:

  // retrieve ssn field for documents where last_name == 'Smith':
  db.users.find({last_name: 'Smith'}, {'ssn': 1});

  // retrieve all fields *except* the thumbnail field, for all documents:
  db.users.find({}, {thumbnail:0});

Note the _id field is always returned even when not explicitly requested.

Sorting

MongoDB queries can return sorted results. To return all documents and sort by last name in ascending order, we'd query like so:

  db.users.find({}).sort({last_name: 1});

Skip and Limit

MongoDB also supports skip and limit for easy paging. Here we skip the first 20 last names, and limit our result set to 10:

db.users.find().skip(20).limit(10);
db.users.find({}, {}, 10, 20); // same as above, but less clear
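
For paging, the skip value is usually derived from a page number. The sketch below shows that calculation in plain JavaScript, with an array standing in for the collection so it runs without a server. Both helper names (pageToSkipLimit, applySkipLimit) and the sample data are assumptions made for illustration.

```javascript
// Hypothetical helper: translate a 1-based page number into skip/limit values.
function pageToSkipLimit(page, pageSize) {
  return { skip: (page - 1) * pageSize, limit: pageSize };
}

// In-memory stand-in for cursor.skip(n).limit(m) over an array of documents.
function applySkipLimit(docs, skip, limit) {
  return docs.slice(skip, skip + limit);
}

// 35 fake user documents: name1 ... name35
const users = Array.from({ length: 35 }, function (_, i) {
  return { last_name: "name" + (i + 1) };
});

const page3 = pageToSkipLimit(3, 10);
const rows = applySkipLimit(users, page3.skip, page3.limit);
console.log(page3);             // { skip: 20, limit: 10 }
console.log(rows.length);       // 10
console.log(rows[0].last_name); // name21
```

Page 3 with 10 results per page gives skip 20 and limit 10, the same values as in the shell example.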

slaveOk

When querying a replica pair or replica set, drivers route their requests to the master mongod by default; to perform a query against an (arbitrarily-selected) slave, the query can be run with the slaveOk option. Here's how to do so in the shell:

db.getMongo().setSlaveOk(); // enable querying a slave
db.users.find(...)

Note: some language drivers permit specifying the slaveOk option on each find(), others make this a connection-wide setting. See your language's driver for details.

Cursors

Database queries, performed with the find() method, technically work by returning a cursor. Cursors are then used to iteratively retrieve all the documents returned by the query. For example, we can iterate over a cursor in the mongo shell like this:

> var cur = db.example.find();
> cur.forEach( function(x) { print(tojson(x))});
{"n" : 1 , "_id" : "497ce96f395f2f052a494fd4"}
{"n" : 2 , "_id" : "497ce971395f2f052a494fd5"}
{"n" : 3 , "_id" : "497ce973395f2f052a494fd6"}
>

Removing Objects from a Collection

To remove objects from a collection, use the remove() function in the mongo shell. (Other drivers offer a similar function, but may call it "delete". Please check your driver's documentation.)

remove() is like find() in that it takes a JSON-style query document as an argument to select which documents are removed. If you call remove() without a document argument, or with an empty document {}, it will remove all documents in the collection. Some examples:

db.things.remove({});    // removes all
db.things.remove({n:1}); // removes all where n == 1

If you have a document in memory and wish to delete it, the most efficient method is to specify the document's _id value as the criterion:

db.things.remove({_id: myobject._id});

You may be tempted to simply pass the document you wish to delete as the selector, and this will work, but it's inefficient.

SELECT * FROM things WHERE name="mongo"
> db.things.find({name:"mongo"}).forEach(printjson);
{ "_id" : ObjectId("4c2209f9f3924d31102bd84a"), "name" : "mongo" }
SELECT * FROM things WHERE x=4
> db.things.find({x:4}).forEach(printjson);
{ "_id" : ObjectId("4c220a42f3924d31102bd856"), "x" : 4, "j" : 1 }
{ "_id" : ObjectId("4c220a42f3924d31102bd857"), "x" : 4, "j" : 2 }
{ "_id" : ObjectId("4c220a42f3924d31102bd858"), "x" : 4, "j" : 3 }
{ "_id" : ObjectId("4c220a42f3924d31102bd859"), "x" : 4, "j" : 4 }
{ "_id" : ObjectId("4c220a42f3924d31102bd85a"), "x" : 4, "j" : 5 }
{ "_id" : ObjectId("4c220a42f3924d31102bd85b"), "x" : 4, "j" : 6 }
{ "_id" : ObjectId("4c220a42f3924d31102bd85c"), "x" : 4, "j" : 7 }
{ "_id" : ObjectId("4c220a42f3924d31102bd85d"), "x" : 4, "j" : 8 }
{ "_id" : ObjectId("4c220a42f3924d31102bd85e"), "x" : 4, "j" : 9 }
{ "_id" : ObjectId("4c220a42f3924d31102bd85f"), "x" : 4, "j" : 10 }
{ "_id" : ObjectId("4c220a42f3924d31102bd860"), "x" : 4, "j" : 11 }
{ "_id" : ObjectId("4c220a42f3924d31102bd861"), "x" : 4, "j" : 12 }
{ "_id" : ObjectId("4c220a42f3924d31102bd862"), "x" : 4, "j" : 13 }
{ "_id" : ObjectId("4c220a42f3924d31102bd863"), "x" : 4, "j" : 14 }
{ "_id" : ObjectId("4c220a42f3924d31102bd864"), "x" : 4, "j" : 15 }
{ "_id" : ObjectId("4c220a42f3924d31102bd865"), "x" : 4, "j" : 16 }
{ "_id" : ObjectId("4c220a42f3924d31102bd866"), "x" : 4, "j" : 17 }
{ "_id" : ObjectId("4c220a42f3924d31102bd867"), "x" : 4, "j" : 18 }
{ "_id" : ObjectId("4c220a42f3924d31102bd868"), "x" : 4, "j" : 19 }
{ "_id" : ObjectId("4c220a42f3924d31102bd869"), "x" : 4, "j" : 20 }

The query expression is a document itself. A query document of the form { a:A, b:B, ... } means "where a==A and b==B and ...". More information on query capabilities may be found in the Queries and Cursors section of the Mongo Developers' Guide.
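
The "a==A and b==B" reading of a query document can be modelled in a few lines of plain JavaScript. This is a simplified sketch: the function matchesQuery is hypothetical, it only handles exact equality on top-level fields, and the sample array stands in for the things collection.

```javascript
// Hypothetical sketch: a document matches when every key in the query
// equals the corresponding field in the document (logical AND).
function matchesQuery(doc, query) {
  return Object.keys(query).every(function (key) {
    return doc[key] === query[key];
  });
}

const things = [
  { name: "mongo" },
  { x: 3 },
  { x: 4, j: 1 },
  { x: 4, j: 2 }
];

function find(query) {
  return things.filter(function (doc) { return matchesQuery(doc, query); });
}

console.log(find({ x: 4 }).length);  // 2
console.log(find({ x: 4, j: 2 }));   // [ { x: 4, j: 2 } ]
console.log(find({}).length);        // 4, an empty selector matches everything
```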

 

 

To illustrate, let's repeat the last example find({x:4}) with an additional argument that limits the returned documents to just the "j" elements:

SELECT j FROM things WHERE x=4
> db.things.find({x:4}, {j:true}).forEach(printjson);
{ "_id" : ObjectId("4c220a42f3924d31102bd856"), "j" : 1 }
{ "_id" : ObjectId("4c220a42f3924d31102bd857"), "j" : 2 }
{ "_id" : ObjectId("4c220a42f3924d31102bd858"), "j" : 3 }
{ "_id" : ObjectId("4c220a42f3924d31102bd859"), "j" : 4 }
{ "_id" : ObjectId("4c220a42f3924d31102bd85a"), "j" : 5 }
{ "_id" : ObjectId("4c220a42f3924d31102bd85b"), "j" : 6 }
{ "_id" : ObjectId("4c220a42f3924d31102bd85c"), "j" : 7 }
{ "_id" : ObjectId("4c220a42f3924d31102bd85d"), "j" : 8 }
{ "_id" : ObjectId("4c220a42f3924d31102bd85e"), "j" : 9 }
{ "_id" : ObjectId("4c220a42f3924d31102bd85f"), "j" : 10 }
{ "_id" : ObjectId("4c220a42f3924d31102bd860"), "j" : 11 }
{ "_id" : ObjectId("4c220a42f3924d31102bd861"), "j" : 12 }
{ "_id" : ObjectId("4c220a42f3924d31102bd862"), "j" : 13 }
{ "_id" : ObjectId("4c220a42f3924d31102bd863"), "j" : 14 }
{ "_id" : ObjectId("4c220a42f3924d31102bd864"), "j" : 15 }
{ "_id" : ObjectId("4c220a42f3924d31102bd865"), "j" : 16 }
{ "_id" : ObjectId("4c220a42f3924d31102bd866"), "j" : 17 }
{ "_id" : ObjectId("4c220a42f3924d31102bd867"), "j" : 18 }
{ "_id" : ObjectId("4c220a42f3924d31102bd868"), "j" : 19 }
{ "_id" : ObjectId("4c220a42f3924d31102bd869"), "j" : 20 }

Note that the "_id" field is always returned.

 

However, the findOne() method is both convenient and efficient:

> printjson(db.things.findOne({name:"mongo"}));
{ "_id" : ObjectId("4c2209f9f3924d31102bd84a"), "name" : "mongo" }

This is more efficient because the client requests a single object from the database, so less work is done by the database and the network. This is the equivalent of find({name:"mongo"}).limit(1).

 

 

Limiting the size of a result set with the limit() method is highly recommended for performance reasons, as it limits the work the database does and the amount of data returned over the network. For example:

> db.things.find().limit(3);
{ "_id" : ObjectId("4c2209f9f3924d31102bd84a"), "name" : "mongo" }
{ "_id" : ObjectId("4c2209fef3924d31102bd84b"), "x" : 3 }
{ "_id" : ObjectId("4c220a42f3924d31102bd856"), "x" : 4, "j" : 1 }