Tuesday, October 25, 2011

Getting Started with MongoDB and Python

I'm going to go over some basic steps to get a feel for MongoDB. I will be using Python with the PyMongo module to interact with MongoDB.  I've gone over the installation of these two in a previous post. Anyway, let's get to it.


MongoDB (from "humongous") is one of the so-called NoSQL databases.  It is document-based, schema-free, has no joins, and it supports indexing and ad hoc queries.  For those of us who are used to relational databases (RDBMS), document-based data stores are relatively easy to understand and work with when making the transition into the NoSQL world.
Let's go over some concepts. MongoDB has the same "database" (or "schema") concept as an RDBMS. A database can have zero or more collections (a collection is similar to a table in an RDBMS), a collection can have zero or more documents (a document is similar to a row), and a document can have zero or more key-value pair fields (fields are similar to columns).  One of the characteristics of document-based data stores is that they are schema-less: in practice, they are not strict about what goes in a collection. A collection can hold documents that are completely different from each other, and that is all fine.
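To make that hierarchy concrete, here is a rough sketch in plain Python that uses nested dicts and lists as a stand-in for a database, its collections, and their documents. Nothing here talks to MongoDB, and the `tags` field is made up purely for illustration.

```python
# Plain-Python stand-in for: database -> collections -> documents -> fields.
# This only illustrates the data model; it is not how MongoDB stores data.
newsdb = {                      # a "database"
    "articles": [               # a "collection": a list of documents
        {"title": "some title", "desc": "some desc", "author": "jane"},
        # A second document with different fields; nothing enforces a schema.
        {"title": "short title", "tags": ["mongodb", "python"]},
    ],
}

# Collect the field names used by each document in the collection.
fields_per_doc = [sorted(doc.keys()) for doc in newsdb["articles"]]
for fields in fields_per_doc:
    print(fields)
```

The two documents live side by side in the same collection even though they only share the "title" field.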
MongoDB uses BSON (not exclusive to MongoDB) as the data storage and network transfer format for documents. BSON documents are very much like JSON, with extra support for data types that JSON lacks, such as dates, timestamps, binary data, and distinct integer and float types. We can query for these documents either by using the native MongoDB queries or by crafting more advanced queries using MapReduce.
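A quick way to see why the extra types matter, using only Python's standard `json` module (no BSON library involved): plain JSON has no date type, so a `datetime` value has to be flattened to a string. The sample article is just for illustration.

```python
import datetime
import json

article = {"title": "short title", "date": datetime.datetime(2011, 10, 25, 12, 0, 0)}

# Plain JSON has no date type, so the standard encoder rejects datetime values.
try:
    json.dumps(article)
    json_accepts_datetime = True
except TypeError:
    json_accepts_datetime = False

# A common JSON workaround is to fall back to an ISO 8601 string,
# which loses the type information that a typed format like BSON keeps.
as_json = json.dumps(article, default=lambda value: value.isoformat(), sort_keys=True)
print(json_accepts_datetime)
print(as_json)
```

With BSON the date goes in as a date and comes back as one, so there is no need for this kind of string round trip.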
Let's summarize and look at some other key features of MongoDB (based on the main site, mongodb_home)
  • A NoSQL data store
    • document based.
  • BSON is a binary-encoded serialization of JSON-like documents
    • Language-independent data interchange format, not exclusive to MongoDB
    • Supports
      • Boolean, integer, float, date, string and binary types.
  • Protocol: programming language specific
  • Document-based query language
    • can leverage defined indexes.
  • GridFS
    • GridFS is a storage specification for large objects.
      • Video, images, etc.
  • Supports indexing much like relational databases
    • including secondary and complex indexes.
    • Indexes are implemented as B-tree indexes.
    • Indexes are used by Mongo's query optimizer to quickly sort through and order the documents in a collection.
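To give a feel for why an index helps, here is a toy sketch. MongoDB's indexes are B-trees; this uses a sorted list with binary search instead, a much simpler structure chosen only because it shows the same idea of keeping keys ordered so a lookup does not have to scan every document. The `find_by_author` helper and the sample data are hypothetical.

```python
import bisect

documents = [
    {"_id": 1, "title": "some title", "author": "jane"},
    {"_id": 2, "title": "short title", "author": "abdel"},
    {"_id": 3, "title": "petite title", "author": "abdel"},
]

# Toy index on "author": a sorted list of (key, position) pairs.
author_index = sorted((doc["author"], pos) for pos, doc in enumerate(documents))

def find_by_author(author):
    """Return the documents whose author equals `author`, via binary search."""
    # (author,) sorts just before every (author, position) pair.
    start = bisect.bisect_left(author_index, (author,))
    matches = []
    for key, pos in author_index[start:]:
        if key != author:
            break
        matches.append(documents[pos])
    return matches

print([doc["_id"] for doc in find_by_author("abdel")])
```

Because the index is kept in key order, it can also hand back documents already sorted by author, which is the "sort through and order" part mentioned above.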
OK, enough terminology, let's write some code :)  As I said before, I'll be using Python with PyMongo.
1.  I'll start my MongoDB server.
2.  I'll open my Python interactive shell.
3.  I'll create a connection to MongoDB.  I've explicitly "selected" a database ("newsdb") and a collection ("articles"); this works fine even if they don't yet exist in MongoDB. Whenever we try to insert a document, Mongo checks whether these two exist: if they do, it uses them; if they don't, it creates them.
>>>  import pymongo
>>>  from pymongo import Connection
>>>  connection = Connection('localhost', 27017) 
>>>  db = connection.newsdb
>>>  articles = db.articles


4.  Let's create an article
>>>  article = {"title": "some title", "desc": "some desc", "author": "jane"}
5.  Insert the document - notice the auto-generated _id (an ObjectId) that is returned
>>>  articles.insert(article)
ObjectId('4ea75b857041ef105c000000')


6.  The collection was created automatically
>>>  db.collection_names()
[u'articles', u'system.indexes']


7.  Check for the newly created document (notice that the query criteria is itself just a JSON-like document)
>>>  articles.find_one({"title": "some title"})
{u'title': u'some title', u'_id': ObjectId('4ea75b857041ef105c000000'), 
  u'author': u'jane', u'desc': u'some desc'}
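The criteria passed to find_one is a document of field/value pairs that a matching document must contain. Here is a minimal pure-Python sketch of that equality matching, assuming only top-level fields and exact matches (the real query language also supports operators, nested fields, and more). The `matches` and `find_one` functions below are hypothetical helpers written for this post, not PyMongo functions.

```python
def matches(document, criteria):
    """True if every key/value pair in `criteria` appears in `document`."""
    return all(key in document and document[key] == value
               for key, value in criteria.items())

def find_one(collection, criteria):
    """Return the first matching document in a list of dicts, or None."""
    for document in collection:
        if matches(document, criteria):
            return document
    return None

collection = [
    {"title": "some title", "desc": "some desc", "author": "jane"},
    {"title": "short title", "desc": "a short desc", "author": "abdel"},
]

found = find_one(collection, {"title": "some title"})
print(found["author"])
```

An empty criteria document matches everything, so in this sketch `find_one(collection, {})` simply returns the first document.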


8.  Let's create a new document with a different schema
>>>  import datetime
>>>  article = {"title": "short title", "desc": "a short desc", "author": "abdel", 
                     "date" :datetime.datetime.utcnow()}
>>> articles.insert(article)
ObjectId('4ea765ac7041ef105c000001')


9.  We can have embedded docs
>>>  article = {"title": "petite title", "desc": "", "author": "abdel", 
                        "comments": [{"user": "mino", "comment": "I agree"}]}
>>>  articles.insert(article)
ObjectId('4ea766677041ef105c000002')


10.  We can insert articles in bulk by passing a list of documents
>>>  li_articles = [{"title": "another title ", "desc": "another desc"}, 
                            {"title": "yet another title", "desc": "yet another desc",  "author": "jane"}]
>>>  articles.insert(li_articles)


11.  Let's iterate over all our documents
>>>  for article in articles.find():  article
12.  Get the count
>>>  articles.count()
8
13.  Let's find the count of docs that match a specific query
>>>  articles.find({"author": "abdel"}).count()
2
14.  Let's update a document
>>>  articles.update({"title":"some title"}, {"$set": {"desc": "updated some desc"}})
>>>  articles.find_one({"title":"some title"})
{u'title': u'some title', u'desc': u'updated some desc', 
u'_id': ObjectId('4ea75b857041ef105c000000'), u'author': u'jane'}


15.  Let's delete/remove a document
>>>  articles.remove({"title":"some title"})
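To see what the `$set` update and the remove do to the stored documents, here is a plain-Python sketch over a list of dicts. It assumes simple top-level equality criteria, an update that touches only the first match, and a remove that drops every match. The helper names `update_set` and `remove` are made up for this illustration and are not the PyMongo calls themselves.

```python
def matches(document, criteria):
    """True if every key/value pair in `criteria` appears in `document`."""
    return all(key in document and document[key] == value
               for key, value in criteria.items())

def update_set(collection, criteria, new_values):
    """Mimic a `$set` update on the first match: only the named fields change."""
    for document in collection:
        if matches(document, criteria):
            document.update(new_values)
            return 1
    return 0

def remove(collection, criteria):
    """Drop every matching document in place; return how many were removed."""
    before = len(collection)
    collection[:] = [doc for doc in collection if not matches(doc, criteria)]
    return before - len(collection)

collection = [
    {"title": "some title", "desc": "some desc", "author": "jane"},
    {"title": "short title", "desc": "a short desc", "author": "abdel"},
]

update_set(collection, {"title": "some title"}, {"desc": "updated some desc"})
updated = dict(collection[0])   # snapshot before the document is removed
print(updated)

removed = remove(collection, {"title": "some title"})
print(removed, len(collection))
```

The point of `$set` is that the other fields ("title", "author") are left alone; only "desc" is overwritten.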


I've just scratched the surface; for more information, check out the MongoDB documentation. There we can find a nice description of Mongo's queries and how they compare to SQL queries. It also has a good explanation of how and when to use more advanced queries using MapReduce, which is useful for processing batches of data and for data aggregation operations.
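As a taste of the MapReduce idea, here is a sketch in plain Python that counts articles per author. In MongoDB the map and reduce functions are written in JavaScript and run on the server; this only mimics the pattern in memory, and the function names and sample data are my own.

```python
from collections import defaultdict

articles = [
    {"title": "some title", "author": "jane"},
    {"title": "short title", "author": "abdel"},
    {"title": "petite title", "author": "abdel"},
]

def map_article(article):
    """Map step: emit (key, value) pairs, here one count per author."""
    yield article.get("author", "unknown"), 1

def reduce_counts(key, values):
    """Reduce step: combine all the values emitted for one key."""
    return sum(values)

# Group the emitted values by key, then reduce each group.
grouped = defaultdict(list)
for article in articles:
    for key, value in map_article(article):
        grouped[key].append(value)

counts = {key: reduce_counts(key, values) for key, values in grouped.items()}
print(sorted(counts.items()))
```

The same two-step shape (emit per document, then combine per key) is what makes this style a good fit for batch aggregation.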