Revulytics Blog

MongoDB Performance Pitfalls - Behind The Scenes

November 5, 2012


At Trackerbird, we have been using MongoDB to store and process real-time software analytics data for over 18 months. During this period, we have learnt a lot about the potential pitfalls of MongoDB performance. If you follow blog articles about big data and read other people's experiences on the topic, you might be led to believe that moving from an RDBMS to a NoSQL database will solve ALL your performance issues: your database will magically start running tens of times faster, and you'll suddenly be able to build real-time reports from raw data that you could only dream about with traditional methods. Well, not so fast...

Migrating from RDBMS to Document Databases

The first thing you need to understand when choosing MongoDB or any other document-oriented database over a traditional RDBMS is that they are totally different beasts when it comes to how data is stored and accessed. To start with, you have to forget everything you know about SQL, as it is simply not used in MongoDB. You probably knew that already, given that such database systems are sometimes referred to as NoSQL. Secondly, forget about normalizing data and joining the tables later on. In MongoDB there are no JOINs, so you need to start thinking about documents and collections of documents instead of tables, records and fields. Thirdly, queries are only as efficient as you make them. Don't expect the DBMS to optimize the query for you. You have to think about the most efficient way to build your queries... sometimes to the extent of specifying which index to use!
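To make the shift from tables to documents concrete, here is a minimal sketch in plain Python. The field names and data are made up for illustration and are not our actual schema. It shows the same information once as two "tables" that need a join, and once as a single self-contained document.

```python
# Hypothetical data for illustration; field names are assumptions, not a real schema.

# Relational style: two "tables" that must be joined at query time.
users = [{"user_id": 1, "name": "Alice"}]
installs = [
    {"user_id": 1, "product": "editor", "version": "2.1"},
    {"user_id": 1, "product": "viewer", "version": "1.0"},
]


def join_installs(users, installs):
    """Emulate a JOIN: attach each user's installs by matching user_id."""
    by_user = {}
    for row in installs:
        by_user.setdefault(row["user_id"], []).append(
            {"product": row["product"], "version": row["version"]}
        )
    return [dict(u, installs=by_user.get(u["user_id"], [])) for u in users]


# Document style: the same information stored as one embedded document,
# so reading a user and their installs needs no join at all.
user_document = {
    "user_id": 1,
    "name": "Alice",
    "installs": [
        {"product": "editor", "version": "2.1"},
        {"product": "viewer", "version": "1.0"},
    ],
}

joined = join_installs(users, installs)
```

The trade-off is that the embedded form duplicates data you would normally normalize away, so whether it pays off depends on how you read the data.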

Memory Consumption

If you were thinking traditional databases are memory hogs, think again, because you are about to enter a new world where no amount of RAM is ever enough! One thing that makes MongoDB the performer that it is, is the fact that it tries to put everything in RAM. It does this by memory-mapping all the data and then leaving the task of managing the memory to the OS. The idea behind this is that a huge amount of effort has gone into the memory managers of operating systems, and that amount of effort cannot easily be replicated when developing a DBMS. That argument does have its merits. However, the fact that MongoDB tries to put all its data in RAM means that you'll be needing lots of it! At the very least, you need enough RAM to hold all the database indexes, because if you need to hit the hard drive just to read the index, IO bottlenecks will simply trash your speed expectations. This is explained in further detail in the MongoDB manual.
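As a rough sanity check, you can compare the total size of your indexes against the RAM you have available. The sketch below is only an illustration: the index names and sizes are invented, and the 20% headroom reserved for the OS and other processes is an assumption you should tune. In practice the sizes would come from something like the per-index figures reported by the collection statistics.

```python
GIB = 1024 ** 3


def indexes_fit_in_ram(index_sizes_bytes, ram_bytes, headroom=0.2):
    """Return True if the total index size fits in RAM after reserving headroom.

    headroom is the fraction of RAM kept free for the OS and other
    processes (an assumed default, not a MongoDB recommendation).
    """
    usable = ram_bytes * (1.0 - headroom)
    return sum(index_sizes_bytes.values()) <= usable


# Invented index sizes for a hypothetical "events" collection (15 GiB total).
index_sizes = {
    "events._id_": 6 * GIB,
    "events.product_1_ts_1": 9 * GIB,
}

fits_16 = indexes_fit_in_ram(index_sizes, 16 * GIB)  # 12.8 GiB usable
fits_32 = indexes_fit_in_ram(index_sizes, 32 * GIB)  # 25.6 GiB usable
```

With these made-up numbers the indexes do not fit on a 16 GiB machine but do fit on a 32 GiB one. Remember that this only covers the indexes; the data you actually touch needs room too.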

JavaScript Engine vs. Aggregation Framework vs. Other Methods

Earlier I mentioned that you must forget everything you know about SQL, JOINs, and other querying methods you are used to on relational databases. That doesn't mean you cannot do advanced queries, but these come at a cost. MongoDB can run very complex queries that are built in JavaScript. This allows a lot of flexibility, as you can do a lot of processing on the data while it's being queried. Speed, however, is a major issue here. Every record that is processed through the query (and record counts can reach millions) has to pass through a JavaScript engine... and that takes time! Now it's true that such engines are improving, but there are simply too many overheads that can never be avoided. At one point, I myself built a very complex query generator that we used internally to generate complex JavaScript queries from a much simpler query language. The idea was to speed up our development process for new reports. Eventually, however, this had to be discarded as we simply couldn't optimize it enough. Then, with MongoDB version 2.2, came the Aggregation Framework. This is a great improvement over the JavaScript engine when it comes to performance and allows complex queries to run much faster. It is not as flexible as JavaScript, but the speed difference is huge.
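To give an idea of what an Aggregation Framework query looks like, here is a small sketch. The event documents and field names are hypothetical. The pipeline is written as plain Python data, in the form a driver such as PyMongo would pass to the collection's aggregate method, and the function next to it is a pure-Python reference for what the pipeline computes, so the example runs without a database.

```python
from collections import Counter

# Hypothetical event documents; field names are assumptions.
events = [
    {"product": "editor", "type": "launch"},
    {"product": "editor", "type": "launch"},
    {"product": "viewer", "type": "launch"},
    {"product": "viewer", "type": "crash"},
]

# Pipeline: keep only launch events, then count them per product.
pipeline = [
    {"$match": {"type": "launch"}},
    {"$group": {"_id": "$product", "launches": {"$sum": 1}}},
]


def launches_per_product(docs):
    """Pure-Python reference for what the pipeline above computes."""
    return dict(Counter(d["product"] for d in docs if d["type"] == "launch"))


result = launches_per_product(events)
```

The point of the pipeline form is that the filtering and grouping stages run natively inside the server instead of passing every record through a JavaScript engine.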

However, there is a third method which might sound strange, but which can be the fastest in some cases: forget about complex queries, and instead run thousands of basic queries and process the results on the client! Note that this will not always work well, and it depends heavily on your use case. It does have its uses though, and in some cases it can beat the Aggregation Framework hands down.
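The pattern looks roughly like the sketch below. Everything here is a stand-in: FAKE_DB replaces the real database, and fetch_day is a hypothetical helper that in real use would issue one simple, index-backed query per call. It is meant to show the shape of the approach, not our actual reporting code.

```python
from collections import defaultdict

# Stand-in for the database; keys are days, values are raw documents.
FAKE_DB = {
    "2012-11-01": [
        {"product": "editor", "sessions": 3},
        {"product": "viewer", "sessions": 1},
    ],
    "2012-11-02": [{"product": "editor", "sessions": 2}],
}


def fetch_day(day):
    """Hypothetical basic query: return the raw documents for one day."""
    return FAKE_DB.get(day, [])


def sessions_per_product(days):
    """Run one basic query per day and aggregate the results on the client."""
    totals = defaultdict(int)
    for day in days:
        for doc in fetch_day(day):
            totals[doc["product"]] += doc["sessions"]
    return dict(totals)


totals = sessions_per_product(["2012-11-01", "2012-11-02", "2012-11-03"])
```

Whether this beats a single server-side query depends on how cheap each basic query is and how much data has to cross the wire, so measure it against your own workload.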

Finding the Bottleneck

To generate some complex reports, we needed to transfer large amounts of data from the database to the client for further processing. Our client (which is actually the backend that takes care of querying and report generation) is developed mainly in Python. What we found was that Python was wasting huge amounts of time getting data from MongoDB. This time was not being consumed by MongoDB itself. Instead, it was the MongoDB connector in Python, which was encapsulating the data it received into easy-to-use dictionaries. Such a data representation is very convenient, but we needed speed above everything else. After exploring several options, we ended up using an alternative connector called Monary, which relies on NumPy arrays instead of encapsulating the data into dictionaries. Now THAT was fast - in some cases reaching 50x faster data transfer rates than the official connector! This issue of slow connectors does not apply only to MongoDB. We also found that even with traditional databases such as MySQL, when transferring large amounts of data, we saw a 7x speed improvement when accessing the database through a low-level API instead of the more user-friendly cursor method.
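The difference between the two representations can be illustrated without any database at all. The sketch below is not Monary's API, and it uses the standard library's array module in place of NumPy to stay dependency-free. The field names are made up. It simply contrasts one dictionary per record with one typed array per field.

```python
from array import array

# Row-oriented: one dictionary per record, the convenient representation.
rows = [{"ts": i, "value": i * 0.5} for i in range(1000)]


def to_columns(rows):
    """Convert rows into one typed array per field (columnar layout)."""
    return {
        "ts": array("q", (r["ts"] for r in rows)),        # signed 64-bit ints
        "value": array("d", (r["value"] for r in rows)),  # 64-bit floats
    }


columns = to_columns(rows)

# The same aggregate computed from both layouts.
total_from_rows = sum(r["value"] for r in rows)
total_from_columns = sum(columns["value"])
```

In the columnar layout each field is a single contiguous block of fixed-size values, so there is no per-record dictionary to build. A connector that fills such arrays directly skips that per-record work entirely, which is consistent with where we saw the time going.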

Finally, my last word is not to jump to conclusions too fast. Don't just believe what you read in a blog. Don't even believe my posts! If you're looking into improving your database performance, you need to get your hands dirty. Get yourself a test server or at least a few virtual machines, install the DBMSs you're interested in, and do the benchmarks yourself. Also, when performing benchmarks, try to emulate your use case scenarios as closely as possible, because the way you need to access your data can make a huge difference - sometimes enough to send you back to rethinking the whole strategy.
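If you want a starting point for your own measurements, a tiny harness like the one below is enough to compare alternatives side by side. It is only a sketch: the two workloads are placeholders, and you would swap in functions that run your real queries against your real data.

```python
import timeit


def benchmark(candidates, repeat=3, number=10):
    """Time each zero-argument callable; return best seconds per call."""
    results = {}
    for name, func in candidates.items():
        timings = timeit.repeat(func, repeat=repeat, number=number)
        results[name] = min(timings) / number
    return results


def manual_sum(values):
    """Placeholder workload: sum the values with an explicit loop."""
    total = 0
    for v in values:
        total += v
    return total


# Placeholder data and workloads; replace with your own access patterns.
data = list(range(10000))
candidates = {
    "builtin_sum": lambda: sum(data),
    "manual_loop": lambda: manual_sum(data),
}

results = benchmark(candidates)
for name, seconds in sorted(results.items(), key=lambda item: item[1]):
    print("%s: %.6f s per call" % (name, seconds))
```

Taking the best of several repeats reduces noise from whatever else the machine is doing, but it does not replace testing on hardware and data volumes close to production.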

Post written by Clifford Farrugia