MongoDB is a great document database for records that do not have a fixed number or structure of fields, yet can still be separated into individual documents.
We use it to store an intermediate layer of data (a little cleaner than the raw data, but not yet production-ready) for model training. That raises the question of how to load data from MongoDB collections into a pandas.DataFrame.
Suppose you are using pymongo to read:
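A minimal sketch of that pattern (the connection string and database/collection names are placeholders, not from the original post):

```python
import pandas as pd
# from pymongo import MongoClient  # pip install pymongo

def collection_to_df(collection):
    # collection.find() returns a cursor (an iterator of dicts);
    # list() pulls every document into memory before building the frame.
    return pd.DataFrame(list(collection.find()))

# Typical usage (placeholder connection details):
# client = MongoClient("mongodb://localhost:27017")
# df = collection_to_df(client["mydb"]["intermediate"])
```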
This is quick and easy for a small collection, but it causes problems when the collection is big: list() reads every document from the cursor (an iterator of dictionaries) into memory at once. A slight modification helps keep memory consumption low, trading off a bit of performance:
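One way to sketch that modification is to drain the cursor in fixed-size chunks, converting each chunk to a compact columnar frame before reading the next, so the heavyweight list of Python dicts never exists all at once (the function name and chunk size are my own, not from the post):

```python
import itertools

import pandas as pd

def collection_to_df_chunked(collection, chunk_size=10_000):
    # Pull chunk_size documents at a time off the cursor, convert each
    # chunk to a DataFrame, and concatenate at the end. Peak memory is
    # lower because dicts are far bulkier than columnar storage.
    cursor = collection.find()
    frames = []
    while True:
        chunk = list(itertools.islice(cursor, chunk_size))
        if not chunk:
            break
        frames.append(pd.DataFrame(chunk))
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```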
If you happen to need to generate multiple DataFrame rows for each MongoDB document (with optional progress-bar support via tqdm):
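A sketch of that approach, assuming a hypothetical schema where each document carries an "items" list that should become one row per entry (the field names are illustrative only):

```python
import pandas as pd

try:
    from tqdm import tqdm  # optional progress bar
except ImportError:  # degrade gracefully if tqdm is not installed
    def tqdm(iterable, **kwargs):
        return iterable

def rows_from_documents(collection):
    # Yield one flat row per entry of each document's "items" list.
    # "_id" and "items" are assumptions about the schema; adapt as needed.
    for doc in tqdm(collection.find()):
        for item in doc.get("items", []):
            yield {"doc_id": doc["_id"], **item}

# Materialise the rows and build the frame:
# df = pd.DataFrame(rows_from_documents(collection))
```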
In fact, pandas.DataFrame accepts a generator (not just a plain iterator) as input; however, its memory management in that case seems to be poor.
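For illustration, passing a generator of row dicts straight to the constructor does work; note that pandas still consumes the whole generator internally, so this does not stream rows lazily:

```python
import pandas as pd

def row_gen():
    # A generator of row dicts; pandas materialises it internally,
    # so the full result still lands in memory at once.
    for i in range(3):
        yield {"x": i, "y": i * i}

df = pd.DataFrame(row_gen())
```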
Alternatively, if you are just dumping a whole collection, you could try odo's MongoDB backend. I haven't really tried it yet; I shall do a comparison in a future update.