Load data from MongoDB to Pandas DataFrame

MongoDB is a great document database for records that are individually separable but have no fixed number or structure of fields.

We use it to store our intermediate level of data (a little cleaner than the raw data but not yet production ready) for model training. So the question arises: how do we load data from MongoDB collections into a pandas.DataFrame?

Suppose you are using pymongo to read:

```python
import pymongo
import pandas as pd

with pymongo.MongoClient('mongodb://user:password@localhost:port/mydb') as client:
    collection = client.get_default_database()['my collection']
    data = pd.DataFrame(list(collection.find()))
```

This is quick and easy for a small collection, but it creates problems when the collection is big: the `list` call reads all data from the cursor (an iterator of dictionaries) into memory at once. A slight modification keeps memory consumption low, trading off a bit of performance:

```python
def iterator2dataframes(iterator, chunk_size: int):
    """Turn an iterator into multiple small pandas.DataFrame

    This is a balance between memory and efficiency.
    """
    records = []
    frames = []
    for i, record in enumerate(iterator):
        records.append(record)
        if i % chunk_size == chunk_size - 1:
            frames.append(pd.DataFrame(records))
            records = []
    if records:
        frames.append(pd.DataFrame(records))
    return pd.concat(frames) if frames else pd.DataFrame()

data = iterator2dataframes(collection.find(), 10000)
```
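The chunking helper works on any iterator, not just a Mongo cursor, so you can sanity-check it with plain dictionaries (the sample records below are made up for illustration; the function is repeated so the snippet runs standalone):

```python
import pandas as pd

def iterator2dataframes(iterator, chunk_size: int):
    """Chunk an iterator of records into small DataFrames, then concat."""
    records = []
    frames = []
    for i, record in enumerate(iterator):
        records.append(record)
        if i % chunk_size == chunk_size - 1:
            frames.append(pd.DataFrame(records))
            records = []
    if records:
        frames.append(pd.DataFrame(records))
    return pd.concat(frames) if frames else pd.DataFrame()

# 5 fake records with chunk_size 2 -> internal chunks of 2, 2, and 1 rows
docs = ({'x': i, 'y': i * i} for i in range(5))
df = iterator2dataframes(docs, 2)
```

Each chunk of Python dictionaries is released once it is converted into a DataFrame, which is where the memory saving comes from.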

If you happen to need to generate multiple DataFrame rows for each MongoDB document (with additional progress-bar support via tqdm):

```python
from tqdm import tqdm

def iterator2dataframe(iterator,
                       chunk_size: int,
                       func=None,
                       **kwargs) -> pd.DataFrame:
    """Turn a Mongo iterator into multiple small pandas.DataFrame

    This is a balance between memory and efficiency.
    If there is no result, return an empty pandas.DataFrame.

    Args:
      iterator: an iterator
      chunk_size: the row size of each small pandas.DataFrame
      func: generator to transform each record
      kwargs: extra parameters passed to tqdm.tqdm

    Returns:
      pandas.DataFrame
    """
    records = []
    frames = []
    for i, record in enumerate(tqdm(iterator, **kwargs)):
        if func:
            for new_record in func(record):
                records.append(new_record)
        else:
            records.append(record)
        if i % chunk_size == chunk_size - 1:
            frames.append(pd.DataFrame(records))
            records = []
    if records:
        frames.append(pd.DataFrame(records))
    return pd.concat(frames) if frames else pd.DataFrame()

# func parameter example: must be a generator
def record_result(record):
    for interaction in record['xxx']:
        yield {
            'a': record['_id'],
            'b': interaction['b']
        }

data = iterator2dataframe(collection.find(), 1000, func=record_result)
```
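As an aside (not from the original post): for the simple one-to-many case that a flattening generator like `record_result` handles, pandas also ships a built-in helper, `pd.json_normalize`. The field names below mirror the placeholder 'xxx' example, with made-up sample documents:

```python
import pandas as pd

# Two fake documents shaped like the record_result example above
docs = [
    {'_id': 1, 'xxx': [{'b': 10}, {'b': 20}]},
    {'_id': 2, 'xxx': [{'b': 30}]},
]

# One output row per element of 'xxx', carrying '_id' along as metadata
df = pd.json_normalize(docs, record_path='xxx', meta=['_id'])
```

Note that `json_normalize` materializes everything at once, so it does not give you the chunked memory behavior of `iterator2dataframe`.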

In fact, pandas.DataFrame supports reading a generator directly as input; however, the memory management seems to be bad:

```python
# not recommended
pd.DataFrame(doc for doc in collection.find())
```

Alternatively, if you are just dumping a whole collection, you could try odo's MongoDB backend. I haven't really tried it yet; I shall do a comparison in a future update.

Zhanzhao Deo Liang