How do I import data from MongoDB into pandas?
I need to analyze a large amount of data held in MongoDB. How do I import this data into pandas?
I am new to pandas and numpy.
Edit: the MongoDB collection contains sensor values tagged with date and time. The sensor values are of float datatype.
Sample data:
{ "_cls" : "SensorReport", "_id" : ObjectId("515a963b78f6a035d9fa531b"), "_types" : [ "SensorReport" ], "Readings" : [ { "a" : 0.958069536790466, "_types" : [ "Reading" ], "ReadingUpdatedDate" : ISODate("2013-04-02T08:26:35.297Z"), "b" : 6.296118156595, "_cls" : "Reading" }, { "a" : 0.95574014778624, "_types" : [ "Reading" ], "ReadingUpdatedDate" : ISODate("2013-04-02T08:27:09.963Z"), "b" : 6.29651468650064, "_cls" : "Reading" }, { "a" : 0.953648289182713, "_types" : [ "Reading" ], "ReadingUpdatedDate" : ISODate("2013-04-02T08:27:37.545Z"), "b" : 7.29679823731148, "_cls" : "Reading" }, { "a" : 0.955931884300997, "_types" : [ "Reading" ], "ReadingUpdatedDate" : ISODate("2013-04-02T08:28:21.369Z"), "b" : 6.29642922525632, "_cls" : "Reading" }, { "a" : 0.95821381, "_types" : [ "Reading" ], "ReadingUpdatedDate" : ISODate("2013-04-02T08:41:20.801Z"), "b" : 7.28956613, "_cls" : "Reading" }, { "a" : 4.95821335, "_types" : [ "Reading" ], "ReadingUpdatedDate" : ISODate("2013-04-02T08:41:36.931Z"), "b" : 6.28956574, "_cls" : "Reading" }, { "a" : 9.95821341, "_types" : [ "Reading" ], "ReadingUpdatedDate" : ISODate("2013-04-02T08:42:09.971Z"), "b" : 0.28956488, "_cls" : "Reading" }, { "a" : 1.95667927, "_types" : [ "Reading" ], "ReadingUpdatedDate" : ISODate("2013-04-02T08:43:55.463Z"), "b" : 0.29115237, "_cls" : "Reading" } ], "latestReportTime" : ISODate("2013-04-02T08:43:55.463Z"), "sensorName" : "56847890-0", "reportCount" : 8 }
pymongo might give you a hand; the following is some code I'm using:
import pandas as pd
from pymongo import MongoClient


def _connect_mongo(host, port, username, password, db):
    """ A util for making a connection to mongo """
    if username and password:
        mongo_uri = 'mongodb://%s:%s@%s:%s/%s' % (username, password, host, port, db)
        conn = MongoClient(mongo_uri)
    else:
        conn = MongoClient(host, port)
    return conn[db]


def read_mongo(db, collection, query={}, host='localhost', port=27017, username=None, password=None, no_id=True):
    """ Read from Mongo and Store into DataFrame """

    # Connect to MongoDB
    db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)

    # Make a query to the specific DB and Collection
    cursor = db[collection].find(query)

    # Expand the cursor and construct the DataFrame
    df = pd.DataFrame(list(cursor))

    # Delete the _id
    if no_id:
        del df['_id']

    return df
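For example, a hypothetical call (the database and collection names below are placeholders; the query filters on the sensorName field from the sample document):

# 'testdb' and 'SensorReport' are placeholder names; substitute your own.
df = read_mongo('testdb', 'SensorReport', query={'sensorName': '56847890-0'})
print(df.head())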
Monary does exactly that, and it's super fast. (another link)
See this cool post, which includes a quick tutorial and some timings.
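A minimal sketch of Monary's numpy-array query path, assuming a local server; the database, collection, and field names below are made up for illustration:

from monary import Monary
import pandas as pd

# Placeholder host, database, collection, and field; the last argument
# is a list of Monary type strings matching the requested fields.
with Monary('127.0.0.1') as monary:
    # query(db, collection, spec, field_names, field_types) -> list of numpy arrays
    arrays = monary.query('testdb', 'SensorReport', {}, ['reportCount'], ['int32'])

df = pd.DataFrame({'reportCount': arrays[0]})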
You can use this code to load your MongoDB data into a pandas DataFrame. It works for me; hopefully it will for you too.
import pymongo
import pandas as pd
from pymongo import MongoClient

client = MongoClient()
db = client.database_name
collection = db.collection_name
data = pd.DataFrame(list(collection.find()))
import pandas as pd
from odo import odo

data = odo('mongodb://localhost/db::collection', pd.DataFrame)
http://docs.mongodb.org/manual/reference/mongoexport
Export to CSV and use read_csv,
or export to JSON and use DataFrame.from_records.
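A sketch of the CSV route with placeholder names (note that mongoexport's --type=csv output needs an explicit --fields list):

# Run in a shell first (placeholder database/collection/field names):
#   mongoexport --db testdb --collection SensorReport --type=csv \
#       --fields sensorName,reportCount --out sensors.csv
import pandas as pd

df = pd.read_csv('sensors.csv')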
For processing out-of-core (not fitting into RAM) data efficiently (i.e., with parallel execution), you can try the Python Blaze ecosystem: Blaze / Dask / Odo.
Blaze (and Odo) has out-of-the-box functions to deal with MongoDB.
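A rough sketch of that route, assuming the odo-style 'db::collection' URI below points at your data (all names are placeholders):

import pandas as pd
from blaze import Data
from odo import odo

# Placeholder URI: database 'testdb', collection 'SensorReport'
d = Data('mongodb://localhost/testdb::SensorReport')

# Blaze expressions are lazy; odo materializes the selected fields
# into a pandas DataFrame
df = odo(d[['sensorName', 'reportCount']], pd.DataFrame)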
A few useful articles to start off:
- Introducing Blaze Expressions (with a MongoDB query example)
- ReproduceIt: Reddit word count
- Differences between Dask Arrays and Blaze
And an article which shows what amazing things are possible with the Blaze stack: Analyzing 1.7 Billion Reddit Comments with Blaze and Impala (essentially, querying 975 GB of Reddit comments in seconds).
P.S. I'm not affiliated with any of these technologies.
Following PEP 20, simple is better than complex:
import pandas as pd
from pymongo import MongoClient

# 'db' here is a client; <database_name> and <collection_name> are placeholders
db = MongoClient()
df = pd.DataFrame.from_records(db.<database_name>.<collection_name>.find())
You can include conditions just as you would when querying a regular MongoDB database, and you can even use find_one() to fetch only a single element from the database, etc.
Voila!
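For instance, a sketch with placeholder names, filtering on the sensorName field from the sample document:

import pandas as pd
from pymongo import MongoClient

client = MongoClient()
collection = client.database_name.collection_name  # placeholder names

# find() accepts the same condition documents as the mongo shell
df = pd.DataFrame.from_records(collection.find({'sensorName': '56847890-0'}))

# find_one() returns a single matching document as a dict, not a cursor
doc = collection.find_one({'sensorName': '56847890-0'})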
Using pandas.DataFrame(list(...)) will consume a lot of memory if the iterator/generator result is large. It is better to generate small chunks and concatenate them at the end:
import pandas as pd


def iterator2dataframes(iterator, chunk_size: int):
    """Turn an iterator into multiple small pandas.DataFrame

    This is a balance between memory and efficiency
    """
    records = []
    frames = []
    for i, record in enumerate(iterator):
        records.append(record)
        if i % chunk_size == chunk_size - 1:
            frames.append(pd.DataFrame(records))
            records = []
    if records:
        frames.append(pd.DataFrame(records))
    return pd.concat(frames)
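A hypothetical call against a pymongo cursor (the connection and names are placeholders):

from pymongo import MongoClient

client = MongoClient()  # assumes a local mongod
cursor = client.database_name.collection_name.find()  # placeholder names

# Build the DataFrame in chunks of 10000 records to limit peak memory
df = iterator2dataframes(cursor, 10000)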