How to save/load a scipy sparse csr_matrix in a portable data format?
The scipy sparse matrix is created on Python 3 (Windows 64-bit) and needs to run on Python 2 (Linux 64-bit). Initially I used pickle (with protocol=2 and fix_imports=True), but it did not work going from Python 3.2.2 (Windows 64-bit) to Python 2.7.2 (Windows 32-bit), and I got the error:

    TypeError: ('data type not understood', <built-in function _reconstruct>, (<type 'numpy.ndarray'>, (0,), '[98]'))
Next I tried numpy.save and numpy.load, as well as scipy.io.mmwrite() and scipy.io.mmread(), and none of these methods worked either.
I got an answer from the Scipy user group:
A csr_matrix has three data attributes that matter: .data, .indices, and .indptr. All are simple ndarrays, so numpy.save will work on them. Save the three arrays with numpy.save or numpy.savez, load them back with numpy.load, and then recreate the sparse matrix object with:

    new_csr = csr_matrix((data, indices, indptr), shape=(M, N))
For example:
    import numpy as np
    from scipy.sparse import csr_matrix

    def save_sparse_csr(filename, array):
        np.savez(filename, data=array.data, indices=array.indices,
                 indptr=array.indptr, shape=array.shape)

    def load_sparse_csr(filename):
        loader = np.load(filename)
        return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                          shape=loader['shape'])
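A quick round trip with these helpers (a minimal sketch; the m.npz filename is arbitrary, and since it already ends in .npz, np.savez keeps it as-is):

    m = csr_matrix(np.eye(3))     # small test matrix
    save_sparse_csr('m.npz', m)
    m2 = load_sparse_csr('m.npz')
    print((m != m2).nnz)          # 0 means the round trip preserved the matrix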
Although you write that scipy.io.mmwrite and scipy.io.mmread do not work for you, I just want to add how they do work. This question is the number one Google hit, so I started with np.savez and pickle.dump myself before switching to the simple, obvious scipy functions. They work for me and shouldn't be overlooked by those who haven't tried them yet.
    from scipy import sparse, io

    m = sparse.csr_matrix([[0,0,0],[1,0,0],[0,1,0]])
    m              # <3x3 sparse matrix of type '<type 'numpy.int64'>' with 2 stored elements in Compressed Sparse Row format>

    io.mmwrite("test.mtx", m)
    del m

    newm = io.mmread("test.mtx")
    newm           # <3x3 sparse matrix of type '<type 'numpy.int32'>' with 2 stored elements in COOrdinate format>
    newm.tocsr()   # <3x3 sparse matrix of type '<type 'numpy.int32'>' with 2 stored elements in Compressed Sparse Row format>
    newm.toarray() # array([[0, 0, 0], [1, 0, 0], [0, 1, 0]], dtype=int32)
Here is a performance comparison of the three most upvoted answers, run in a Jupyter notebook. The input is a 1M x 100K random sparse matrix with density 0.001, containing 100M non-zero values:
    from scipy.sparse import random
    matrix = random(1000000, 100000, density=0.001, format='csr')
    matrix
    # <1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
    #  with 100000000 stored elements in Compressed Sparse Row format>
io.mmwrite / io.mmread
    from scipy import io  # io lives in the top-level scipy package, not scipy.sparse

    %time io.mmwrite('test_io.mtx', matrix)
    CPU times: user 4min 37s, sys: 2.37 s, total: 4min 39s
    Wall time: 4min 39s

    %time matrix = io.mmread('test_io.mtx')
    CPU times: user 2min 41s, sys: 1.63 s, total: 2min 43s
    Wall time: 2min 43s

    matrix
    # <1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
    #  with 100000000 stored elements in COOrdinate format>

File size: 3.0G.
(Note that the format has changed from csr to coo.)
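If you need CSR back after loading, the conversion is a one-liner (a small addition to the benchmark code above):

    matrix = io.mmread('test_io.mtx').tocsr()  # mmread returns COO; convert back to CSR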
np.savez / np.load
    import numpy as np
    from scipy.sparse import csr_matrix

    def save_sparse_csr(filename, array):
        # note that the .npz extension is added automatically
        np.savez(filename, data=array.data, indices=array.indices,
                 indptr=array.indptr, shape=array.shape)

    def load_sparse_csr(filename):
        # here we need to add the .npz extension manually
        loader = np.load(filename + '.npz')
        return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                          shape=loader['shape'])

    %time save_sparse_csr('test_savez', matrix)
    CPU times: user 1.26 s, sys: 1.48 s, total: 2.74 s
    Wall time: 2.74 s

    %time matrix = load_sparse_csr('test_savez')
    CPU times: user 1.18 s, sys: 548 ms, total: 1.73 s
    Wall time: 1.73 s

    matrix
    # <1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
    #  with 100000000 stored elements in Compressed Sparse Row format>

File size: 1.1G.
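If file size matters more than speed, np.savez_compressed is a drop-in replacement for np.savez; it was not part of this benchmark, so the following save_sparse_csr_compressed variant is an untested sketch:

    def save_sparse_csr_compressed(filename, array):
        # same arrays as above, but zlib-compressed inside the .npz archive
        np.savez_compressed(filename, data=array.data, indices=array.indices,
                            indptr=array.indptr, shape=array.shape)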
cPickle
    import cPickle as pickle

    def save_pickle(matrix, filename):
        with open(filename, 'wb') as outfile:
            pickle.dump(matrix, outfile, pickle.HIGHEST_PROTOCOL)

    def load_pickle(filename):
        with open(filename, 'rb') as infile:
            matrix = pickle.load(infile)
        return matrix

    %time save_pickle(matrix, 'test_pickle.mtx')
    CPU times: user 260 ms, sys: 888 ms, total: 1.15 s
    Wall time: 1.15 s

    %time matrix = load_pickle('test_pickle.mtx')
    CPU times: user 376 ms, sys: 988 ms, total: 1.36 s
    Wall time: 1.37 s

    matrix
    # <1000000x100000 sparse matrix of type '<type 'numpy.float64'>'
    #  with 100000000 stored elements in Compressed Sparse Row format>

File size: 1.1G.
Note: cPickle does not work with very large objects (see this answer). In my experience it did not work for a 2.7M x 50k matrix with 270M non-zero values; the np.savez solution worked well there.
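On Python 3.4+, pickle protocol 4 was designed to support objects larger than 4 GB, which may sidestep this limit; a minimal, untested sketch (on Python 3, cPickle was merged into pickle):

    import pickle  # Python 3: the cPickle module was merged into pickle

    def save_pickle4(matrix, filename):
        with open(filename, 'wb') as outfile:
            # protocol 4 (Python 3.4+) supports objects larger than 4 GB
            pickle.dump(matrix, outfile, protocol=4)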
Conclusion
Based on this simple test with CSR matrices: cPickle is the fastest method, but it does not work with very large matrices; np.savez is only slightly slower, while io.mmwrite is much slower, produces a bigger file, and restores the matrix in the wrong format. So np.savez is the winner here.
Assuming you have scipy on both machines, you can just use pickle.
However, be sure to specify a binary protocol when pickling numpy arrays; otherwise you'll wind up with a huge file.
At any rate, you should be able to do this:
    import cPickle as pickle
    import numpy as np
    import scipy.sparse

    # Just for testing, let's make a dense array and convert it to a csr_matrix
    x = np.random.random((10,10))
    x = scipy.sparse.csr_matrix(x)

    with open('test_sparse_array.dat', 'wb') as outfile:
        pickle.dump(x, outfile, pickle.HIGHEST_PROTOCOL)
Then you can load it with:
    import cPickle as pickle

    with open('test_sparse_array.dat', 'rb') as infile:
        x = pickle.load(infile)
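A quick sanity check that the round trip preserved the matrix (a minimal sketch; != on two sparse matrices returns a sparse boolean matrix, so zero stored entries means they are identical):

    roundtrip = pickle.loads(pickle.dumps(x, pickle.HIGHEST_PROTOCOL))
    print((x != roundtrip).nnz)  # 0: pickling preserved every entry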
You can now use scipy.sparse.save_npz and scipy.sparse.load_npz (see the scipy documentation). As of scipy 0.19.0, you can save and load a sparse matrix like this:
    from scipy import sparse

    data = sparse.csr_matrix((3, 4))

    # Save
    sparse.save_npz('data_sparse.npz', data)

    # Load
    data = sparse.load_npz("data_sparse.npz")
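As far as I know, save_npz only accepts the array-backed formats (csr, csc, bsr, dia, coo), not lil or dok, so those need converting first; a small sketch:

    m = sparse.lil_matrix((3, 4))
    m[0, 1] = 5
    sparse.save_npz('m_lil.npz', m.tocsr())    # save_npz rejects lil directly
    m2 = sparse.load_npz('m_lil.npz').tolil()  # convert back after loading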
Here is what I use to save a lil_matrix:
    import numpy as np
    from scipy.sparse import lil_matrix

    def save_sparse_lil(filename, array):
        # use np.savez_compressed(..) for compression
        np.savez(filename, dtype=array.dtype.str, data=array.data,
                 rows=array.rows, shape=array.shape)

    def load_sparse_lil(filename):
        # allow_pickle=True is required: .data and .rows are object arrays
        loader = np.load(filename, allow_pickle=True)
        result = lil_matrix(tuple(loader["shape"]), dtype=str(loader["dtype"]))
        result.data = loader["data"]
        result.rows = loader["rows"]
        return result
I have to say I found numpy's np.load(..) very slow here. This is my current solution, which I feel runs much faster:
    from scipy.sparse import lil_matrix
    import numpy as np
    import json

    def lil_matrix_to_dict(myarray):
        result = {
            "dtype": myarray.dtype.str,
            "shape": myarray.shape,
            # tolist() turns the object arrays of per-row lists into plain
            # Python lists, which json can serialize
            "data": myarray.data.tolist(),
            "rows": myarray.rows.tolist()
        }
        return result

    def lil_matrix_from_dict(mydict):
        result = lil_matrix(tuple(mydict["shape"]), dtype=mydict["dtype"])
        # rebuild the 1-D object arrays of per-row lists that lil_matrix expects
        data = np.empty(len(mydict["data"]), dtype=object)
        rows = np.empty(len(mydict["rows"]), dtype=object)
        for i, row in enumerate(mydict["data"]):
            data[i] = row
        for i, row in enumerate(mydict["rows"]):
            rows[i] = row
        result.data = data
        result.rows = rows
        return result

    def load_lil_matrix(filename):
        with open(filename, "r", encoding="utf-8") as infile:
            mydict = json.load(infile)
        return lil_matrix_from_dict(mydict)

    def save_lil_matrix(filename, myarray):
        with open(filename, "w", encoding="utf-8") as outfile:
            json.dump(lil_matrix_to_dict(myarray), outfile)
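A hypothetical round trip with these helpers (the m.json filename is arbitrary):

    m = lil_matrix((3, 3), dtype="float64")
    m[0, 1] = 2.5
    save_lil_matrix("m.json", m)
    m2 = load_lil_matrix("m.json")
    print(np.array_equal(m.toarray(), m2.toarray()))  # True if the round trip worked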
I was asked to send the matrix in a simple and generic format:

    <x,y,value>

I ended up with this:
    import numpy as np

    def save_sparse_matrix(m, filename):
        # write one "row,col,value" line per non-zero entry
        non_zeros = np.array(m.nonzero())
        with open(filename, 'w') as thefile:
            for entry in range(non_zeros.shape[1]):
                i, j = non_zeros[0, entry], non_zeros[1, entry]
                thefile.write("%s,%s,%s\n" % (i, j, m[i, j]))
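The snippet above only saves; a hypothetical load_sparse_matrix counterpart that reads the <x,y,value> triples back into a COO matrix could look like this:

    from scipy.sparse import coo_matrix

    def load_sparse_matrix(filename):
        # each line is "row,col,value"; ndmin=2 keeps single-line files 2-D
        triples = np.loadtxt(filename, delimiter=',', ndmin=2)
        rows = triples[:, 0].astype(int)
        cols = triples[:, 1].astype(int)
        # shape is inferred from the largest indices; pass shape=... explicitly
        # if the matrix has trailing all-zero rows or columns
        return coo_matrix((triples[:, 2], (rows, cols)))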