Python中Pearson相关性及其意义的计算
我正在寻找一个以input两个列表为参数的函数,并返回Pearson相关性和相关性的显着性。
你可以看看scipy.stats
:
from pydoc import help from scipy.stats.stats import pearsonr help(pearsonr) >>> Help on function pearsonr in module scipy.stats.stats: pearsonr(x, y) Calculates a Pearson correlation coefficient and the p-value for testing non-correlation. The Pearson correlation coefficient measures the linear relationship between two datasets. Strictly speaking, Pearson's correlation requires that each dataset be normally distributed. Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact linear relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases. The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Pearson correlation at least as extreme as the one computed from these datasets. The p-values are not entirely reliable but are probably reasonable for datasets larger than 500 or so. Parameters ---------- x : 1D array y : 1D array the same length as x Returns ------- (Pearson's correlation coefficient, 2-tailed p-value) References ---------- http://www.statsoft.com/textbook/glosp.html#Pearson%20Correlation
Pearson相关性可以用numpy的corrcoef
来计算。
import numpy numpy.corrcoef(list1, list2)[0, 1]
如果你不喜欢安装scipy,我已经使用了这个快速入门,稍微修改了编程集体智慧 :
(为正确编辑)
from itertools import imap def pearsonr(x, y): # Assume len(x) == len(y) n = len(x) sum_x = float(sum(x)) sum_y = float(sum(y)) sum_x_sq = sum(map(lambda x: pow(x, 2), x)) sum_y_sq = sum(map(lambda x: pow(x, 2), y)) psum = sum(imap(lambda x, y: x * y, x, y)) num = psum - (sum_x * sum_y/n) den = pow((sum_x_sq - pow(sum_x, 2) / n) * (sum_y_sq - pow(sum_y, 2) / n), 0.5) if den == 0: return 0 return num / den
另一种select可以是来自linregress的本地scipy函数,它计算:
斜率:回归线的斜率
截距:回归线的截距
r值:相关系数
p值:假设检验的双侧p值,零假设是斜率为零
stderr:估算的标准误差
这里是一个例子:
a = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3] b = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15] from scipy.stats import linregress linregress(a, b)
会回报你:
LinregressResult(slope=0.20833333333333337, intercept=13.375, rvalue=0.14499815458068521, pvalue=0.68940144811669501, stderr=0.50261704627083648)
以下代码是定义的直接解释:
import math def average(x): assert len(x) > 0 return float(sum(x)) / len(x) def pearson_def(x, y): assert len(x) == len(y) n = len(x) assert n > 0 avg_x = average(x) avg_y = average(y) diffprod = 0 xdiff2 = 0 ydiff2 = 0 for idx in range(n): xdiff = x[idx] - avg_x ydiff = y[idx] - avg_y diffprod += xdiff * ydiff xdiff2 += xdiff * xdiff ydiff2 += ydiff * ydiff return diffprod / math.sqrt(xdiff2 * ydiff2)
testing:
print pearson_def([1,2,3], [1,5,7])
回报
0.981980506062
这与Excel, 这个计算器 , SciPy (也是NumPy )一致,分别返回0.981980506和0.9819805060619657和0.98198050606196574。
R :
> cor( c(1,2,3), c(1,5,7)) [1] 0.9819805
编辑 :修正了一个评论者指出的错误。
我认为我的答案应该是最简单的编码和理解计算Pearson相关系数(PCC) 的步骤 ,而不是依赖于numpy / scipy。
import math # calculates the mean def mean(x): sum = 0.0 for i in x: sum += i return sum / len(x) # calculates the sample standard deviation def sampleStandardDeviation(x): sumv = 0.0 for i in x: sumv += (i - mean(x))**2 return math.sqrt(sumv/(len(x)-1)) # calculates the PCC using both the 2 functions above def pearson(x,y): scorex = [] scorey = [] for i in x: scorex.append((i - mean(x))/sampleStandardDeviation(x)) for j in y: scorey.append((j - mean(y))/sampleStandardDeviation(y)) # multiplies both lists together into 1 list (hence zip) and sums the whole list return (sum([i*j for i,j in zip(scorex,scorey)]))/(len(x)-1)
PCC的意义基本上是向你展示两个variables/列表之间的强相关性 。 值得注意的是,PCC值范围从-1到1 。 0到1之间的值表示正相关。 0的值=最高的变化(不相关)。 -1到0之间的值表示负相关。
嗯,这些反应很多很长很难阅读代码…
我build议在使用数组时使用numpy的漂亮function:
import numpy as np def pcc(X, Y): ''' Compute Pearson Correlation Coefficient. ''' # Normalise X and Y X -= X.mean(0) Y -= Y.mean(0) # Standardise X and Y X /= X.std(0) Y /= Y.std(0) # Compute mean product return np.mean(X*Y) # Using it on a random example from random import random X = np.array([random() for x in xrange(100)]) Y = np.array([random() for x in xrange(100)]) pcc(X, Y)
你也可以用pandas.DataFrame.corr
来做到这一点:
import pandas as pd a = [[1, 2, 3], [5, 6, 9], [5, 6, 11], [5, 6, 13], [5, 3, 13]] df = pd.DataFrame(data=a) df.corr()
这给了
0 1 2 0 1.000000 0.745601 0.916579 1 0.745601 1.000000 0.544248 2 0.916579 0.544248 1.000000
这里是基于稀疏向量的皮尔森相关的实现。 这里的向量表示为(index,value)表示的元组列表。 这两个稀疏vector可以有不同的长度,但是在所有的vector大小必须是相同的。 这对于文本挖掘应用程序是非常有用的,因为大多数特征是包含单词的vector大小非常大,因此通常使用稀疏vector执行计算。
def get_pearson_corelation(self, first_feature_vector=[], second_feature_vector=[], length_of_featureset=0): indexed_feature_dict = {} if first_feature_vector == [] or second_feature_vector == [] or length_of_featureset == 0: raise ValueError("Empty feature vectors or zero length of featureset in get_pearson_corelation") sum_a = sum(value for index, value in first_feature_vector) sum_b = sum(value for index, value in second_feature_vector) avg_a = float(sum_a) / length_of_featureset avg_b = float(sum_b) / length_of_featureset mean_sq_error_a = sqrt((sum((value - avg_a) ** 2 for index, value in first_feature_vector)) + (( length_of_featureset - len(first_feature_vector)) * ((0 - avg_a) ** 2))) mean_sq_error_b = sqrt((sum((value - avg_b) ** 2 for index, value in second_feature_vector)) + (( length_of_featureset - len(second_feature_vector)) * ((0 - avg_b) ** 2))) covariance_a_b = 0 #calculate covariance for the sparse vectors for tuple in first_feature_vector: if len(tuple) != 2: raise ValueError("Invalid feature frequency tuple in featureVector: %s") % (tuple,) indexed_feature_dict[tuple[0]] = tuple[1] count_of_features = 0 for tuple in second_feature_vector: count_of_features += 1 if len(tuple) != 2: raise ValueError("Invalid feature frequency tuple in featureVector: %s") % (tuple,) if tuple[0] in indexed_feature_dict: covariance_a_b += ((indexed_feature_dict[tuple[0]] - avg_a) * (tuple[1] - avg_b)) del (indexed_feature_dict[tuple[0]]) else: covariance_a_b += (0 - avg_a) * (tuple[1] - avg_b) for index in indexed_feature_dict: count_of_features += 1 covariance_a_b += (indexed_feature_dict[index] - avg_a) * (0 - avg_b) #adjust covariance with rest of vector with 0 value covariance_a_b += (length_of_featureset - count_of_features) * -avg_a * -avg_b if mean_sq_error_a == 0 or mean_sq_error_b == 0: return -1 else: return float(covariance_a_b) / (mean_sq_error_a * mean_sq_error_b)
unit testing:
def test_get_get_pearson_corelation(self): vector_a = [(1, 1), (2, 2), (3, 3)] vector_b = [(1, 1), (2, 5), (3, 7)] self.assertAlmostEquals(self.sim_calculator.get_pearson_corelation(vector_a, vector_b, 3), 0.981980506062, 3, None, None) vector_a = [(1, 1), (2, 2), (3, 3)] vector_b = [(1, 1), (2, 5), (3, 7), (4, 14)] self.assertAlmostEquals(self.sim_calculator.get_pearson_corelation(vector_a, vector_b, 5), -0.0137089240555, 3, None, None)
这是一个mkh的答案,运行速度比它更快的变种,scipy.stats.pearsonr,使用Numba。
import numba @numba.jit def corr(data1, data2): M = data1.size sum1 = 0. sum2 = 0. for i in range(M): sum1 += data1[i] sum2 += data2[i] mean1 = sum1 / M mean2 = sum2 / M var_sum1 = 0. var_sum2 = 0. cross_sum = 0. for i in range(M): var_sum1 += (data1[i] - mean1) ** 2 var_sum2 += (data2[i] - mean2) ** 2 cross_sum += (data1[i] * data2[i]) std1 = (var_sum1 / M) ** .5 std2 = (var_sum2 / M) ** .5 cross_mean = cross_sum / M return (cross_mean - mean1 * mean2) / (std1 * std2)
这是一个使用numpy的Pearson相关函数的实现:
def corr(data1, data2): "data1 & data2 should be numpy arrays." mean1 = data1.mean() mean2 = data2.mean() std1 = data1.std() std2 = data2.std() # corr = ((data1-mean1)*(data2-mean2)).mean()/(std1*std2) corr = ((data1*data2).mean()-mean1*mean2)/(std1*std2) return corr
def corr(data1, data2): "data1 & data2 should be numpy arrays." mean1 = data1.mean() mean2 = data2.mean() std1 = data1.std() std2 = data2.std() # corr = ((data1-mean1)*(data2-mean2)).mean()/(std1*std2) corr = ((data1*data2).mean()-mean1*mean2)/(std1*std2) return corr
你可以看看这篇文章。 这是一个logging良好的例子,它使用pandas图书馆(对于Python)根据来自多个文件的历史外汇货币对数据计算相关性,然后使用seaborn库生成热图图。
http://www.tradinggeeks.net/2015/08/calculating-correlation-in-python/
你可能想知道如何在寻找特定方向的相关性(负相关或正相关)的情况下解释你的概率。下面是我写的一个函数来帮助解决这个问题。 这可能是对的!
# Given (possibly random) variables, X and Y, and a correlation direction, # returns: # (r, p), # where r is the Pearson correlation coefficient, and p is the probability # that there is no correlation in the given direction. # # direction: # if positive, p is the probability that there is no positive correlation in # the population sampled by X and Y # if negative, p is the probability that there is no negative correlation # if 0, p is the probability that there is no correlation in either direction def probabilityNotCorrelated(X, Y, direction=0): x = len(X) if x != len(Y): raise ValueError("variables not same len: " + str(x) + ", and " + \ str(len(Y))) if x < 6: raise ValueError("must have at least 6 samples, but have " + str(x)) (corr, prb_2_tail) = stats.pearsonr(X, Y) if not direction: return (corr, prb_2_tail) prb_1_tail = prb_2_tail / 2 if corr * direction > 0: return (corr, prb_1_tail) return (corr, 1 - prb_1_tail)