一维观测数据中exception值检测的Pythonic方法
对于给定的数据,我想设置离群值(由95%confidense级别或95%分位数函数或任何需要的东西)定义为nan值。 以下是我现在使用的数据和代码。 如果有人能够进一步解释我,我会很高兴。
import numpy as np, matplotlib.pyplot as plt data = np.random.rand(1000)+5.0 plt.plot(data) plt.xlabel('observation number') plt.ylabel('recorded value') plt.show()
使用percentile
的问题是,标识为exception值的点是您样本大小的函数。
testingexception值的方法有很多,您应该仔细考虑如何对它们进行分类。 理想情况下,你应该使用先验信息(例如“任何高于/低于这个值是不现实的,因为…”)
然而,一个常见的,不太不合理的离群testing是根据“中位数绝对偏差”来删除分数。
下面是一个N维情况的实现(从这里的一些论文代码: https : //github.com/joferkington/oost_paper_code/blob/master/utilities.py ):
def is_outlier(points, thresh=3.5): """ Returns a boolean array with True if points are outliers and False otherwise. Parameters: ----------- points : An numobservations by numdimensions array of observations thresh : The modified z-score to use as a threshold. Observations with a modified z-score (based on the median absolute deviation) greater than this value will be classified as outliers. Returns: -------- mask : A numobservations-length boolean array. References: ---------- Boris Iglewicz and David Hoaglin (1993), "Volume 16: How to Detect and Handle Outliers", The ASQC Basic References in Quality Control: Statistical Techniques, Edward F. Mykytka, Ph.D., Editor. """ if len(points.shape) == 1: points = points[:,None] median = np.median(points, axis=0) diff = np.sum((points - median)**2, axis=-1) diff = np.sqrt(diff) med_abs_deviation = np.median(diff) modified_z_score = 0.6745 * diff / med_abs_deviation return modified_z_score > thresh
这与我以前的答案非常相似,但我想详细说明样本大小效应。
让我们比较一个基于百分位数的exception值检验(类似于@ CTZhu的回答)和各种不同样本量的中值绝对偏差(MAD)检验:
import numpy as np import matplotlib.pyplot as plt import seaborn as sns def main(): for num in [10, 50, 100, 1000]: # Generate some data x = np.random.normal(0, 0.5, num-3) # Add three outliers... x = np.r_[x, -3, -10, 12] plot(x) plt.show() def mad_based_outlier(points, thresh=3.5): if len(points.shape) == 1: points = points[:,None] median = np.median(points, axis=0) diff = np.sum((points - median)**2, axis=-1) diff = np.sqrt(diff) med_abs_deviation = np.median(diff) modified_z_score = 0.6745 * diff / med_abs_deviation return modified_z_score > thresh def percentile_based_outlier(data, threshold=95): diff = (100 - threshold) / 2.0 minval, maxval = np.percentile(data, [diff, 100 - diff]) return (data < minval) | (data > maxval) def plot(x): fig, axes = plt.subplots(nrows=2) for ax, func in zip(axes, [percentile_based_outlier, mad_based_outlier]): sns.distplot(x, ax=ax, rug=True, hist=False) outliers = x[func(x)] ax.plot(outliers, np.zeros_like(outliers), 'ro', clip_on=False) kwargs = dict(y=0.95, x=0.05, ha='left', va='top') axes[0].set_title('Percentile-based Outliers', **kwargs) axes[1].set_title('MAD-based Outliers', **kwargs) fig.suptitle('Comparing Outlier Tests with n={}'.format(len(x)), size=14) main()
请注意,无论样本大小如何,基于MAD的分类器都能正常工作,而基于百分位的分类器将更多的点分类为样本量越大,无论它们是否实际是exception值。
我已经调整了http://eurekastatistics.com/using-the-median-absolute-deviation-to-find-outliers的代码,它给出了与Joe Kington相同的结果,但是使用了L1距离而不是L2距离,支持不对称分配。 原来的R代码没有Joe的0.6745乘数,所以我也在这个线程中join了一致性。 不是100%确定是否有必要,而是将苹果与苹果进行比较。
def doubleMADsfromMedian(y,thresh=3.5): # warning: this function does not check for NAs # nor does it address issues when # more than 50% of your data have identical values m = np.median(y) abs_dev = np.abs(y - m) left_mad = np.median(abs_dev[y <= m]) right_mad = np.median(abs_dev[y >= m]) y_mad = left_mad * np.ones(len(y)) y_mad[y > m] = right_mad modified_z_score = 0.6745 * abs_dev / y_mad modified_z_score[y == m] = 0 return modified_z_score > thresh
一维数据中exception值的检测取决于其分布
1- 正态分布 :
- 数据值在预期范围内几乎是均匀分布的:在这种情况下,您可以轻松使用所有包含均值的方法,例如对于正态分布数据(中心极限值),相应的3或2个标准偏差(95%或99.7%)的置信区间定理和样本均值的抽样分布)。我是一个非常有效的方法。 可汗学院统计和概率 – 抽样分布图书馆的解释。
另一种方法是预测区间,如果你想要数据点的置信区间而不是平均值。
-
数据值是在一个范围内随机分布的 :平均值可能不是一个公平的数据表示,因为平均值很容易受到exception值的影响(数据集中非常小或很大的值不是典型值)中值是另一种测量数字数据集的中心。
中值绝对偏差 – 一种测量所有点距离中位数距离的方法http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm – 具有很好的解释乔·金顿在上面的回答中解释道
2 – 对称分布 :如果z分数计算和阈值相应地改变,那么再次中值绝对偏差是一个好方法
说明: http : //eurekastatistics.com/using-the-median-absolute-deviation-to-find-outliers/
3 – 不对称分布:双MAD – 双中值绝对偏差解释在上面的链接
附上我的Python代码以供参考:
def is_outlier_doubleMAD(self,points): """ FOR ASSYMMETRIC DISTRIBUTION Returns : filtered array excluding the outliers Parameters : the actual data Points array Calculates median to divide data into 2 halves.(skew conditions handled) Then those two halves are treated as separate data with calculation same as for symmetric distribution.(first answer) Only difference being , the thresholds are now the median distance of the right and left median with the actual data median """ if len(points.shape) == 1: points = points[:,None] median = np.median(points, axis=0) medianIndex = (points.size/2) leftData = np.copy(points[0:medianIndex]) rightData = np.copy(points[medianIndex:points.size]) median1 = np.median(leftData, axis=0) diff1 = np.sum((leftData - median1)**2, axis=-1) diff1 = np.sqrt(diff1) median2 = np.median(rightData, axis=0) diff2 = np.sum((rightData - median2)**2, axis=-1) diff2 = np.sqrt(diff2) med_abs_deviation1 = max(np.median(diff1),0.000001) med_abs_deviation2 = max(np.median(diff2),0.000001) threshold1 = ((median-median1)/med_abs_deviation1)*3 threshold2 = ((median2-median)/med_abs_deviation2)*3 #if any threshold is 0 -> no outliers if threshold1==0: threshold1 = sys.maxint if threshold2==0: threshold2 = sys.maxint #multiplied by a factor so that only the outermost points are removed modified_z_score1 = 0.6745 * diff1 / med_abs_deviation1 modified_z_score2 = 0.6745 * diff2 / med_abs_deviation2 filtered1 = [] i = 0 for data in modified_z_score1: if data < threshold1: filtered1.append(leftData[i]) i += 1 i = 0 filtered2 = [] for data in modified_z_score2: if data < threshold2: filtered2.append(rightData[i]) i += 1 filtered = filtered1 + filtered2 return filtered
使用np.percentile
作为@Martinbuild议:
In [33]: P=np.percentile(A, [2.5, 97.5]) In [34]: A[(P[0]<A)&(P[1]>A)] #or =>, <= for within 95% A[(P[0]>A)|(P[1]<A)]=np.nan #to set the outliners to np.nan
那么一个简单的解决scheme也可以,去除超出2个标准偏差(或1.96)的东西:
def outliers(tmp): """tmp is a list of numbers""" outs = [] mean = sum(tmp)/(1.0*len(tmp)) var = sum((tmp[i] - mean)**2 for i in range(0, len(tmp)))/(1.0*len(tmp)) std = var**0.5 outs = [tmp[i] for i in xrange(0, len(tmp)) if abs(tmp[i]-mean) > 1.96*std] return outs lst = [random.randrange(-10, 55) for _ in range(40)] print lst print outliers(lst)