find曲线上的最佳折衷点
说我有一些数据,为此我想在它上面安装一个参数化模型。 我的目标是find这个模型参数的最佳值。
我正在使用AIC / BIC / MDLtypes的标准进行模型select,这种标准奖励低误差的模型,但也惩罚高复杂度的模型(我们正在寻求最简单但最令人信服的解释,可以这么说, 奥卡姆的剃刀 )。
接着上面,这是我得到的三种不同的标准(两个要最小化,一个要最大化)的例子:
在视觉上,你可以很容易地看到肘部的形状,你会为该地区的某个地方的参数select一个值。 问题是我正在做大量的实验,我需要一种方法来find这个值,而不需要干预。
我的第一个直觉就是试图从angular落画出一个45度angular的直线,并继续移动直到它与曲线相交,但是说起来容易做起来难。:)如果曲线有些偏斜,也可能会错过感兴趣的区域。
任何想法如何实现这个,或更好的想法?
以下是重现上述其中一个地块所需的样本:
curve = [8.4663 8.3457 5.4507 5.3275 4.8305 4.7895 4.6889 4.6833 4.6819 4.6542 4.6501 4.6287 4.6162 4.585 4.5535 4.5134 4.474 4.4089 4.3797 4.3494 4.3268 4.3218 4.3206 4.3206 4.3203 4.2975 4.2864 4.2821 4.2544 4.2288 4.2281 4.2265 4.2226 4.2206 4.2146 4.2144 4.2114 4.1923 4.19 4.1894 4.1785 4.178 4.1694 4.1694 4.1694 4.1556 4.1498 4.1498 4.1357 4.1222 4.1222 4.1217 4.1192 4.1178 4.1139 4.1135 4.1125 4.1035 4.1025 4.1023 4.0971 4.0969 4.0915 4.0915 4.0914 4.0836 4.0804 4.0803 4.0722 4.065 4.065 4.0649 4.0644 4.0637 4.0616 4.0616 4.061 4.0572 4.0563 4.056 4.0545 4.0545 4.0522 4.0519 4.0514 4.0484 4.0467 4.0463 4.0422 4.0392 4.0388 4.0385 4.0385 4.0383 4.038 4.0379 4.0375 4.0364 4.0353 4.0344]; plot(1:100, curve)
编辑
我接受了乔纳斯给出的解决scheme。 基本上,对于曲线上的每个点p
,我们可以find最大距离为d
那个:
find肘部的快速方法是从曲线的第一个点到最后一个点绘制一条线,然后find距该线最远的数据点。
这当然有点依赖于你在线的平坦部分的点数,但如果你每次testing相同数量的参数,它应该出来合理的好。
curve = [8.4663 8.3457 5.4507 5.3275 4.8305 4.7895 4.6889 4.6833 4.6819 4.6542 4.6501 4.6287 4.6162 4.585 4.5535 4.5134 4.474 4.4089 4.3797 4.3494 4.3268 4.3218 4.3206 4.3206 4.3203 4.2975 4.2864 4.2821 4.2544 4.2288 4.2281 4.2265 4.2226 4.2206 4.2146 4.2144 4.2114 4.1923 4.19 4.1894 4.1785 4.178 4.1694 4.1694 4.1694 4.1556 4.1498 4.1498 4.1357 4.1222 4.1222 4.1217 4.1192 4.1178 4.1139 4.1135 4.1125 4.1035 4.1025 4.1023 4.0971 4.0969 4.0915 4.0915 4.0914 4.0836 4.0804 4.0803 4.0722 4.065 4.065 4.0649 4.0644 4.0637 4.0616 4.0616 4.061 4.0572 4.0563 4.056 4.0545 4.0545 4.0522 4.0519 4.0514 4.0484 4.0467 4.0463 4.0422 4.0392 4.0388 4.0385 4.0385 4.0383 4.038 4.0379 4.0375 4.0364 4.0353 4.0344]; %# get coordinates of all the points nPoints = length(curve); allCoord = [1:nPoints;curve]'; %'# SO formatting %# pull out first point firstPoint = allCoord(1,:); %# get vector between first and last point - this is the line lineVec = allCoord(end,:) - firstPoint; %# normalize the line vector lineVecN = lineVec / sqrt(sum(lineVec.^2)); %# find the distance from each point to the line: %# vector between all points and first point vecFromFirst = bsxfun(@minus, allCoord, firstPoint); %# To calculate the distance to the line, we split vecFromFirst into two %# components, one that is parallel to the line and one that is perpendicular %# Then, we take the norm of the part that is perpendicular to the line and %# get the distance. %# We find the vector parallel to the line by projecting vecFromFirst onto %# the line. The perpendicular vector is vecFromFirst - vecFromFirstParallel %# We project vecFromFirst by taking the scalar product of the vector with %# the unit vector that points in the direction of the line (this gives us %# the length of the projection of vecFromFirst onto the line). If we %# multiply the scalar product by the unit vector, we have vecFromFirstParallel scalarProduct = dot(vecFromFirst, repmat(lineVecN,nPoints,1), 2); vecFromFirstParallel = scalarProduct * lineVecN; vecToLine = vecFromFirst - vecFromFirstParallel; %# distance to line is the norm of vecToLine distToLine = sqrt(sum(vecToLine.^2,2)); %# plot the distance to the line figure('Name','distance from curve to line'), plot(distToLine) %# now all you need is to find the maximum [maxDist,idxOfBestPoint] = max(distToLine); %# plot figure, plot(curve) hold on plot(allCoord(idxOfBestPoint,1), allCoord(idxOfBestPoint,2), 'or')
信息理论模型select的重点是它已经考虑了参数的数量。 因此,没有必要find一个肘,你只需要find最低限度。
find曲线的肘部只有使用合适时才有意义。 即使如此,您select的方法,select肘部在某种意义上设置参数数量的惩罚。 要select肘部,您需要最小化从原点到曲线的距离。 在距离计算中两个维度的相对加权将产生固有的惩罚项。 信息理论标准基于参数的数量和用于估计模型的数据样本的数量来设置该度量。
底线build议:使用BIC并取最小值。
首先,快速微积分评论:每个图的一阶导数f'
表示函数f
被绘制的速率在变化。 二阶导数f''
表示f'
变化的速率。 如果f''
很小,则意味着graphics以适度的速度改变方向。 但是,如果f''
很大,这意味着graphics正在迅速改变方向。
你想要在图的域上分离f''
最大的点。 这些将成为您的最佳模型的候选点。 你select的哪一点将取决于你,因为你没有具体指定你对健身和复杂性的重视程度。
如果有人需要上面Jonas发布的Matlab代码的Python版本。
import numpy as np curve = [8.4663, 8.3457, 5.4507, 5.3275, 4.8305, 4.7895, 4.6889, 4.6833, 4.6819, 4.6542, 4.6501, 4.6287, 4.6162, 4.585, 4.5535, 4.5134, 4.474, 4.4089, 4.3797, 4.3494, 4.3268, 4.3218, 4.3206, 4.3206, 4.3203, 4.2975, 4.2864, 4.2821, 4.2544, 4.2288, 4.2281, 4.2265, 4.2226, 4.2206, 4.2146, 4.2144, 4.2114, 4.1923, 4.19, 4.1894, 4.1785, 4.178, 4.1694, 4.1694, 4.1694, 4.1556, 4.1498, 4.1498, 4.1357, 4.1222, 4.1222, 4.1217, 4.1192, 4.1178, 4.1139, 4.1135, 4.1125, 4.1035, 4.1025, 4.1023, 4.0971, 4.0969, 4.0915, 4.0915, 4.0914, 4.0836, 4.0804, 4.0803, 4.0722, 4.065, 4.065, 4.0649, 4.0644, 4.0637, 4.0616, 4.0616, 4.061, 4.0572, 4.0563, 4.056, 4.0545, 4.0545, 4.0522, 4.0519, 4.0514, 4.0484, 4.0467, 4.0463, 4.0422, 4.0392, 4.0388, 4.0385, 4.0385, 4.0383, 4.038, 4.0379, 4.0375, 4.0364, 4.0353, 4.0344] nPoints = len(curve) allCoord = np.vstack((range(nPoints), curve)).T np.array([range(nPoints), curve]) firstPoint = allCoord[0] lineVec = allCoord[-1] - allCoord[0] lineVecNorm = lineVec / np.sqrt(np.sum(lineVec**2)) vecFromFirst = allCoord - firstPoint scalarProduct = np.sum(vecFromFirst * np.matlib.repmat(lineVecNorm, nPoints, 1), axis=1) vecFromFirstParallel = np.outer(scalarProduct, lineVecNorm) vecToLine = vecFromFirst - vecFromFirstParallel distToLine = np.sqrt(np.sum(vecToLine ** 2, axis=1)) idxOfBestPoint = np.argmax(distToLine)
所以解决这个问题的一种方法是用两条适合你的胳膊肘的路线。 但是由于在曲线的一个部分只有几个点(如我在评论中提到的那样),所以线拟合会受到一定的影响,除非您检测到哪些点被间隔开并在它们之间进行插值来制造更均匀的系列, 然后使用RANSACfind适合L的两条线 – 有点复杂但不是不可能的。
所以这里有一个更简单的解决scheme – 显然,由于MATLAB的缩放比例(显然),你放置的图表看起来就是这样做的。 所以我所做的只是使用比例尺信息来最小化从“原点”到你的点的距离。
请注意:原点估计可以大大改善,但是我会把它留给你。
代码如下:
%% Order curve = [8.4663 8.3457 5.4507 5.3275 4.8305 4.7895 4.6889 4.6833 4.6819 4.6542 4.6501 4.6287 4.6162 4.585 4.5535 4.5134 4.474 4.4089 4.3797 4.3494 4.3268 4.3218 4.3206 4.3206 4.3203 4.2975 4.2864 4.2821 4.2544 4.2288 4.2281 4.2265 4.2226 4.2206 4.2146 4.2144 4.2114 4.1923 4.19 4.1894 4.1785 4.178 4.1694 4.1694 4.1694 4.1556 4.1498 4.1498 4.1357 4.1222 4.1222 4.1217 4.1192 4.1178 4.1139 4.1135 4.1125 4.1035 4.1025 4.1023 4.0971 4.0969 4.0915 4.0915 4.0914 4.0836 4.0804 4.0803 4.0722 4.065 4.065 4.0649 4.0644 4.0637 4.0616 4.0616 4.061 4.0572 4.0563 4.056 4.0545 4.0545 4.0522 4.0519 4.0514 4.0484 4.0467 4.0463 4.0422 4.0392 4.0388 4.0385 4.0385 4.0383 4.038 4.0379 4.0375 4.0364 4.0353 4.0344]; x_axis = 1:numel(curve); points = [x_axis ; curve ]'; %' - SO formatting %% Get the scaling info f = figure(1); plot(points(:,1),points(:,2)); ticks = get(get(f,'CurrentAxes'),'YTickLabel'); ticks = str2num(ticks); aspect = get(get(f,'CurrentAxes'),'DataAspectRatio'); aspect = [aspect(2) aspect(1)]; close(f); %% Get the "origin" O = [x_axis(1) ticks(1)]; %% Scale the data - now the scaled values look like MATLAB''s idea of % what a good plot should look like scaled_O = O.*aspect; scaled_points = bsxfun(@times,points,aspect); %% Find the closest point del = sum((bsxfun(@minus,scaled_points,scaled_O).^2),2); [val ind] = min(del); best_ROC = [ind curve(ind)]; %% Display plot(x_axis,curve,'.-'); hold on; plot(O(1),O(2),'r*'); plot(best_ROC(1),best_ROC(2),'k*');
结果:
另外对于Fit(maximize)
曲线,您必须将其更改为[x_axis(1) ticks(end)]
。
用一种简单而直观的方式,我们可以这么说
如果我们从曲线的任意一点画出两条直线到两条曲线的终点,那么这两条直线所成的最小angular度就是所需点。
在这里,两条线可以被看成arm和肘点一样的点!
这是由Jonas在R中实现的解决scheme:
elbow_finder <- function(x_values, y_values) { # Max values to create line max_x_x <- max(x_values) max_x_y <- y_values[which.max(x_values)] max_y_y <- max(y_values) max_y_x <- x_values[which.max(y_values)] max_df <- data.frame(x = c(max_y_x, max_x_x), y = c(max_y_y, max_x_y)) # Creating straight line between the max values fit <- lm(max_df$y ~ max_df$x) # Distance from point to line distances <- c() for(i in 1:length(x_values)) { distances <- c(distances, abs(coef(fit)[2]*x_values[i] - y_values[i] + coef(fit)[1]) / sqrt(coef(fit)[2]^2 + 1^2)) } # Max distance point x_max_dist <- x_values[which.max(distances)] y_max_dist <- y_values[which.max(distances)] return(c(x_max_dist, y_max_dist)) }
双派生法。 但是,它似乎不适用于嘈杂的数据。 对于输出,您只需findd2的最大值来识别肘部。 这个实现在R.
elbow_finder <- function(x_values, y_values) { i_max <- length(x_values) - 1 # First and second derived first_derived <- list() second_derived <- list() # First derived for(i in 2:i_max){ slope1 <- (y_values[i+1] - y_values[i]) / (x_values[i+1] - x_values[i]) slope2 <- (y_values[i] - y_values[i-1]) / (x_values[i] - x_values[i-1]) slope_avg <- (slope1 + slope2) / 2 first_derived[[i]] <- slope_avg } first_derived[[1]] <- NA first_derived[[i_max+1]] <- NA first_derived <- unlist(first_derived) # Second derived for(i in 3:i_max-1){ d1 <- (first_derived[i+1] - first_derived[i]) / (x_values[i+1] - x_values[i]) d2 <- (first_derived[i] - first_derived[i-1]) / (x_values[i] - x_values[i-1]) d_avg <- (d1 + d2) / 2 second_derived[[i]] <- d_avg } second_derived[[1]] <- NA second_derived[[2]] <- NA second_derived[[i_max]] <- NA second_derived[[i_max+1]] <- NA second_derived <- unlist(second_derived) return(list(d1 = first_derived, d2 = second_derived)) }