为什么在哪里和select性能只是select？

我有一个class，就像这样：

public class MyClass { public int Value { get; set; } public bool IsValid { get; set; } }

_{实际上，它的规模要大得多，但是这又重现了这个问题（古怪）。}

我想获得实例有效的Value的总和。到目前为止，我已经find了两个解决scheme。

第一个是这样的：

 int result = myCollection.Where(mc => mc.IsValid).Select(mc => mc.Value).Sum();

然而，第二个是这样的：

 int result = myCollection.Select(mc => mc.IsValid ? mc.Value : 0).Sum();

我想要得到最有效的方法。起初，我认为第二个会更有效率。那么我的理论部分就开始了：“一个是O（n + m + m），另一个是O（n + n），第一个应该用更多的伤害执行，第二个应该更好less“。我认为他们会performance平等。编辑：然后@马丁指出，在哪里和select合并，所以它应该实际上是O（M + N）。但是，如果你看下面，这似乎是不相关的。

所以我把它做了testing。

_{（这是100多行，所以我认为最好把它作为一个主旨。）}
结果是…有趣的。

_{与0％的联系公差：}

这个尺度是有利于Select和Where ，约30点。

How much do you want to be the disambiguation percentage? 0 Starting benchmarking. Ties: 0 Where + Select: 65 Select: 36

_{与2％的领带宽容：}

这是一样的，除了一些他们在2％以内。我会说这是一个最小的误差范围。 Select和现在只有一个20点的领先。

How much do you want to be the disambiguation percentage? 2 Starting benchmarking. Ties: 6 Where + Select: 58 Select: 37

_{与5％的领带宽容：}

这就是我所说的最大误差。它使Select稍微好一点，但不是太多。

How much do you want to be the disambiguation percentage? 5 Starting benchmarking. Ties: 17 Where + Select: 53 Select: 31

_{有10％的联系公差：}

这是我的错误的边缘，但我仍然对结果感兴趣。因为它给了Select和Where二十点领先它已经有一段时间了。

How much do you want to be the disambiguation percentage? 10 Starting benchmarking. Ties: 36 Where + Select: 44 Select: 21

_{25％的领带宽度：}

这是我错误的方式，但我仍然对结果感兴趣，因为Select和Where 仍然（几乎）保持他们20点的领先优势。它似乎是在一个明显的less数几个，这就是给它的领先。

How much do you want to be the disambiguation percentage? 25 Starting benchmarking. Ties: 85 Where + Select: 16 Select: 0

现在，我猜测，20分的领先优势来自中线，他们都必须在相同的performance中取得领先。我可以尝试并logging下来，但这将是一个信息的完整载入。我想，图表会更好。

这就是我所做的。

选择vs选择和地点。

这表明， Select线保持稳定（预期）和Select + Where线上升（预期）。然而，让我感到困惑的是为什么它不符合50或更早的Select ：事实上，我期待早于50，因为必须为Select和Where创build额外的枚举器。我的意思是，这显示了20分的领先优势，但是这并不能解释为什么。我猜，这是我的问题的主要观点。

为什么它的行为如此呢？我应该相信吗？如果不是，我应该使用另一个还是这个？

正如@KingKong在评论中提到的那样，你也可以使用Sum的重载，它需要一个lambda。所以我的两个选项现在变成这样：

第一：

 int result = myCollection.Where(mc => mc.IsValid).Sum(mc => mc.Value);

第二：

 int result = myCollection.Sum(mc => mc.IsValid ? mc.Value : 0);

我会把它缩短一点，但是：

How much do you want to be the disambiguation percentage? 0 Starting benchmarking. Ties: 0 Where: 60 Sum: 41 How much do you want to be the disambiguation percentage? 2 Starting benchmarking. Ties: 8 Where: 55 Sum: 38 How much do you want to be the disambiguation percentage? 5 Starting benchmarking. Ties: 21 Where: 49 Sum: 31 How much do you want to be the disambiguation percentage? 10 Starting benchmarking. Ties: 39 Where: 41 Sum: 21 How much do you want to be the disambiguation percentage? 25 Starting benchmarking. Ties: 85 Where: 16 Sum: 0

二十分的领先地位仍然在那里，这意味着它不必和评论中的@Marcin指出的Where和Select组合在一起。

_{感谢您阅读我的文字墙！} _{另外，如果你有兴趣，这里是Excel的CSV格式的修改版本。}

在整个集合上Select迭代一次，并为每个项目执行条件分支（检查有效性）和+操作。

Where+Select创build一个跳过无效元素的迭代器（不会yield它们），仅在有效项目上执行+ 。

所以， Select的成本是：

t(s) = n * ( cost(check valid) + cost(+) )

而在Where+Select ：

t(ws) = n * ( cost(check valid) + p(valid) * (cost(yield) + cost(+)) )

哪里：

p(valid)是列表中的项目p(valid)的概率。
cost(check valid)是检查有效性的分支的成本
cost(yield)是构buildwhere迭代器的新状态的成本，它比Select版本使用的简单迭代器更复杂。

正如你所看到的，对于给定的n ， Select版本是一个常量，而Where+Select版本是一个以p(valid)作为variables的线性方程。成本的实际值决定了两条线的交点，由于cost(yield)可以不同于cost(+) ，它们不一定在p(valid) = 0.5时相交。

以下是对导致时序差异的深入解释。

IEnumerable<int>的Sum()函数如下所示：

 public static int Sum(this IEnumerable<int> source) { int sum = 0; foreach(int item in source) { sum += item; } return sum; }

在C＃中， foreach只是.Net版本的迭代器IEnumerator<T>语法糖（不要与IEnumerable<T>混淆） 。所以上面的代码实际上是这样翻译的：

 public static int Sum(this IEnumerable<int> source) { int sum = 0; IEnumerator<int> iterator = source.GetEnumerator(); while(iterator.MoveNext()) { int item = iterator.Current; sum += item; } return sum; }

请记住，你正在比较的两行代码如下所示

 int result1 = myCollection.Where(mc => mc.IsValid).Sum(mc => mc.Value); int result2 = myCollection.Sum(mc => mc.IsValid ? mc.Value : 0);

现在这里是踢球者：

LINQ使用延迟执行。因此，尽pipe可能看起来 result1迭代了两次， 但实际上它只迭代一次。 在Sum()期间，在MoveNext()的调用中实际应用Where()条件（这可能要归功于yield return的魔力） 。

这意味着，对于result1 ， while循环中的代码，

 { int item = iterator.Current; sum += item; }

对于mc.IsValid == true每个项目只执行一次。通过比较， result2将为集合中的每个项目执行该代码。这就是为什么result1通常更快。

（虽然，请注意，在MoveNext()中调用Where()条件仍然有一些小开销，所以如果大部分/所有项都有mc.IsValid == true ， result2实际上会更快！）

希望现在明白为什么result2通常比较慢。现在我想解释为什么我在评论中说这些LINQ性能比较并不重要 。

创build一个LINQexpression式是很便宜的。调用委托函数很便宜。分配和循环迭代器很便宜。但是不这样做更便宜。因此，如果你发现LINQ语句是你的程序中的瓶颈，那么根据我的经验，如果没有使用LINQ，它总会比任何LINQ方法更快。

所以，你的LINQ工作stream应该是这样的：

无处不在使用LINQ。
简介。
如果分析器说LINQ是造成瓶颈的原因，那么不用LINQ重写那段代码。

幸运的是，LINQ瓶颈是罕见的。哎呀，瓶颈是罕见的。在过去的几年中我写了数百条LINQ语句，最终取代<1％。而其中大部分是由于LINQ2EF糟糕的SQL优化，而不是LINQ的错误。

所以，像往常一样，首先写清楚明智的代码，然后等待，直到你被分析后，担心微观优化。

有趣的事情。你知道Sum(this IEnumerable<TSource> source, Func<TSource, int> selector)定义的？ 它使用Select方法！

 public static int Sum<TSource>(this IEnumerable<TSource> source, Func<TSource, int> selector) { return source.Select(selector).Sum(); }

所以实际上，这一切都应该工作几乎相同。我自己做了快速的研究，结果如下：

 Where -- mod: 1 result: 0, time: 371 ms WhereSelect -- mod: 1 result: 0, time: 356 ms Select -- mod: 1 result 0, time: 366 ms Sum -- mod: 1 result: 0, time: 363 ms ------------- Where -- mod: 2 result: 4999999, time: 469 ms WhereSelect -- mod: 2 result: 4999999, time: 429 ms Select -- mod: 2 result 4999999, time: 362 ms Sum -- mod: 2 result: 4999999, time: 358 ms ------------- Where -- mod: 3 result: 9999999, time: 441 ms WhereSelect -- mod: 3 result: 9999999, time: 452 ms Select -- mod: 3 result 9999999, time: 371 ms Sum -- mod: 3 result: 9999999, time: 380 ms ------------- Where -- mod: 4 result: 7500000, time: 571 ms WhereSelect -- mod: 4 result: 7500000, time: 501 ms Select -- mod: 4 result 7500000, time: 406 ms Sum -- mod: 4 result: 7500000, time: 397 ms ------------- Where -- mod: 5 result: 7999999, time: 490 ms WhereSelect -- mod: 5 result: 7999999, time: 477 ms Select -- mod: 5 result 7999999, time: 397 ms Sum -- mod: 5 result: 7999999, time: 394 ms ------------- Where -- mod: 6 result: 9999999, time: 488 ms WhereSelect -- mod: 6 result: 9999999, time: 480 ms Select -- mod: 6 result 9999999, time: 391 ms Sum -- mod: 6 result: 9999999, time: 387 ms ------------- Where -- mod: 7 result: 8571428, time: 489 ms WhereSelect -- mod: 7 result: 8571428, time: 486 ms Select -- mod: 7 result 8571428, time: 384 ms Sum -- mod: 7 result: 8571428, time: 381 ms ------------- Where -- mod: 8 result: 8749999, time: 494 ms WhereSelect -- mod: 8 result: 8749999, time: 488 ms Select -- mod: 8 result 8749999, time: 386 ms Sum -- mod: 8 result: 8749999, time: 373 ms ------------- Where -- mod: 9 result: 9999999, time: 497 ms WhereSelect -- mod: 9 result: 9999999, time: 494 ms Select -- mod: 9 result 9999999, time: 386 ms Sum -- mod: 9 result: 9999999, time: 371 ms

对于以下实现：

 result = source.Where(x => x.IsValid).Sum(x => x.Value); result = source.Select(x => x.IsValid ? x.Value : 0).Sum(); result = source.Sum(x => x.IsValid ? x.Value : 0); result = source.Where(x => x.IsValid).Select(x => x.Value).Sum();

mod表示：来自mod项目的每1个无效：对于mod == 1每个项目无效，对于mod == 2奇数项目无效等。集合包含10000000项目。

在这里输入图像说明

和100000000项目收集的结果：

在这里输入图像说明

正如你所看到的， Select和Sum结果在所有的mod值都是非常一致的。然而where和where select不是。

我的猜测是，与Where的版本筛选出0，他们不是总和的主题（即你不执行加法）。这当然是一个猜测，因为我不能解释如何执行额外的lambdaexpression式和调用多个方法胜过简单的加0。

我的一位朋友提出，由于溢出检查，总和中的0可能导致严重的性能损失。看看这将如何在未经检查的情况下执行将是有趣的。

运行下面的示例，我很清楚，Where + Select可以胜过Select的唯一时间实际上是在列表中潜在项目丢弃了一大笔钱（大约一半在我的非正式testing中）。在下面的小例子中，当从10mil跳过大约4mil的物品时，两个样本中的数字大致相同。我在发行版中运行，并重新执行where + select和select的结果。

 static void Main(string[] args) { int total = 10000000; Random r = new Random(); var list = Enumerable.Range(0, total).Select(i => r.Next(0, 5)).ToList(); for (int i = 0; i < 4000000; i++) list[i] = 10; var sw = new Stopwatch(); sw.Start(); int sum = 0; sum = list.Where(i => i < 10).Select(i => i).Sum(); sw.Stop(); Console.WriteLine(sw.ElapsedMilliseconds); sw.Reset(); sw.Start(); sum = list.Select(i => i).Sum(); sw.Stop(); Console.WriteLine(sw.ElapsedMilliseconds); }

如果你需要速度，只是做一个简单的循环可能是你最好的select。这样做往往比foreach更好（假设你的collections是随机访问当然）。

以下是我有10％的元素无效的时机：

 Where + Select + Sum: 257 Select + Sum: 253 foreach: 111 for: 61

而且有90％的无效元素：

 Where + Select + Sum: 177 Select + Sum: 247 foreach: 105 for: 58

这是我的基准代码

 public class MyClass { public int Value { get; set; } public bool IsValid { get; set; } } class Program { static void Main(string[] args) { const int count = 10000000; const int percentageInvalid = 90; var rnd = new Random(); var myCollection = new List<MyClass>(count); for (int i = 0; i < count; ++i) { myCollection.Add( new MyClass { Value = rnd.Next(0, 50), IsValid = rnd.Next(0, 100) > percentageInvalid } ); } var sw = new Stopwatch(); sw.Restart(); int result1 = myCollection.Where(mc => mc.IsValid).Select(mc => mc.Value).Sum(); sw.Stop(); Console.WriteLine("Where + Select + Sum:\t{0}", sw.ElapsedMilliseconds); sw.Restart(); int result2 = myCollection.Select(mc => mc.IsValid ? mc.Value : 0).Sum(); sw.Stop(); Console.WriteLine("Select + Sum:\t\t{0}", sw.ElapsedMilliseconds); Debug.Assert(result1 == result2); sw.Restart(); int result3 = 0; foreach (var mc in myCollection) { if (mc.IsValid) result3 += mc.Value; } sw.Stop(); Console.WriteLine("foreach:\t\t{0}", sw.ElapsedMilliseconds); Debug.Assert(result1 == result3); sw.Restart(); int result4 = 0; for (int i = 0; i < myCollection.Count; ++i) { var mc = myCollection[i]; if (mc.IsValid) result4 += mc.Value; } sw.Stop(); Console.WriteLine("for:\t\t\t{0}", sw.ElapsedMilliseconds); Debug.Assert(result1 == result4); } }

顺便说一下，我同意斯蒂尔加的猜测：你的两个案件的相对速度取决于无效项目的百分比，这仅仅是因为在“Where”情况下，工作总量需要做的量有所不同。

我觉得MarcinJuraszek的结果和It'NotALie的不同是有趣的。尤其是，Marcin Jurazek的结果始于同一地点的所有四个实现，而不是中间的结果。我将从源头上解释这是如何工作的。

让我们假设有n总元素和m有效元素。

Sum函数非常简单。它只是遍历枚举数： http : //typedescriptor.net/browse/members/367300-System.Linq.Enumerable.Sum (IEnumerable%601）

为了简单起见，我们假设集合是一个列表。 Select和WhereSelect都将创build一个WhereSelectListIterator 。这意味着生成的实际迭代器是相同的。在这两种情况下，都有一个循环遍历迭代器， WhereSelectListIterator 。迭代器最有趣的部分是MoveNext方法。

由于迭代器是相同的，循环是相同的。唯一的区别是在循环的主体。

这些lambda的身体有非常相似的费用。 where子句返回一个字段值，而三元谓词也返回一个字段值。 select子句返回一个字段值，而三元运算符的两个分支返回一个字段值或一个常量。组合的select子句将分支作为三元运算符，但WhereSelect使用MoveNext的分支。

但是，所有这些操作都相当便宜。到目前为止，最昂贵的操作是分支机构，那里的错误预测会花费我们。

另一个昂贵的操作是Invoke 。正如Branko Dimitrijevic所示，调用一个函数比添加一个值要花费更多的时间。

另外，在Sum中也检查了累计。如果处理器没有算术溢出标志，那么检查也是昂贵的。

因此，有趣的成本是：是：

（ n + m ）*调用+ m * checked+=
n *调用+ n * checked+=

因此，如果调用成本比检查累积成本高得多，则情况2总是更好。如果它们差不多，那么当大约一半的元素是有效的时候，我们将会看到平衡。

它看起来像在MarcinJuraszek的系统上，检查+ =具有微不足道的成本，但是在它不是和Branko Dimitrijevic的系统中，检查+ =具有显着的成本。看起来这是它的系统中最昂贵的，因为盈亏平衡点要高得多。看起来没有任何人从系统中发布结果，其中积累的成本远远超过了Invoke。

我不打算通过描述来解释，而是采取更多的math方法。

鉴于下面的代码应该与LINQ在内部做的近似，相对成本如下：
仅select： Nd + Na
其中+select： Nd + Md + Ma

为了弄清楚他们将要跨越的点，我们需要做一个小代数：
Nd + Md + Ma = Nd + Na => M(d + a) = Na => (M/N) = a/(d+a)

这意味着为了使拐点达到50％，委托调用的成本必须大体上与加法的成本相同。既然我们知道实际的拐点在60％左右，我们可以倒退，并确定委托调用@ It'sNotALie的成本实际上是一个令人吃惊的额外成本的2/3，但这就是他的数字说。

 static void Main(string[] args) { var set = Enumerable.Range(1, 10000000) .Select(i => new MyClass {Value = i, IsValid = i%2 == 0}) .ToList(); Func<MyClass, int> select = i => i.IsValid ? i.Value : 0; Console.WriteLine( Sum( // Cost: N additions Select(set, select))); // Cost: N delegate // Total cost: N * (delegate + addition) = Nd + Na Func<MyClass, bool> where = i => i.IsValid; Func<MyClass, int> wSelect = i => i.Value; Console.WriteLine( Sum( // Cost: M additions Select( // Cost: M delegate Where(set, where), // Cost: N delegate wSelect))); // Total cost: N * delegate + M * (delegate + addition) = Nd + Md + Ma } // Cost: N delegate calls static IEnumerable<T> Where<T>(IEnumerable<T> set, Func<T, bool> predicate) { foreach (var mc in set) { if (predicate(mc)) { yield return mc; } } } // Cost: N delegate calls static IEnumerable<int> Select<T>(IEnumerable<T> set, Func<T, int> selector) { foreach (var mc in set) { yield return selector(mc); } } // Cost: N additions static int Sum(IEnumerable<int> set) { unchecked { var sum = 0; foreach (var i in set) { sum += i; } return sum; } }

为什么在哪里和select性能只是select？

第一个是这样的：

然而，第二个是这样的：

所以我把它做了testing。

_{与0％的联系公差：}

_{与2％的领带宽容：}

_{与5％的领带宽容：}

_{有10％的联系公差：}

_{25％的领带宽度：}

为什么它的行为如此呢？我应该相信吗？如果不是，我应该使用另一个还是这个？

第一：

第二：

在UI中执行业务逻辑的unit testing数据库应用程序

我应该总是返回IEnumerable <T>而不是IList <T>？

使用entity framework6和SQLite的问题

如何尝试将string转换为Guid

如何实际执行五项规则？

如何将Linux可执行文件添加到.gitignore？

使用“自动”types演绎 – 如何找出编译器推断的types？

如何在string中写一个反斜杠（\）？

最好和最短的方式来评估mathexpression式

Windows服务模板缺失？

为什么在哪里和select性能只是select？

第一个是这样的：

然而，第二个是这样的：

所以我把它做了testing。

与0％的联系公差：

与2％的领带宽容：

与5％的领带宽容：

有10％的联系公差：

25％的领带宽度：

为什么它的行为如此呢？ 我应该相信吗？ 如果不是，我应该使用另一个还是这个？

第一：

第二：

在UI中执行业务逻辑的unit testing数据库应用程序

我应该总是返回IEnumerable <T>而不是IList <T>？

使用entity framework6和SQLite的问题

如何尝试将string转换为Guid

如何实际执行五项规则？

如何将Linux可执行文件添加到.gitignore？

使用“自动”types演绎 – 如何找出编译器推断的types？

如何在string中写一个反斜杠（\）？

最好和最短的方式来评估mathexpression式

Windows服务模板缺失？

_{与0％的联系公差：}

_{与2％的领带宽容：}

_{与5％的领带宽容：}

_{有10％的联系公差：}

_{25％的领带宽度：}

为什么它的行为如此呢？我应该相信吗？如果不是，我应该使用另一个还是这个？