Efficiently merge string arrays in .NET, keeping distinct values

I'm using .NET 3.5. I have two string arrays, which may share one or more values:

    string[] list1 = new string[] { "apple", "orange", "banana" };
    string[] list2 = new string[] { "banana", "pear", "grape" };

I'd like a way to merge them into one array with no duplicate values:

    { "apple", "orange", "banana", "pear", "grape" }

I can do this with LINQ:

    string[] result = list1.Concat(list2).Distinct().ToArray();

But I imagine that's not very efficient for large arrays.

Is there a better way?

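    string[] result = list1.Union(list2).ToArray();

MSDN: "This method excludes duplicates from the return set. This is different behavior to the Concat(TSource) method, which returns all the elements in the input sequences including duplicates."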
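To make the difference in the quote concrete, here is a minimal sketch using the arrays from the question. The class and method names (`UnionDemo`, `MergeDistinct`) are made up for illustration and are not part of any library.

```csharp
using System;
using System.Linq;

public static class UnionDemo
{
    // Hypothetical helper: merges two arrays, keeping the first
    // occurrence of each value (first array's items come first).
    public static string[] MergeDistinct(string[] first, string[] second)
    {
        return first.Union(second).ToArray();
    }

    public static void Main()
    {
        string[] list1 = { "apple", "orange", "banana" };
        string[] list2 = { "banana", "pear", "grape" };

        // Concat keeps the duplicate "banana", so this counts 6 items.
        Console.WriteLine(list1.Concat(list2).Count());

        // Union drops the second "banana", leaving 5 items.
        Console.WriteLine(string.Join(", ", MergeDistinct(list1, list2)));
    }
}
```

Union uses the default equality comparer here, so "Apple" and "apple" would count as different values unless a comparer is passed to the overload that takes one.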

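Why do you imagine that it would be inefficient? As far as I'm aware, both Concat and Distinct are lazily evaluated, using a HashSet behind the scenes to keep track of the elements which have already been returned.

I'm not sure how you'd manage to make it more efficient than that in a general way :)

EDIT: Distinct actually uses Set (an internal class) instead of HashSet, but the gist is still correct. This is a really good example of just how neat LINQ is. The simplest answer is pretty much as efficient as you can achieve without more domain knowledge.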

The effect is the equivalent of:

    public static IEnumerable<T> DistinctConcat<T>(IEnumerable<T> first,
        IEnumerable<T> second)
    {
        HashSet<T> returned = new HashSet<T>();
        foreach (T element in first)
        {
            if (returned.Add(element))
            {
                yield return element;
            }
        }
        foreach (T element in second)
        {
            if (returned.Add(element))
            {
                yield return element;
            }
        }
    }
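The lazy evaluation mentioned above can be observed directly. The following is only a sketch: `LazyDemo`, `Counting` and `PulledFor` are hypothetical names introduced here to count how many source elements are actually read before the requested number of distinct values is available.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LazyDemo
{
    // Number of elements read from the sources (illustration only).
    public static int Pulled;

    // Wraps a sequence and counts each element as it is requested.
    public static IEnumerable<string> Counting(IEnumerable<string> source)
    {
        foreach (string item in source)
        {
            Pulled++;
            yield return item;
        }
    }

    // Returns how many source elements had to be read to produce
    // the first `take` distinct values of first followed by second.
    public static int PulledFor(string[] first, string[] second, int take)
    {
        Pulled = 0;
        Counting(first).Concat(Counting(second)).Distinct().Take(take).ToList();
        return Pulled;
    }

    public static void Main()
    {
        string[] list1 = { "apple", "orange", "banana" };
        string[] list2 = { "banana", "pear", "grape" };

        // Only the first two elements of list1 are read; list2 is untouched.
        Console.WriteLine(PulledFor(list1, list2, 2));

        // Asking for all five distinct values reads all six elements.
        Console.WriteLine(PulledFor(list1, list2, 5));
    }
}
```

Calling ToArray() on the full query still reads everything, of course; the benefit is that no intermediate concatenated copy is built along the way.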

.NET 3.5 introduced the HashSet class, which could do this:

    IEnumerable<string> mergedDistinctList = new HashSet<string>(list1).Union(list2);

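Not sure of the performance, but it should beat the LINQ example you gave.

EDIT: I stand corrected. The lazy implementation of Concat and Distinct has a key memory and speed advantage. In my test, Concat/Distinct was about 10% faster, and it avoids making multiple copies of the data.

I confirmed this through code: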

    Setting up arrays of 3000000 strings overlapping by 300000
    Starting Hashset...
    HashSet: 00:00:02.8237616
    Starting Concat/Distinct...
    Concat/Distinct: 00:00:02.5629681

is the output of:

    // Requires: using System; using System.Diagnostics; using System.Linq;
    int num = 3000000;
    int num10Pct = (int)(num / 10);

    Console.WriteLine(String.Format(
        "Setting up arrays of {0} strings overlapping by {1}", num, num10Pct));
    string[] list1 = Enumerable.Range(1, num).Select((a) => a.ToString()).ToArray();
    // Note: the second argument of Range is a count, so list2 holds
    // num + num10Pct strings starting at num - num10Pct.
    string[] list2 = Enumerable.Range(num - num10Pct, num + num10Pct)
        .Select((a) => a.ToString()).ToArray();

    Console.WriteLine("Starting Hashset...");
    Stopwatch sw = new Stopwatch();
    sw.Start();
    string[] merged = new HashSet<string>(list1).Union(list2).ToArray();
    sw.Stop();
    Console.WriteLine("HashSet: " + sw.Elapsed);

    Console.WriteLine("Starting Concat/Distinct...");
    sw.Reset();
    sw.Start();
    string[] merged2 = list1.Concat(list2).Distinct().ToArray();
    sw.Stop();
    Console.WriteLine("Concat/Distinct: " + sw.Elapsed);

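Disclaimer: this is premature optimization. For your example arrays, use the 3.5 extension methods. Until you know you have a performance problem in this area, you should use library code.

If you can sort the arrays, or they are already sorted by the time you reach that point in the code, you can use the following methods.

These pull one item from each source and produce the "lowest" of the two, then fetch a new item from the source it came from, until both sources are exhausted. When the current items from the two sources are equal, the one from the first source is produced and both sources advance past it.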

    private static IEnumerable<T> Merge<T>(IEnumerable<T> source1,
        IEnumerable<T> source2)
    {
        return Merge(source1, source2, Comparer<T>.Default);
    }

    private static IEnumerable<T> Merge<T>(IEnumerable<T> source1,
        IEnumerable<T> source2, IComparer<T> comparer)
    {
        #region Parameter Validation

        if (Object.ReferenceEquals(null, source1))
            throw new ArgumentNullException("source1");
        if (Object.ReferenceEquals(null, source2))
            throw new ArgumentNullException("source2");
        if (Object.ReferenceEquals(null, comparer))
            throw new ArgumentNullException("comparer");

        #endregion

        using (IEnumerator<T>
            enumerator1 = source1.GetEnumerator(),
            enumerator2 = source2.GetEnumerator())
        {
            Boolean more1 = enumerator1.MoveNext();
            Boolean more2 = enumerator2.MoveNext();

            while (more1 && more2)
            {
                Int32 comparisonResult = comparer.Compare(
                    enumerator1.Current,
                    enumerator2.Current);
                if (comparisonResult < 0)
                {
                    // enumerator 1 has the "lowest" item
                    yield return enumerator1.Current;
                    more1 = enumerator1.MoveNext();
                }
                else if (comparisonResult > 0)
                {
                    // enumerator 2 has the "lowest" item
                    yield return enumerator2.Current;
                    more2 = enumerator2.MoveNext();
                }
                else
                {
                    // they're considered equivalent, only yield it once
                    yield return enumerator1.Current;
                    more1 = enumerator1.MoveNext();
                    more2 = enumerator2.MoveNext();
                }
            }

            // Yield rest of values from non-exhausted source
            while (more1)
            {
                yield return enumerator1.Current;
                more1 = enumerator1.MoveNext();
            }
            while (more2)
            {
                yield return enumerator2.Current;
                more2 = enumerator2.MoveNext();
            }
        }
    }

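Note that if one of the sources contains duplicates, you may see duplicates in the output. To remove these duplicates from an already sorted list, use the following method: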

    private static IEnumerable<T> CheapDistinct<T>(IEnumerable<T> source)
    {
        return CheapDistinct<T>(source, Comparer<T>.Default);
    }

    private static IEnumerable<T> CheapDistinct<T>(IEnumerable<T> source,
        IComparer<T> comparer)
    {
        #region Parameter Validation

        if (Object.ReferenceEquals(null, source))
            throw new ArgumentNullException("source");
        if (Object.ReferenceEquals(null, comparer))
            throw new ArgumentNullException("comparer");

        #endregion

        using (IEnumerator<T> enumerator = source.GetEnumerator())
        {
            if (enumerator.MoveNext())
            {
                T item = enumerator.Current;

                // scan until different item found, then produce
                // the previous distinct item
                while (enumerator.MoveNext())
                {
                    if (comparer.Compare(item, enumerator.Current) != 0)
                    {
                        yield return item;
                        item = enumerator.Current;
                    }
                }

                // produce last item that is left over from above loop
                yield return item;
            }
        }
    }

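Note that none of these methods use an internal data structure to keep a copy of the data, so they are cheap if the input is sorted. If you can't or won't guarantee that, you should use the 3.5 extension methods you've already found.

Here's example code that calls the above methods: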

    String[] list_1 = { "apple", "orange", "apple", "banana" };
    String[] list_2 = { "banana", "pear", "grape" };

    Array.Sort(list_1);
    Array.Sort(list_2);

    IEnumerable<String> items = Merge(
        CheapDistinct(list_1),
        CheapDistinct(list_2));
    foreach (String item in items)
        Console.Out.WriteLine(item);

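Creating a hashtable with your values as the keys (adding only those not already present) and then converting the keys to an array could be a viable solution.

You don't know which approach is faster until you measure it. The LINQ way is elegant and easy to understand.

Another way is to implement a set as a hash array (Dictionary) and add all the elements of both arrays to the set. Then use the set.Keys.ToArray() method to create the resulting array.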
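The dictionary-as-set idea described above might look roughly like the following sketch. The names `DictionaryMergeDemo` and `MergeViaDictionary` are invented for illustration, and the `bool` value type is an arbitrary placeholder since only the keys matter.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DictionaryMergeDemo
{
    // Hypothetical helper: uses the dictionary's keys as a set.
    public static string[] MergeViaDictionary(string[] first, string[] second)
    {
        Dictionary<string, bool> set = new Dictionary<string, bool>();

        // The indexer overwrites existing keys, so duplicates are harmless.
        foreach (string item in first)
            set[item] = true;
        foreach (string item in second)
            set[item] = true;

        return set.Keys.ToArray();
    }

    public static void Main()
    {
        string[] list1 = { "apple", "orange", "banana" };
        string[] list2 = { "banana", "pear", "grape" };

        foreach (string item in MergeViaDictionary(list1, list2))
            Console.WriteLine(item);
    }
}
```

Unlike Union, the order in which a Dictionary returns its keys is not specified, so sort the result if order matters. On .NET 3.5, HashSet offers the same behavior without the dummy values.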
