在string集合中search最快的方法

问题：

我有一个约12万个用户（string）的文本文件，我想存储在一个集合中，然后在该集合上执行search。

每次用户更改文本TextBox的文本时都会发生search方法，并且结果应该是包含 TextBox本的string。

我不必更改列表，只需将结果ListBox到列表框中即可。

我到目前为止所尝试的是：

我尝试了两个不同的集合/容器，我从一个外部文本文件（当然是一次）转储string条目：

List<string> allUsers;
HashSet<string> allUsers;

通过以下的LINQ查询：

allUsers.Where(item => item.Contains(textBox_search.Text)).ToList();

我的search事件（用户更改search文本时触发）：

 private void textBox_search_TextChanged(object sender, EventArgs e) { if (textBox_search.Text.Length > 2) { listBox_choices.DataSource = allUsers.Where(item => item.Contains(textBox_search.Text)).ToList(); } else { listBox_choices.DataSource = null; } }

结果：

两者都给了我一个很差的响应时间（每个按键之间大约1-3秒）。

题：

你认为我的瓶颈在哪里？我用过的集合？ search方法？都？

我怎样才能获得更好的性能和更stream畅的function？

您可以考虑在后台线程上执行过滤任务，在完成后调用callback方法，或者如果更改input，则只需重新启动过滤。

总的想法是能够像这样使用它：

 public partial class YourForm : Form { private readonly BackgroundWordFilter _filter; public YourForm() { InitializeComponent(); // setup the background worker to return no more than 10 items, // and to set ListBox.DataSource when results are ready _filter = new BackgroundWordFilter ( items: GetDictionaryItems(), maxItemsToMatch: 10, callback: results => this.Invoke(new Action(() => listBox_choices.DataSource = results)) ); } private void textBox_search_TextChanged(object sender, EventArgs e) { // this will update the background worker's "current entry" _filter.SetCurrentEntry(textBox_search.Text); } }

粗略的草图会是这样的：

 public class BackgroundWordFilter : IDisposable { private readonly List<string> _items; private readonly AutoResetEvent _signal = new AutoResetEvent(false); private readonly Thread _workerThread; private readonly int _maxItemsToMatch; private readonly Action<List<string>> _callback; private volatile bool _shouldRun = true; private volatile string _currentEntry = null; public BackgroundWordFilter( List<string> items, int maxItemsToMatch, Action<List<string>> callback) { _items = items; _callback = callback; _maxItemsToMatch = maxItemsToMatch; // start the long-lived backgroud thread _workerThread = new Thread(WorkerLoop) { IsBackground = true, Priority = ThreadPriority.BelowNormal }; _workerThread.Start(); } public void SetCurrentEntry(string currentEntry) { // set the current entry and signal the worker thread _currentEntry = currentEntry; _signal.Set(); } void WorkerLoop() { while (_shouldRun) { // wait here until there is a new entry _signal.WaitOne(); if (!_shouldRun) return; var entry = _currentEntry; var results = new List<string>(); // if there is nothing to process, // return an empty list if (string.IsNullOrEmpty(entry)) { _callback(results); continue; } // do the search in a for-loop to // allow early termination when current entry // is changed on a different thread foreach (var i in _items) { // if matched, add to the list of results if (i.Contains(entry)) results.Add(i); // check if the current entry was updated in the meantime, // or we found enough items if (entry != _currentEntry || results.Count >= _maxItemsToMatch) break; } if (entry == _currentEntry) _callback(results); } } public void Dispose() { // we are using AutoResetEvent and a background thread // and therefore must dispose it explicitly Dispose(true); } private void Dispose(bool disposing) { if (!disposing) return; // shutdown the thread if (_workerThread.IsAlive) { _shouldRun = false; _currentEntry = null; _signal.Set(); _workerThread.Join(); } // if targetting .NET 3.5 or older, we have to // use the explicit IDisposable implementation (_signal as IDisposable).Dispose(); } }

另外，当处理父Form时，实际应该处理_filter实例。这意味着你应该打开并编辑你的Form的Dispose方法（在YourForm.Designer.cs文件中），看起来像这样：

 // inside "xxxxxx.Designer.cs" protected override void Dispose(bool disposing) { if (disposing) { if (_filter != null) _filter.Dispose(); // this part is added by Visual Studio designer if (components != null) components.Dispose(); } base.Dispose(disposing); }

在我的机器上，它的运行速度非常快，所以在进行更复杂的解决scheme之前，您应该testing并分析这个问题。

也就是说，一个“更复杂的解决scheme”可能是将最后几个结果存储在一个字典中，然后只有当新条目只有最后一个字符的第一个字符不同时，才对它们进行过滤。

我做了一些testing，search一个120,000个项目的列表，并且填写一个新的列表，这些项目所花费的时间可以忽略不计（即使所有string都匹配了大约1/50秒）。

因此，您所看到的问题必须来自数据源的填充：

 listBox_choices.DataSource = ...

我怀疑你只是把太多的项目放入列表框。

也许你应该试着将它限制在前20条，像这样：

 listBox_choices.DataSource = allUsers.Where(item => item.Contains(textBox_search.Text)) .Take(20).ToList();

还要注意（正如其他人指出的那样），您正在访问所有用户中每个项目的TextBox.Text属性。这可以很容易地修复如下：

 string target = textBox_search.Text; listBox_choices.DataSource = allUsers.Where(item => item.Contains(target)) .Take(20).ToList();

但是，我计算了访问TextBox.Text需要花费多less时间，花了0.7万秒，远远less于OP中提到的1 – 3秒。不过，这是一个有价值的优化。

使用后缀树作为索引。或者只是build立一个有序的字典，将每个名字的每个后缀与相应名字的列表关联起来。

对于input：

 Abraham Barbara Abram

结构看起来像：

 a -> Barbara ab -> Abram abraham -> Abraham abram -> Abram am -> Abraham, Abram aham -> Abraham ara -> Barbara arbara -> Barbara bara -> Barbara barbara -> Barbara bram -> Abram braham -> Abraham ham -> Abraham m -> Abraham, Abram raham -> Abraham ram -> Abram rbara -> Barbara

searchalgorithm

假设用户input“胸罩”。

将用户input中的词典对分，以查找用户input或其可能的位置。这样我们发现“巴巴拉” – 最后一个关键低于“胸罩”。它被称为“文胸”的下界。 search将需要对数时间。
从find的密钥开始迭代，直到用户input不再匹配。这会给“bram” – >亚伯兰和“braham” – >亚伯拉罕。
将迭代结果（Abram，Abraham）连接并输出。

这样的树被devise用于快速search子串。它的性能接近O（log n）。我相信这种方法可以快速的被GUI线程直接使用。而且，由于没有同步开销，它将工作得更快，然后线程化解决scheme。

你需要一个文本search引擎（如Lucene.Net ）或数据库（你可能会考虑像SQL CE ， SQLite等embedded式的）。换句话说，你需要一个索引search。基于散列的search在这里不适用，因为您search子string，而基于散列的search很适合search确切的值。

否则，这将是一个循环遍历集合的迭代search。

有一个“debounce”types的事件也许是有用的。这与限制的区别在于它在等待一段时间（例如，200毫秒）之后才能完成更改以在事件发生之前完成。

见Debounce and Throttle：一个视觉解释关于反弹的更多信息。我明白，这篇文章是JavaScript的重点，而不是C＃，但是这个原则适用。

这样做的好处是，当你还在input查询时，它不会search。它应该停止尝试一次执行两次search。

更新：

我做了一些分析。

（更新3）

列表内容：从0到2.499.999生成的数字
过滤文本：123（20.477结果）
Core i5-2500，Win7 64bit，8GB RAM
VS2012 + JetBrains dotTrace

2.500.000logging的最初testing运行花了我20.000ms。

第一个罪魁祸首是对Contains textBox_search.Text的调用。这会调用每个元素到文本框的昂贵的get_WindowText方法。只需将代码更改为：

  var text = textBox_search.Text; listBox_choices.DataSource = allUsers.Where(item => item.Contains(text)).ToList();

将执行时间缩短到1.858ms 。

更新2：

另外两个重要的瓶颈现在是对string.Contains （约45％的执行时间）和set_Datasource （30％）中列表框元素的更新的set_Datasource 。

我们可以在速度和内存使用之间进行权衡，方法是创build一个后缀树，因为Basilevsbuild议减less必要的比较次数，并在按下按键后从search中将一些处理时间推到从文件加载名称对用户可能是优选的。

为了提高将元素加载到列表框中的性能，我build议仅加载前几个元素，并向用户指示还有其他元素可用。通过这种方式，您可以向用户提供可用结果的反馈，以便通过input更多字母来优化search结果，或者只需按一下button即可加载完整列表。

使用BeginUpdate和EndUpdate并没有改变set_Datasource的执行时间。

~~正如其他人在这里指出的，LINQ查询本身运行得非常快。~~ ~~我相信你的瓶颈是更新列表框本身。~~ ~~你可以尝试像这样：~~

 if (textBox_search.Text.Length > 2) { listBox_choices.BeginUpdate(); listBox_choices.DataSource = allUsers.Where(item => item.Contains(textBox_search.Text)).ToList(); listBox_choices.EndUpdate(); }

我希望这有帮助。

在另一个线程上运行search，并在该线程正在运行时显示一些加载animation或进度条。

您也可以尝试并行化LINQ查询。

 var queryResults = strings.AsParallel().Where(item => item.Contains("1")).ToList();

这是一个基准，certificate了AsParallel（）的性能优势：

 { IEnumerable<string> queryResults; bool useParallel = true; var strings = new List<string>(); for (int i = 0; i < 2500000; i++) strings.Add(i.ToString()); var stp = new Stopwatch(); stp.Start(); if (useParallel) queryResults = strings.AsParallel().Where(item => item.Contains("1")).ToList(); else queryResults = strings.Where(item => item.Contains("1")).ToList(); stp.Stop(); Console.WriteLine("useParallel: {0}\r\nTime Elapsed: {1}", useParallel, stp.ElapsedMilliseconds); }

假设你只是通过前缀匹配，你正在寻找的数据结构被称为trie ，也被称为“前缀树”。您现在使用的IEnumerable.Where方法将不得不在每次访问时遍历字典中的所有项目。

这个线程显示了如何在C＃中创build一个trie。

WinForms ListBox控件真的是你的敌人在这里。加载logging的速度会很慢，ScrollBar会打击你显示所有120,000条logging。

尝试使用一个旧的DataGridView数据来源与DataTable与单个[用户名]列保存您的数据：

 private DataTable dt; public Form1() { InitializeComponent(); dt = new DataTable(); dt.Columns.Add("UserName"); for (int i = 0; i < 120000; ++i){ DataRow dr = dt.NewRow(); dr[0] = "user" + i.ToString(); dt.Rows.Add(dr); } dgv.AutoSizeColumnsMode = DataGridViewAutoSizeColumnsMode.Fill; dgv.AllowUserToAddRows = false; dgv.AllowUserToDeleteRows = false; dgv.RowHeadersVisible = false; dgv.DataSource = dt; }

然后在TextBox的TextChanged事件中使用DataView来过滤数据：

 private void textBox1_TextChanged(object sender, EventArgs e) { DataView dv = new DataView(dt); dv.RowFilter = string.Format("[UserName] LIKE '%{0}%'", textBox1.Text); dgv.DataSource = dv; }

首先，我将更改ListControl如何查看数据源，将结果IEnumerable<string>转换为List<string> 。特别是当你只input几个字符，这可能是低效的（和不需要的）。 不要做大量的数据拷贝 。

我会将.Where()结果包装到仅实现IList （search）所要求的集合中。这样可以节省您为每个input的字符创build一个新的大列表。
作为替代scheme，我会避免LINQ，我会写一些更具体的（和优化）。保持你的列表在内存中，并build立一个匹配的索引数组，重用数组，所以你不必为每个search重新分配它。

第二步就是不要在大的列表中search小的就足够了。当用户开始input“ab”，并添加“c”，则不需要在大列表中进行研究，在已过滤列表中search就足够了（而且更快）。 细化search每次都是可能的，每次都不要执行全search。

第三步可能会更困难： 保持数据的组织，以便快速search 。现在你必须改变你用来存储数据的结构。想象这样一棵树：

 ABC
 添加更好的Ceil
 骨轮廓以上

这可以简单地用一个数组来实现（如果你正在使用ANSI名称，否则字典会更好）。像这样构build列表（插图的目的，它匹配string的开头）：

 var dictionary = new Dictionary<char, List<string>>(); foreach (var user in users) { char letter = user[0]; if (dictionary.Contains(letter)) dictionary[letter].Add(user); else { var newList = new List<string>(); newList.Add(user); dictionary.Add(letter, newList); } }

search将使用第一个字符完成：

 char letter = textBox_search.Text[0]; if (dictionary.Contains(letter)) { listBox_choices.DataSource = new MyListWrapper(dictionary[letter].Where(x => x.Contains(textBox_search.Text))); }

请注意，我使用了MyListWrapper()正如第一步中所build议的那样（但为了简洁起见，我忽略了第二个build议，如果您select正确的字典大小，您可以保持每个列表的简短和快速 – 也许 – 避免其他）。此外请注意，您可能会尝试使用您的字典的前两个字符（更多列表和更短）。如果你延长这一点，你会有一棵树（但我不认为你有这么多的项目）。

有很多不同的stringsearchalgorithm （有相关的数据结构），只是提到几个：

基于有限状态自动机的search ：在这种方法中，我们通过构造一个确定性的有限自动机（DFA）来避免回溯。这些build筑很昂贵 – 通常是使用电力设施build造的，但使用起来非常快。
存根：Knuth-Morris-Pratt计算一个DFA，以string的forms识别input作为后缀，Boyer-Moore从针的末端开始search，所以它通常可以在每一步都跳到一个完整的针长度。 Baeza-Yates跟踪前面的j个字符是否是searchstring的前缀，因此适用于模糊stringsearch。 bitapalgorithm是Baeza-Yates方法的一个应用。
索引方法 ：更快的searchalgorithm是基于文本的预处理。在build立子串索引（例如后缀树或后缀数组）之后，可以快速find模式的出现。
其他变体 ：一些search方法，例如三字母search，旨在findsearchstring和文本之间的“亲密度”分数，而不是“匹配/不匹配”。这些有时被称为“模糊”search。

很less有关于并行search的文字。这是可能的，但它很less微不足道的，因为开销并行可以很容易地高于search本身。我不会并行执行search（分区和同步将变得很快，也许很复杂），但我会将search移到一个单独的线程 。如果主线程不忙，用户在input时不会感觉到任何延迟（如果在200毫秒之后列表出现，他们将不会注意到，但是如果在input之后等待50毫秒，他们会感到不舒服）。当然，search本身必须足够快，在这种情况下，您不要使用线程来加速search，而是保持您的UI响应 。请注意，一个单独的线程不会让你的查询更快 ，它不会挂起UI，但是如果你的查询很慢，它将在一个单独的线程中缓慢（此外，你也必须处理多个连续的请求）。

你可以尝试使用PLINQ （并行LINQ）。虽然这不保证提高速度，但你需要通过反复试验找出答案。

我怀疑你可以做得更快，但是肯定你应该：

a）使用AsParallel LINQ扩展方法

a）使用某种计时器来延迟过滤

b）在另一个线程上放置一个过滤方法

在某处保留某种string previousTextBoxValue 。做一个延迟1000毫秒的计时器，如果previousTextBoxValue和textbox.Text值相同，就会触发tick。如果不是，则将previousTextBoxValue重新分配给当前值并重置计时器。将计时器开始设置为文本框更改的事件，它将使您的应用程序更stream畅。在1-3秒内过滤120,000条logging是可以的，但是您的用户界面必须保持响应。

你也可以尝试使用BindingSource.Filter函数。我已经使用它，它像一个魅力，从一堆logging过滤，每次更新此属性与正在search的文本。另一种select是使用AutoCompleteSource进行TextBox控制。

希望能帮助到你！

我会尝试sorting收集，search匹配只有开始部分和限制search一些数字。

所以在ininialization

 allUsers.Sort();

并search

 allUsers.Where(item => item.StartWith(textBox_search.Text))

也许你可以添加一些caching。

使用并行LINQ 。 PLINQ是LINQ to Objects的并行实现。 PLINQ实现了全套LINQ标准查询操作符作为T：System.Linq命名空间的扩展方法，并且具有用于并行操作的附加操作符。 PLINQ将LINQ语法的简单性和可读性与并行编程的强大function相结合。就像针对任务并行库的代码一样，PLINQ查询根据主机的能力来调整并发程度。

PLINQ简介

了解PLINQ中的加速

你也可以使用Lucene.Net

Lucene.Net是Lucenesearch引擎库的一个端口，用C＃编写，面向.NET运行时用户。 Lucenesearch库是基于倒排索引的。 Lucene.Net有三个主要目标：

根据我所看到的，我同意把这份名单sorting。

但是，在列表构build时进行sorting将非常缓慢，在构build时进行sorting，将会有更好的执行时间。

否则，如果您不需要显示列表或保留顺序，请使用散列表。

哈希映射将散列您的string，并在确切的偏移量search。我认为应该更快。

尝试使用BinarySearch方法，它应该更快，然后包含方法。

包含将是一个O（n）BinarySearch是一个O（LG（N））

我认为sorting后的集合应该在search时更快，添加新元素时要慢，但据我所知，你只search性能问题。

在string集合中search最快的方法

更新：

更新2：

Java：从Set <String>枚举

哪一个更有效率，一个for-each循环还是一个迭代器？

按值（Java）对Map <Key，Value>进行sorting

有序和sorting的集合有什么区别？

设置Python的collections.defaultdict的默认string值

如何结合两个词典没有循环？

过滤Java集合的最佳方法是什么？

JUnit 4比较集合

如何在Java中反转列表？

在java中最好的方法将列表转换为逗号分隔的string