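How do I split a string by strings and include the delimiters, using .NET?

There are many similar questions, but apparently none is a perfect match, which is why I'm asking.

I'd like to split a random string (e.g. 123xx456yy789) by a list of string delimiters (e.g. xx, yy) and include the delimiters in the result (here: 123, xx, 456, yy, 789).

Good performance is a nice bonus. Regex should be avoided, if possible.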

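Update: I did some performance checks and compared the results (though I was too lazy to check them formally). The tested solutions are (in random order):

Gabe
Guffa
Mafu
Regex

Other solutions were not tested because they were either similar to another solution or came in too late.

This is the test code: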

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.Linq;

    // The SplitIncludeDelimiters_* extension methods are the candidate
    // implementations under test; they are not shown here.
    class Program
    {
        private static readonly List<Func<string, List<string>, List<string>>> Functions;
        private static readonly List<string> Sources;
        private static readonly List<List<string>> Delimiters;

        static Program()
        {
            Functions = new List<Func<string, List<string>, List<string>>>();
            Functions.Add((s, l) => s.SplitIncludeDelimiters_Gabe(l).ToList());
            Functions.Add((s, l) => s.SplitIncludeDelimiters_Guffa(l).ToList());
            Functions.Add((s, l) => s.SplitIncludeDelimiters_Naive(l).ToList());
            Functions.Add((s, l) => s.SplitIncludeDelimiters_Regex(l).ToList());

            // Three sources: empty, a GUID, and a long generated string.
            Sources = new List<string>();
            Sources.Add("");
            Sources.Add(Guid.NewGuid().ToString());
            string str = "";
            for (int outer = 0; outer < 10; outer++)
            {
                for (int i = 0; i < 10; i++)
                {
                    str += i + "**" + DateTime.UtcNow.Ticks;
                }
                str += "-";
            }
            Sources.Add(str);

            // Four delimiter sets: none, "-", "**", and both.
            Delimiters = new List<List<string>>();
            Delimiters.Add(new List<string>() { });
            Delimiters.Add(new List<string>() { "-" });
            Delimiters.Add(new List<string>() { "**" });
            Delimiters.Add(new List<string>() { "-", "**" });
        }

        private class Result
        {
            public readonly int FuncID;
            public readonly int SrcID;
            public readonly int DelimID;
            public readonly long Milliseconds;
            public readonly List<string> Output;

            public Result(int funcID, int srcID, int delimID, long milliseconds, List<string> output)
            {
                FuncID = funcID;
                SrcID = srcID;
                DelimID = delimID;
                Milliseconds = milliseconds;
                Output = output;
            }

            public void Print()
            {
                Console.WriteLine("S " + SrcID + "\tD " + DelimID + "\tF " + FuncID + "\t" + Milliseconds + "ms");
                // Item count, then the first 10 items, each cut to 15 characters.
                Console.WriteLine(Output.Count + "\t" + string.Join(" ",
                    Output.Take(10).Select(x => x.Length < 15 ? x : x.Substring(0, 15) + "...").ToArray()));
            }
        }

        static void Main(string[] args)
        {
            var results = new List<Result>();

            for (int srcID = 0; srcID < 3; srcID++)
            {
                for (int delimID = 0; delimID < 4; delimID++)
                {
                    for (int funcId = 3; funcId >= 0; funcId--) // I tried various orders in my tests
                    {
                        Stopwatch sw = new Stopwatch();
                        sw.Start();

                        var func = Functions[funcId];
                        var src = Sources[srcID];
                        var del = Delimiters[delimID];

                        for (int i = 0; i < 10000; i++)
                        {
                            func(src, del);
                        }
                        var list = func(src, del);
                        sw.Stop();

                        var res = new Result(funcId, srcID, delimID, sw.ElapsedMilliseconds, list);
                        results.Add(res);
                        res.Print();
                    }
                }
            }
        }
    }

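As you can see, it was really just a quick and dirty test, but I ran it multiple times and in different orders, and the results were always very consistent. The measured times range from a few milliseconds up to a few seconds for the larger datasets. I ignored the values in the low-millisecond range in the evaluation below because they seemed negligible in practice. Here's the output on my box: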

  S 0 D 0 F 3 11ms
 1
 S 0 D 0 F 2 7ms
 1
 S 0 D 0 F 1 6ms
 1
 S 0 D 0 F 0 4ms
 0
 S 0 D 1 F 3 28ms
 1
 S 0 D 1 F 2 8ms
 1
 S 0 D 1 F 1 7ms
 1
 S 0 D 1 F 0 3ms
 0
 S 0 D 2 F 3 30ms
 1
 S 0 D 2 F 2 8ms
 1
 S 0 D 2 F 1 6ms
 1
 S 0 D 2 F 0 3ms
 0
 S 0 D 3 F 3 30ms
 1
 S 0 D 3 F 2 10ms
 1
 S 0 D 3 F 1 8ms
 1
 S 0 D 3 F 0 3ms
 0
 S 1 D 0 F 3 9ms
 1 9e5282ec-e2a2-4 ...
 S 1 D 0 F 2 6ms
 1 9e5282ec-e2a2-4 ...
 S 1 D 0 F 1 5ms
 1 9e5282ec-e2a2-4 ...
 S 1 D 0 F 0 5ms
 1 9e5282ec-e2a2-4 ...
 S 1 D 1 F 3 63ms
 9 9e5282ec  -  e2a2  -  4265  -  8276  -  6dbb50fdae37
 S 1 D 1 F 2 37ms
 9 9e5282ec  -  e2a2  -  4265  -  8276  -  6dbb50fdae37
 S 1 D 1 F 1 29ms
 9 9e5282ec  -  e2a2  -  4265  -  8276  -  6dbb50fdae37
 S 1 D 1 F 0 22ms
 9 9e5282ec  -  e2a2  -  4265  -  8276  -  6dbb50fdae37
 S 1 D 2 F 3 30ms
 1 9e5282ec-e2a2-4 ...
 S 1 D 2 F 2 10ms
 1 9e5282ec-e2a2-4 ...
 S 1 D 2 F 1 10ms
 1 9e5282ec-e2a2-4 ...
 S 1 D 2 F 0 12ms
 1 9e5282ec-e2a2-4 ...
 S 1 D 3 F 3 73ms
 9 9e5282ec  -  e2a2  -  4265  -  8276  -  6dbb50fdae37
 S 1 D 3 F 2 40ms
 9 9e5282ec  -  e2a2  -  4265  -  8276  -  6dbb50fdae37
 S 1 D 3 F 1 33ms
 9 9e5282ec  -  e2a2  -  4265  -  8276  -  6dbb50fdae37
 S 1 D 3 F 0 30ms
 9 9e5282ec  -  e2a2  -  4265  -  8276  -  6dbb50fdae37
 S 2 D 0 F 3 10ms
 1 0 ** 634226552821 ...
 S 2 D 0 F 2 109ms
 1 0 ** 634226552821 ...
 S 2 D 0 F 1 5ms
 1 0 ** 634226552821 ...
 S 2 D 0 F 0 127ms
 1 0 ** 634226552821 ...
 S 2 D 1 F 3 184ms
 21 0 ** 634226552821 ...  -  0 ** 634226552821 ...  -  0 ** 634226552821 ...  -  0 ** 634226
 552821 ...  -  0 ** 634226552821 ...  - 
 S 2 D 1 F 2 364ms
 21 0 ** 634226552821 ...  -  0 ** 634226552821 ...  -  0 ** 634226552821 ...  -  0 ** 634226
 552821 ...  -  0 ** 634226552821 ...  - 
 S 2 D 1 F 1 134ms
 21 0 ** 634226552821 ...  -  0 ** 634226552821 ...  -  0 ** 634226552821 ...  -  0 ** 634226
 552821 ...  -  0 ** 634226552821 ...  - 
 S 2 D 1 F 0 517ms
 20 0 ** 634226552821 ...  -  0 ** 634226552821 ...  -  0 ** 634226552821 ...  -  0 ** 634226
 552821 ...  -  0 ** 634226552821 ...  - 
 S 2 D 2 F 3 688ms
 201 0 ** 634226552821217 ... ** 634226552821217 ... ** 634226552821217 ... ** 6
 34226552821217 ... **
 S 2 D 2 F 2 2404ms
 201 0 ** 634226552821217 ... ** 634226552821217 ... ** 634226552821217 ... ** 6
 34226552821217 ... **
 S 2 D 2 F 1 874ms
 201 0 ** 634226552821217 ... ** 634226552821217 ... ** 634226552821217 ... ** 6
 34226552821217 ... **
 S 2 D 2 F 0 717ms
 201 0 ** 634226552821217 ... ** 634226552821217 ... ** 634226552821217 ... ** 6
 34226552821217 ... **
 S 2 D 3 F 3 1205ms
 221 0 ** 634226552821217 ... ** 634226552821217 ... ** 634226552821217 ... ** 6
 34226552821217 ... **
 S 2 D 3 F 2 3471ms
 221 0 ** 634226552821217 ... ** 634226552821217 ... ** 634226552821217 ... ** 6
 34226552821217 ... **
 S 2 D 3 F 1 1008ms
 221 0 ** 634226552821217 ... ** 634226552821217 ... ** 634226552821217 ... ** 6
 34226552821217 ... **
 S 2 D 3 F 0 1095ms
 220 0 ** 634226552821217 ... ** 634226552821217 ... ** 634226552821217 ... ** 6
 34226552821217 ... **

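I compared the results and this is what I found:

All 4 functions are fast enough for common usage.
The naive version (i.e. what I wrote initially) is the worst in terms of computation time.
Regex is a bit slow on small datasets (probably due to initialization overhead).
Regex does well on large data and reaches a speed similar to the non-regex solutions.
Overall, the best performer seems to be Guffa's version, which is to be expected from the code.
Gabe's version sometimes omits an item, but I did not investigate this (bug?).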
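The outputs above were only compared by eye. A minimal sketch of how they could be checked mechanically follows; the class `SplitChecks` and its two helpers are hypothetical and not part of the test code. The counts in the output (0 versus 1 for the empty source, 20 versus 21 when the source ends in a delimiter) suggest that the omitted item is an empty string, but that is an inference from the numbers, not something that was verified.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helpers for comparing split results; names are illustrative only.
public static class SplitChecks
{
    // True when the pieces, concatenated in order, reproduce the source exactly.
    // Note: this cannot detect a dropped empty string.
    public static bool RoundTrips(string source, IEnumerable<string> pieces)
    {
        return string.Concat(pieces) == source;
    }

    // True when two implementations produced the same pieces in the same order.
    // Unlike RoundTrips, this does notice a missing empty string.
    public static bool SameOutput(IEnumerable<string> a, IEnumerable<string> b)
    {
        return a.SequenceEqual(b, StringComparer.Ordinal);
    }
}
```

Running `SameOutput` on the results of two functions for every source and delimiter set would flag exactly the rows where the item counts differ.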

To conclude this topic, I suggest using Regex, which is reasonably fast. If performance is critical, I'd prefer Guffa's implementation.

Despite your reluctance to use regex, it actually preserves the delimiters nicely if you use a capturing group together with the Regex.Split method:

    string input = "123xx456yy789";
    string pattern = "(xx|yy)";
    string[] result = Regex.Split(input, pattern);

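If you remove the parentheses from the pattern and use just "xx|yy", the delimiters are not preserved. Be sure to use Regex.Escape on each delimiter if it contains any metacharacters that have a special meaning in regex. These characters include \, *, +, ?, |, {, [, (, ), ^, $, ., #. For instance, a delimiter of . should be escaped as \. . Given a list of delimiters, you need to "OR" them together with the pipe symbol |, and that too is a character that would get escaped. To build the pattern properly, use the following code (thanks to @gabe for pointing this out):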

    var delimiters = new List<string> { ".", "xx", "yy" };
    string pattern = "(" + String.Join("|", delimiters.Select(d => Regex.Escape(d)).ToArray()) + ")";

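The parentheses and pipes are concatenated after escaping rather than being part of the escaped strings, because otherwise they would be escaped as well and lose their special meaning.

EDIT: In addition, if the delimiters list happens to be empty, the final pattern would incorrectly be (), which causes empty matches. To prevent this, check the delimiters first. With all this in mind the snippet becomes: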

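    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text.RegularExpressions;

    string input = "123xx456yy789";

    // to reach the else branch set delimiters to new List<string>();
    var delimiters = new List<string> { ".", "xx", "yy", "()" };
    if (delimiters.Count > 0)
    {
        // Escape each delimiter, join with "|", and wrap in a capturing group
        // so that Regex.Split keeps the delimiters in the result.
        string pattern = "(" + String.Join("|", delimiters.Select(d => Regex.Escape(d)).ToArray()) + ")";
        string[] result = Regex.Split(input, pattern);
        foreach (string s in result)
        {
            Console.WriteLine(s);
        }
    }
    else
    {
        // nothing to split
        Console.WriteLine(input);
    }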

If you need case-insensitive matching of the delimiters, use the RegexOptions.IgnoreCase option: Regex.Split(input, pattern, RegexOptions.IgnoreCase)
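The pieces described so far (escaping, the capturing group, the empty-list check and the optional case-insensitive match) could be folded into one helper along these lines. This is only a sketch: the names `RegexSplitSketch` and `SplitKeepDelimiters` and the `ignoreCase` parameter are made up for illustration, and skipping null or empty delimiters is an extra assumption that the snippet above does not make.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical wrapper around Regex.Split; names are illustrative only.
public static class RegexSplitSketch
{
    public static string[] SplitKeepDelimiters(
        string input, IEnumerable<string> delimiters, bool ignoreCase = false)
    {
        // Assumption: null/empty delimiters are ignored, because an empty
        // alternative in the pattern would match at every position.
        string[] escaped = delimiters
            .Where(d => !string.IsNullOrEmpty(d))
            .Select(d => Regex.Escape(d))
            .ToArray();

        // Nothing to split on: return the input unchanged.
        if (escaped.Length == 0)
            return new[] { input };

        // The capturing group makes Regex.Split include the delimiters.
        string pattern = "(" + string.Join("|", escaped) + ")";
        RegexOptions options = ignoreCase ? RegexOptions.IgnoreCase : RegexOptions.None;
        return Regex.Split(input, pattern, options);
    }
}
```

For example, `RegexSplitSketch.SplitKeepDelimiters("123XX456yy789", new[] { "xx", "yy" }, ignoreCase: true)` yields 123, XX, 456, yy, 789; the captured delimiter keeps the casing it has in the input.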

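EDIT #2: The solution so far matches split tokens that may be substrings of a larger string. If a split token should only match as a whole, not as part of a longer word (for example when words in a sentence are used as delimiters), add the word-boundary metacharacter \b around the pattern.

For example, consider this sentence (yes, it's corny): "Welcome to stackoverflow... where the stack never overflows!"

If the delimiters were { "stack", "flow" }, the current solution would split "stackoverflow" into the three strings { "stack", "over", "flow" }. If you need an exact match, the only place where the sentence should be split is at the word "stack" later in the sentence, not inside "stackoverflow".

To get the exact-match behavior, change the pattern to include \b, as in \b(delim1|delim2|delimN)\b: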

    string pattern = @"\b(" + String.Join("|", delimiters.Select(d => Regex.Escape(d))) + @")\b";
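To see the effect of \b on the example sentence, here is a small sketch. The names `WordBoundarySplitSketch` and `SplitOnWholeWords` are hypothetical; the pattern is the one shown above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper; names are illustrative only.
public static class WordBoundarySplitSketch
{
    public static string[] SplitOnWholeWords(string input, IEnumerable<string> delimiters)
    {
        string[] escaped = delimiters.Select(d => Regex.Escape(d)).ToArray();
        if (escaped.Length == 0)
            return new[] { input };

        // \b on both sides: a delimiter only matches when it is not
        // directly preceded or followed by another word character.
        string pattern = @"\b(" + string.Join("|", escaped) + @")\b";
        return Regex.Split(input, pattern);
    }
}
```

Called with the sentence above and the delimiters "stack" and "flow", this returns three pieces: "Welcome to stackoverflow... where the ", "stack" and " never overflows!". Keep in mind that \b is only meaningful for delimiters that begin and end with word characters; a delimiter such as "." would only match between two word characters.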

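Finally, if you want to trim the whitespace before and after the delimiters, add \s* around the pattern, as in \s*(delim1|delim2|delimN)\s*. This can be combined with \b as follows: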

    string pattern = @"\s*\b(" + String.Join("|", delimiters.Select(d => Regex.Escape(d))) + @")\b\s*";
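A short sketch of the trimming variant, again with made-up names (`TrimmedSplitSketch`, `SplitAndTrim`). Because \s* sits outside the capturing group, the whitespace is consumed by the match but is not part of the returned delimiter.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper; names are illustrative only.
public static class TrimmedSplitSketch
{
    public static string[] SplitAndTrim(string input, IEnumerable<string> delimiters)
    {
        string[] escaped = delimiters.Select(d => Regex.Escape(d)).ToArray();
        if (escaped.Length == 0)
            return new[] { input };

        // \s* swallows whitespace next to the delimiter; only the group
        // between the \b anchors ends up in the result.
        string pattern = @"\s*\b(" + string.Join("|", escaped) + @")\b\s*";
        return Regex.Split(input, pattern);
    }
}
```

With the input "one and two" and the delimiter "and", the result is "one", "and", "two". Only whitespace adjacent to a delimiter is removed; whitespace at the very start or end of the input is left alone.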

Ok, sorry, maybe this one:

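    string source = "123xx456yy789";
    string[] delimiters = { "xx", "yy" };

    // Wrap every delimiter in a marker character, then split on the marker.
    // This only works if ';' never occurs in the source or in a delimiter.
    foreach (string delimiter in delimiters)
        source = source.Replace(delimiter, ";" + delimiter + ";");
    string[] parts = source.Split(';');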

Here's a solution that doesn't use a regular expression and doesn't create more strings than necessary:

    public static List<string> Split(string searchStr, string[] separators)
    {
        List<string> result = new List<string>();
        int length = searchStr.Length;
        int lastMatchEnd = 0;
        for (int i = 0; i < length; i++)
        {
            for (int j = 0; j < separators.Length; j++)
            {
                string str = separators[j];
                int sepLen = str.Length;
                // Cheap first-character test before the full ordinal comparison.
                if (((searchStr[i] == str[0]) && (sepLen <= (length - i))) &&
                    ((sepLen == 1) || (String.CompareOrdinal(searchStr, i, str, 0, sepLen) == 0)))
                {
                    // Emit the text before the separator, then the separator itself.
                    result.Add(searchStr.Substring(lastMatchEnd, i - lastMatchEnd));
                    result.Add(separators[j]);
                    i += sepLen - 1;
                    lastMatchEnd = i + 1;
                    break;
                }
            }
        }
        if (lastMatchEnd != length)
            result.Add(searchStr.Substring(lastMatchEnd));
        return result;
    }

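I came up with a solution for something similar a while back. To split a string efficiently, you can keep a list of the next occurrence of each delimiter. That way you minimise the number of times you have to search for each delimiter.

This algorithm performs well even for a long string and a large number of delimiters: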

    string input = "123xx456yy789";
    string[] delimiters = { "xx", "yy" };

    // Position of the next occurrence of each delimiter (-1 if there is none).
    int[] nextPosition = delimiters.Select(d => input.IndexOf(d)).ToArray();
    List<string> result = new List<string>();
    int pos = 0;
    while (true)
    {
        // Find the delimiter that occurs first.
        int firstPos = int.MaxValue;
        string delimiter = null;
        for (int i = 0; i < nextPosition.Length; i++)
        {
            if (nextPosition[i] != -1 && nextPosition[i] < firstPos)
            {
                firstPos = nextPosition[i];
                delimiter = delimiters[i];
            }
        }
        if (firstPos != int.MaxValue)
        {
            result.Add(input.Substring(pos, firstPos - pos));
            result.Add(delimiter);
            pos = firstPos + delimiter.Length;
            // Only re-search delimiters whose cached position is now behind us.
            for (int i = 0; i < nextPosition.Length; i++)
            {
                if (nextPosition[i] != -1 && nextPosition[i] < pos)
                {
                    nextPosition[i] = input.IndexOf(delimiters[i], pos);
                }
            }
        }
        else
        {
            result.Add(input.Substring(pos));
            break;
        }
    }

(With reservations for any bugs: I just threw this version together and haven't tested it thoroughly.)

A naive implementation

    public IEnumerable<string> SplitX(string text, string[] delimiters)
    {
        var split = text.Split(delimiters, StringSplitOptions.None);

        foreach (string part in split)
        {
            yield return part;
            // Skip past the part, then see which delimiter follows it.
            text = text.Substring(part.Length);

            string delim = delimiters.FirstOrDefault(x => text.StartsWith(x));
            if (delim != null)
            {
                yield return delim;
                text = text.Substring(delim.Length);
            }
        }
    }

This has the same semantics as String.Split with StringSplitOptions.RemoveEmptyEntries (so empty tokens are not included).

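It could be made faster by using unsafe code to iterate over the source string, though that requires you to write the iteration mechanism yourself rather than using yield return. It allocates the absolute minimum (one substring per non-separator token plus the wrapping enumerator), so to realistically improve performance you would have to:

use even more unsafe code (by using 'CompareOrdinal' I effectively already am)
- mainly to avoid the overhead of character lookups on the string by working on a char buffer
make use of domain-specific knowledge about the input sources or tokens.
- you may be happy to eliminate the null check on the separators
- you may know that the separators are almost never single characters

The code is written as an extension method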

    // Must be declared inside a non-nested, non-generic static class.
    public static IEnumerable<string> SplitWithTokens(this string str, string[] separators)
    {
        if (separators == null || separators.Length == 0)
        {
            yield return str;
            yield break;
        }
        int prev = 0;
        for (int i = 0; i < str.Length; i++)
        {
            foreach (var sep in separators)
            {
                if (!string.IsNullOrEmpty(sep))
                {
                    // Cheap first-character test before the full ordinal comparison.
                    if (((str[i] == sep[0]) && (sep.Length <= (str.Length - i))) &&
                        ((sep.Length == 1) || (string.CompareOrdinal(str, i, sep, 0, sep.Length) == 0)))
                    {
                        // Empty tokens between separators are skipped.
                        if (i - prev != 0)
                            yield return str.Substring(prev, i - prev);
                        yield return sep;
                        i += sep.Length - 1;
                        prev = i + 1;
                        break;
                    }
                }
            }
        }
        if (str.Length - prev > 0)
            yield return str.Substring(prev, str.Length - prev);
    }

My first post/answer... this is a recursive approach.

    static void Split(string src, string[] delims, ref List<string> final)
    {
        if (src.Length == 0)
            return;

        int endTrimIndex = src.Length;
        foreach (string delim in delims)
        {
            // get the index of the first occurrence of this delim
            int indexOfDelim = src.IndexOf(delim);
            // check to see if this delim is at the beginning of src
            if (indexOfDelim == 0)
            {
                endTrimIndex = delim.Length;
                break;
            }
            // see if this delim comes before previously searched delims
            else if (indexOfDelim < endTrimIndex && indexOfDelim != -1)
                endTrimIndex = indexOfDelim;
        }
        final.Add(src.Substring(0, endTrimIndex));
        Split(src.Remove(0, endTrimIndex), delims, ref final);
    }
