Remove HTML tags from a string, including &nbsp;, in C#

How can I use a regular expression in C# to remove all HTML tags? My string looks like this:

"<div>hello</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;&nbsp;</div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div><div><br></div>"

If you can't use an HTML-parser-based solution to filter out the tags, here is a simple regular expression for it.

 string noHTML = Regex.Replace(inputHTML, @"<[^>]+>|&nbsp;", "").Trim();

Ideally, you should make a second pass with another regex to collapse runs of multiple spaces:

 string noHTMLNormalised = Regex.Replace(noHTML, @"\s{2,}", " ");

I took @Ravi Thapliyal's code and turned it into a method. It is simple and may not clean everything, but so far it does what I need.

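    public static string ScrubHtml(string value)
    {
        // Step 1: drop every tag and every &nbsp; entity, then trim the ends
        var step1 = Regex.Replace(value, @"<[^>]+>|&nbsp;", "").Trim();
        // Step 2: collapse runs of two or more whitespace characters into one space
        var step2 = Regex.Replace(step1, @"\s{2,}", " ");
        return step2;
    }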
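One thing worth knowing about this approach: because tags are replaced with an empty string, text from adjacent elements gets fused together. The sketch below repeats `ScrubHtml` next to a variant that substitutes a space instead. `ScrubHtmlDemo` and `ScrubHtmlSpaced` are names made up for this illustration and are not part of the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ScrubHtmlDemo
{
    // Same two-step logic as ScrubHtml above: tags and &nbsp; become "".
    public static string ScrubHtml(string value)
    {
        var step1 = Regex.Replace(value, @"<[^>]+>|&nbsp;", "").Trim();
        return Regex.Replace(step1, @"\s{2,}", " ");
    }

    // Hypothetical variant: tags and &nbsp; become a single space, so text
    // from neighbouring elements stays separated. Whitespace is collapsed
    // and trimmed afterwards.
    public static string ScrubHtmlSpaced(string value)
    {
        var step1 = Regex.Replace(value, @"<[^>]+>|&nbsp;", " ");
        return Regex.Replace(step1, @"\s+", " ").Trim();
    }
}
```

With the input `<p>one</p><p>two</p>`, `ScrubHtml` returns `onetwo` while `ScrubHtmlSpaced` returns `one two`. The trade-off is that inline markup inside a word, such as `<b>bo</b>ld`, comes out as `bo ld` with the spaced variant, so which one fits depends on the HTML you expect.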

I have been using this function for a while. It removes almost any messy HTML you can throw at it and leaves the text intact.

    // Requires: using System; using System.Text;
    //           using System.Text.RegularExpressions; using System.Web;
    private static readonly Regex _tags_ = new Regex(@"<[^>]+?>", RegexOptions.Multiline | RegexOptions.Compiled);

    // Add characters that should not be removed to this regex
    private static readonly Regex _notOkCharacter_ = new Regex(@"[^\w;&#@.:/\\?=|%!() -]", RegexOptions.Compiled);

    public static String UnHtml(String html)
    {
        html = HttpUtility.UrlDecode(html);
        html = HttpUtility.HtmlDecode(html);

        // Remove comments, scripts and styles together with their content
        html = RemoveTag(html, "<!--", "-->");
        html = RemoveTag(html, "<script", "</script>");
        html = RemoveTag(html, "<style", "</style>");

        // Replace matches of these regexes with a space
        html = _tags_.Replace(html, " ");
        html = _notOkCharacter_.Replace(html, " ");
        html = SingleSpacedTrim(html);

        return html;
    }

    // Removes every span from startTag through endTag (inclusive), case-insensitively
    private static String RemoveTag(String html, String startTag, String endTag)
    {
        Boolean bAgain;
        do
        {
            bAgain = false;
            Int32 startTagPos = html.IndexOf(startTag, 0, StringComparison.CurrentCultureIgnoreCase);
            if (startTagPos < 0)
                continue;
            Int32 endTagPos = html.IndexOf(endTag, startTagPos + 1, StringComparison.CurrentCultureIgnoreCase);
            if (endTagPos <= startTagPos)
                continue;
            html = html.Remove(startTagPos, endTagPos - startTagPos + endTag.Length);
            bAgain = true;
        } while (bAgain);
        return html;
    }

    // Collapses each run of spaces, tabs and line breaks into one space, then trims
    private static String SingleSpacedTrim(String inString)
    {
        StringBuilder sb = new StringBuilder();
        Boolean inBlanks = false;
        foreach (Char c in inString)
        {
            switch (c)
            {
                case '\r':
                case '\n':
                case '\t':
                case ' ':
                    if (!inBlanks)
                    {
                        inBlanks = true;
                        sb.Append(' ');
                    }
                    continue;
                default:
                    inBlanks = false;
                    sb.Append(c);
                    break;
            }
        }
        return sb.ToString().Trim();
    }

 var noHtml = Regex.Replace(inputHTML, @"<[^>]*(>|$)|&nbsp;|&zwnj;|&raquo;|&laquo;", string.Empty).Trim();
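Listing entities one by one, as the pattern above does, only covers the entities you thought of in advance. A possible alternative, sketched here under the assumption that you want the decoded characters kept as text rather than deleted, is to strip the tags first and then let `System.Net.WebUtility.HtmlDecode` handle the entities. `HtmlText` and `StripTagsAndDecode` are hypothetical names used only for this illustration.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Hypothetical helper: strips tags, decodes entities, normalises whitespace.
    public static string StripTagsAndDecode(string html)
    {
        if (string.IsNullOrEmpty(html))
            return string.Empty;

        // Replace each tag (including an unterminated one at the end of the
        // input) with a space so neighbouring text is not fused together.
        string noTags = Regex.Replace(html, @"<[^>]*(>|$)", " ");

        // Decode entities such as &nbsp;, &amp; and &raquo; into characters.
        string decoded = WebUtility.HtmlDecode(noTags);

        // In .NET, \s also matches the no-break space (U+00A0) produced by
        // &nbsp;, so this collapses those as well.
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}
```

For example, `HtmlText.StripTagsAndDecode("<p>a &amp; b &raquo; c</p>")` returns `a & b » c`. Unlike the pattern above, this keeps `»` in the output; zero-width characters such as the one produced by `&zwnj;` are also kept, since they are not whitespace, and would need an explicit extra replace if you want them gone.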

This:

 (<.+?>|&nbsp;)

will match any tag or &nbsp;.

    string regex = @"(<.+?>|&nbsp;)";
    var x = Regex.Replace(originalString, regex, "").Trim();

After that, x = hello.

Cleaning an HTML document involves a lot of tricky details. This package may help: https://github.com/mganss/HtmlSanitizer

 (<([^>]+)>|&nbsp;)

You can test it here: https://regex101.com/r/kB0rQ4/1
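The answers in this thread use two slightly different tag patterns, `<.+?>` and `<[^>]+>`. They behave the same on the sample string, but they differ when a tag spans several lines: by default `.` does not match a newline in .NET, whereas the negated class `[^>]` does. The following is a minimal sketch of that difference; the class and method names are made up for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagPatternComparison
{
    // Lazy dot: '.' does not match '\n' unless RegexOptions.Singleline is set,
    // so a tag that contains a line break is left in place.
    public static string StripLazyDot(string html) =>
        Regex.Replace(html, @"<.+?>", "");

    // Negated class: [^>] matches any character except '>', newlines included.
    public static string StripNegatedClass(string html) =>
        Regex.Replace(html, @"<[^>]+>", "");
}
```

Given the input `"<a\nhref=\"x\">link</a>"`, `StripNegatedClass` returns `link`, while `StripLazyDot` removes only the closing `</a>` and leaves the multi-line opening tag in the result. Passing `RegexOptions.Singleline` to the lazy-dot version should make the two agree on this input.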
