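How do I convert HTML to text in C#?

I'm looking for C# code to convert an HTML document to plain text.

I'm not looking for simple tag stripping, but for something that outputs plain text with a reasonable preservation of the original layout.

The output should look like this:

Html2Txt @ W3C

I've looked at the HTML Agility Pack, but I don't think that's what I need. Does anyone have any other suggestions?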

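EDIT: I just downloaded the HTML Agility Pack from CodePlex and ran the Html2Txt project. What a disappointment (at least the module that does the HTML-to-text conversion)! All it did was strip the tags, flatten the tables, and so on. The output didn't look anything like what Html2Txt @ W3C produces. Too bad that source doesn't seem to be available. I was looking to see whether there is a more "canned" solution.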

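EDIT 2: Thank you everybody for your suggestions. FlySwat tipped me in the direction I wanted to go. I can use the System.Diagnostics.Process class to run lynx.exe with the "-dump" switch to send the text to standard output, and capture stdout with ProcessStartInfo.UseShellExecute = false and ProcessStartInfo.RedirectStandardOutput = true. I'll wrap all this in a C# class. This code will be called only occasionally, so I'm not too concerned about spawning a new process versus doing it in code. Plus, Lynx is FAST!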
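The approach described in EDIT 2 could be sketched roughly as follows. This is a minimal sketch, not the asker's actual class: the names LynxDump, BuildStartInfo and Render are hypothetical, and it assumes lynx.exe is on the PATH (or that a full path is passed in) and that the source argument contains no double quotes.

```csharp
using System;
using System.Diagnostics;

public static class LynxDump
{
    // Builds the start info for: lynx -dump "<source>".
    // "lynx.exe" as the default executable name is an assumption.
    public static ProcessStartInfo BuildStartInfo(string source, string lynxPath = "lynx.exe")
    {
        return new ProcessStartInfo
        {
            FileName = lynxPath,
            Arguments = "-dump \"" + source + "\"",
            UseShellExecute = false,        // required for stream redirection
            RedirectStandardOutput = true,  // lets us read the rendered text
            CreateNoWindow = true
        };
    }

    // Runs lynx and returns whatever it wrote to standard output.
    public static string Render(string source, string lynxPath = "lynx.exe")
    {
        using (Process process = Process.Start(BuildStartInfo(source, lynxPath)))
        {
            // Read before waiting so a full output buffer cannot block lynx.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return text;
        }
    }
}
```

A call such as LynxDump.Render("page.html") would then return the rendered text, provided Lynx is installed.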

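What you are looking for is a text-mode DOM renderer that outputs text, much like Lynx or other text browsers... This is much harder to do than you would expect.

Just a note about the HtmlAgilityPack for posterity. The project contains an example of converting HTML to text which, as the OP noted, does not handle whitespace the way anyone writing HTML would expect. There are full text-rendering solutions out there, mentioned by others on this question. This is not one of them (it cannot even handle tables in its current form), but it is lightweight and fast, which is all I wanted for creating a simple text version of HTML emails.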

    // small but important modification to class
    // http://htmlagilitypack.codeplex.com/sourcecontrol/changeset/view/62772#52179
    using System;
    using System.IO;
    using System.Text.RegularExpressions;
    using HtmlAgilityPack;

    public static class HtmlToText
    {
        public static string Convert(string path)
        {
            HtmlDocument doc = new HtmlDocument();
            doc.Load(path);
            return ConvertDoc(doc);
        }

        public static string ConvertHtml(string html)
        {
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(html);
            return ConvertDoc(doc);
        }

        public static string ConvertDoc(HtmlDocument doc)
        {
            using (StringWriter sw = new StringWriter())
            {
                ConvertTo(doc.DocumentNode, sw);
                sw.Flush();
                return sw.ToString();
            }
        }

        internal static void ConvertContentTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
        {
            foreach (HtmlNode subnode in node.ChildNodes)
            {
                ConvertTo(subnode, outText, textInfo);
            }
        }

        public static void ConvertTo(HtmlNode node, TextWriter outText)
        {
            ConvertTo(node, outText, new PreceedingDomTextInfo(false));
        }

        internal static void ConvertTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
        {
            string html;
            switch (node.NodeType)
            {
                case HtmlNodeType.Comment:
                    // don't output comments
                    break;
                case HtmlNodeType.Document:
                    ConvertContentTo(node, outText, textInfo);
                    break;
                case HtmlNodeType.Text:
                    // script and style must not be output
                    string parentName = node.ParentNode.Name;
                    if ((parentName == "script") || (parentName == "style"))
                    {
                        break;
                    }
                    // get text
                    html = ((HtmlTextNode)node).Text;
                    // is it in fact a special closing node output as text?
                    if (HtmlNode.IsOverlappedClosingElement(html))
                    {
                        break;
                    }
                    // check the text is meaningful and not a bunch of whitespaces
                    if (html.Length == 0)
                    {
                        break;
                    }
                    if (!textInfo.WritePrecedingWhiteSpace || textInfo.LastCharWasSpace)
                    {
                        html = html.TrimStart();
                        if (html.Length == 0) { break; }
                        textInfo.IsFirstTextOfDocWritten.Value = textInfo.WritePrecedingWhiteSpace = true;
                    }
                    outText.Write(HtmlEntity.DeEntitize(Regex.Replace(html.TrimEnd(), @"\s{2,}", " ")));
                    if (textInfo.LastCharWasSpace = char.IsWhiteSpace(html[html.Length - 1]))
                    {
                        outText.Write(' ');
                    }
                    break;
                case HtmlNodeType.Element:
                    string endElementString = null;
                    bool isInline;
                    bool skip = false;
                    int listIndex = 0;
                    switch (node.Name)
                    {
                        case "nav":
                            skip = true;
                            isInline = false;
                            break;
                        case "body":
                        case "section":
                        case "article":
                        case "aside":
                        case "h1":
                        case "h2":
                        case "header":
                        case "footer":
                        case "address":
                        case "main":
                        case "div":
                        case "p": // stylistic - adjust as you tend to use
                            if (textInfo.IsFirstTextOfDocWritten)
                            {
                                outText.Write("\r\n");
                            }
                            endElementString = "\r\n";
                            isInline = false;
                            break;
                        case "br":
                            outText.Write("\r\n");
                            skip = true;
                            textInfo.WritePrecedingWhiteSpace = false;
                            isInline = true;
                            break;
                        case "a":
                            if (node.Attributes.Contains("href"))
                            {
                                string href = node.Attributes["href"].Value.Trim();
                                if (node.InnerText.IndexOf(href, StringComparison.InvariantCultureIgnoreCase) == -1)
                                {
                                    endElementString = "<" + href + ">";
                                }
                            }
                            isInline = true;
                            break;
                        case "li":
                            if (textInfo.ListIndex > 0)
                            {
                                outText.Write("\r\n{0}.\t", textInfo.ListIndex++);
                            }
                            else
                            {
                                // using '*' as bullet char, with tab after, but whatever you want eg "\t->", if utf-8 0x2022
                                outText.Write("\r\n*\t");
                            }
                            isInline = false;
                            break;
                        case "ol":
                            listIndex = 1;
                            goto case "ul";
                        case "ul":
                            // not handling nested lists any differently at this stage - that is getting close to rendering problems
                            endElementString = "\r\n";
                            isInline = false;
                            break;
                        case "img": // inline-block in reality
                            if (node.Attributes.Contains("alt"))
                            {
                                outText.Write('[' + node.Attributes["alt"].Value);
                                endElementString = "]";
                            }
                            if (node.Attributes.Contains("src"))
                            {
                                outText.Write('<' + node.Attributes["src"].Value + '>');
                            }
                            isInline = true;
                            break;
                        default:
                            isInline = true;
                            break;
                    }
                    if (!skip && node.HasChildNodes)
                    {
                        ConvertContentTo(node, outText, isInline ? textInfo : new PreceedingDomTextInfo(textInfo.IsFirstTextOfDocWritten) { ListIndex = listIndex });
                    }
                    if (endElementString != null)
                    {
                        outText.Write(endElementString);
                    }
                    break;
            }
        }
    }

    internal class PreceedingDomTextInfo
    {
        public PreceedingDomTextInfo(BoolWrapper isFirstTextOfDocWritten)
        {
            IsFirstTextOfDocWritten = isFirstTextOfDocWritten;
        }
        public bool WritePrecedingWhiteSpace { get; set; }
        public bool LastCharWasSpace { get; set; }
        public readonly BoolWrapper IsFirstTextOfDocWritten;
        public int ListIndex { get; set; }
    }

    // mutable bool shared by reference between nested PreceedingDomTextInfo instances
    internal class BoolWrapper
    {
        public BoolWrapper() { }
        public bool Value { get; set; }
        public static implicit operator bool(BoolWrapper boolWrapper)
        {
            return boolWrapper.Value;
        }
        public static implicit operator BoolWrapper(bool boolWrapper)
        {
            return new BoolWrapper { Value = boolWrapper };
        }
    }

As an example, the following HTML code...

    <!DOCTYPE HTML>
    <html>
      <head>
      </head>
      <body>
        <header>
          Whatever Inc.
        </header>
        <main>
          <p>
            Thanks for your enquiry. As this is the 1<sup>st</sup> time you have contacted us, we would like to clarify a few things:
          </p>
          <ol>
            <li>
              Please confirm this is your email by replying.
            </li>
            <li>
              Then perform this step.
            </li>
          </ol>
          <p>
            Please solve this <img alt="complex equation" src="http://upload.wikimedia.org/wikipedia/commons/8/8d/First_Equation_Ever.png"/>. Then, in any order, could you please:
          </p>
          <ul>
            <li>
              a point.
            </li>
            <li>
              another point, with a <a href="http://en.wikipedia.org/wiki/Hyperlink">hyperlink</a>.
            </li>
          </ul>
          <p>
            Sincerely,
          </p>
          <p>
            The whatever.com team
          </p>
        </main>
        <footer>
          Ph: 000 000 000<br/>
          mail: whatever st
        </footer>
      </body>
    </html>

...will be transformed into:

    Whatever Inc.

    Thanks for your enquiry. As this is the 1st time you have contacted us, we would like to clarify a few things:

    1. Please confirm this is your email by replying.
    2. Then perform this step.

    Please solve this [complex equation<http://upload.wikimedia.org/wikipedia/commons/8/8d/First_Equation_Ever.png>]. Then, in any order, could you please:

    * a point.
    * another point, with a hyperlink<http://en.wikipedia.org/wiki/Hyperlink>.

    Sincerely,

    The whatever.com team

    Ph: 000 000 000
    mail: whatever st

...as opposed to:

  Whatever Inc. Thanks for your enquiry. As this is the 1st time you have contacted us, we would like to clarify a few things: Please confirm this is your email by replying. Then perform this step. Please solve this . Then, in any order, could you please: a point. another point, with a hyperlink. Sincerely, The whatever.com team Ph: 000 000 000 mail: whatever st

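You could use this:

    // requires: using System.Text.RegularExpressions; using System.Web;
    public static string StripHTML(string HTMLText, bool decode = true)
    {
        // remove anything that looks like a tag, then optionally decode entities
        Regex reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
        var stripped = reg.Replace(HTMLText, "");
        return decode ? HttpUtility.HtmlDecode(stripped) : stripped;
    }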

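UPDATED

Thanks for the comments; I have updated the answer to improve this function.

I've heard from a reliable source that, if you're doing HTML parsing in .NET, you should look at the HTML Agility Pack again.

http://www.codeplex.com/htmlagilitypack

Some samples on SO:

HTML Agility Pack - Parsing tables

Have you tried http://www.aaronsw.com/2002/html2text/? It's Python, but open source.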

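Because I wanted conversion to plain text with LF and bullets, I found this nice solution on CodeProject, which covers many conversion use cases:

Convert HTML to Plain Text

Yes, it looks big, but it works fine.

Assuming you have well-formed HTML, you could also try an XSL transform.

Here's an example: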

    using System;
    using System.IO;
    using System.Xml.Linq;
    using System.Xml.XPath;
    using System.Xml.Xsl;

    class Html2TextExample
    {
        public static string Html2Text(XDocument source)
        {
            var writer = new StringWriter();
            Html2Text(source, writer);
            return writer.ToString();
        }

        public static void Html2Text(XDocument source, TextWriter output)
        {
            Transformer.Transform(source.CreateReader(), null, output);
        }

        private static XslCompiledTransform _transformer;

        // lazily compiled stylesheet that emits the string value of the whole document
        public static XslCompiledTransform Transformer
        {
            get
            {
                if (_transformer == null)
                {
                    _transformer = new XslCompiledTransform();
                    // method="text" (rather than "html") so characters such as & are not re-escaped
                    var xsl = XDocument.Parse(
                        @"<?xml version='1.0'?>" +
                        @"<xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"" exclude-result-prefixes=""xsl"">" +
                        @"<xsl:output method=""text"" encoding=""UTF-8"" />" +
                        @"<xsl:template match=""/""><xsl:value-of select=""."" /></xsl:template>" +
                        @"</xsl:stylesheet>");
                    _transformer.Load(xsl.CreateNavigator());
                }
                return _transformer;
            }
        }

        static void Main(string[] args)
        {
            var html = XDocument.Parse("<html><body><div>Hello world!</div></body></html>");
            var text = Html2Text(html);
            Console.WriteLine(text);
        }
    }

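The easiest way would probably be tag stripping combined with replacing some tags with text layout elements, such as dashes for list elements (li) and line breaks for br and p. It shouldn't be too hard to extend this to tables.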
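A rough sketch of that idea, using regular expressions only, might look like this. The class name SimpleHtmlToText is hypothetical, and the sketch makes simplifying assumptions: no ">" inside attribute values, no nested lists, and no table handling.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class SimpleHtmlToText
{
    public static string Convert(string html)
    {
        // Drop script and style blocks entirely, including their content.
        string text = Regex.Replace(html, @"<(script|style)\b[^>]*>.*?</\1\s*>", "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        // Source line breaks are not significant in HTML; collapse whitespace first.
        text = Regex.Replace(text, @"\s+", " ");
        // <br> and the end of a paragraph become line breaks.
        text = Regex.Replace(text, @"<br\s*/?>|</p\s*>", "\n", RegexOptions.IgnoreCase);
        // Each list item starts on its own line with a dash.
        text = Regex.Replace(text, @"<li\b[^>]*>", "\n- ", RegexOptions.IgnoreCase);
        // Remove every remaining tag.
        text = Regex.Replace(text, @"<[^>]+>", "");
        // Decode entities such as &amp; only after the tags are gone.
        text = WebUtility.HtmlDecode(text);
        // Trim stray spaces around the line breaks that were inserted.
        text = Regex.Replace(text, @" *\n *", "\n");
        return text.Trim();
    }
}
```

For the input `<p>Hello &amp; welcome</p><ul><li>one</li><li>two</li></ul>` this yields "Hello & welcome", an empty line, then "- one" and "- two" on separate lines. Regex-based stripping is fragile on malformed markup, so a real parser is the safer choice for anything beyond simple input.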

I had some decoding issues with HtmlAgility and I didn't want to invest time investigating them.

Instead I used this utility from the Microsoft Team Foundation API:

 var text = HtmlFilter.ConvertToPlainText(htmlContent);

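I don't know C#, but there is a fairly small and easy-to-read Python html2txt script here: http://www.aaronsw.com/2002/html2text/

Another post suggests the HTML Agility Pack:

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't have to understand XPATH or XSLT to use it, don't worry). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to what System.Xml proposes, but for HTML documents (or streams).

I have used Detagger in the past. It does a good job of formatting the HTML as text and is more than just a tag remover.

I recently blogged about a solution that uses a Markdown XSLT file to transform the HTML source. The HTML source will of course need to be valid XML first.

Try the simple and practical way: just call StripHTML(WebBrowserControl_name);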

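    // requires a reference to Microsoft.mshtml (using mshtml;)
    public string StripHTML(WebBrowser webp)
    {
        try
        {
            // assumption: the original snippet never defines doc; here it is taken
            // from a Windows Forms WebBrowser control
            IHTMLDocument2 doc = (IHTMLDocument2)webp.Document.DomDocument;
            doc.execCommand("SelectAll", true, null);
            IHTMLSelectionObject currentSelection = doc.selection;
            if (currentSelection != null)
            {
                IHTMLTxtRange range = currentSelection.createRange() as IHTMLTxtRange;
                if (range != null)
                {
                    currentSelection.empty();
                    return range.text;
                }
            }
        }
        catch (Exception ep)
        {
            //MessageBox.Show(ep.Message);
        }
        return "";
    }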

In Genexus, you can do it with a regular expression:

    &pattern = '<[^>]+>'

    &TSTRPNOT = &TSTRPNOT.ReplaceRegEx(&pattern, "")

You can use a WebBrowser control to render your HTML content in memory. After the LoadCompleted event has fired...

    IHTMLDocument2 htmlDoc = (IHTMLDocument2)webBrowser.Document;
    string innerHTML = htmlDoc.body.innerHTML;
    string innerText = htmlDoc.body.innerText;

If you are using .NET Framework 4.5, you can use System.Net.WebUtility.HtmlDecode(), which takes an HTML-encoded string and returns the decoded string.

Documented on MSDN at: http://msdn.microsoft.com/en-us/library/system.net.webutility.htmldecode(v=vs.110).aspx

You can use this in Windows Store apps as well.
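A minimal sketch of that call is shown below; the wrapper class HtmlDecodeExample is hypothetical and exists only to make the snippet self-contained. Note that HtmlDecode only turns character references back into characters. It does not remove tags, so it is usually combined with one of the tag-handling approaches from the other answers.

```csharp
using System.Net;

public static class HtmlDecodeExample
{
    // Converts character references such as &amp; and &lt; back to characters.
    // Markup that is already present in the input is left untouched.
    public static string Decode(string encoded)
    {
        return WebUtility.HtmlDecode(encoded);
    }
}
```

For example, HtmlDecodeExample.Decode("Fish &amp; chips &lt;b&gt;") returns "Fish & chips <b>".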

This is another solution to convert HTML to text or RTF in C#:

    SautinSoft.HtmlToRtf h = new SautinSoft.HtmlToRtf();
    h.OutputFormat = HtmlToRtf.eOutputFormat.TextUnicode;
    string text = h.ConvertString(htmlString);

This library is not free; it is a commercial product, and it is my own product.
