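How do you convert HTML to plain text in C#?
I'm looking for C# code to convert an HTML document to plain text.
I'm not looking for simple tag stripping, but something that will output plain text with a reasonable preservation of the original layout.
The output should look like this:
Html2Txt at W3C
I've looked at the HTML Agility Pack, but I don't think that's what I need. Does anyone have any other suggestions?
EDIT: I just downloaded the HTML Agility Pack from CodePlex and ran the Html2Txt project. What a disappointment (at least the module that does HTML to text conversion)! All it did was strip the tags, flatten the tables, and so on. The output looked nothing like what Html2Txt @ W3C produces. Too bad that source doesn't seem to be available. I was looking to see if there is a more "canned" solution available.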
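EDIT 2: Thank you everybody for your suggestions. FlySwat tipped me in the direction I wanted to go. I can use the System.Diagnostics.Process class to run lynx.exe with the "-dump" switch to send the text to standard output, and capture the stdout with ProcessStartInfo.UseShellExecute = false and ProcessStartInfo.RedirectStandardOutput = true. I'll wrap all this in a C# class. This code will be called only occasionally, so I'm not too concerned about spawning a new process versus doing it in code. Plus, Lynx is fast!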
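The approach described in EDIT 2 could be sketched roughly as follows. This is only an illustration under stated assumptions: the names ProcessCapture, RunAndCapture and HtmlFileToText are hypothetical, and lynx.exe is assumed to be reachable on the PATH. Only the "-dump" switch and the two ProcessStartInfo settings come from the question itself.

```csharp
using System;
using System.Diagnostics;

public static class ProcessCapture
{
    // Runs an external program and returns everything it wrote to standard output.
    public static string RunAndCapture(string fileName, string arguments)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = fileName,
            Arguments = arguments,
            UseShellExecute = false,       // must be false to redirect streams
            RedirectStandardOutput = true, // lets us read the child's stdout
            CreateNoWindow = true
        };

        using (Process process = Process.Start(startInfo))
        {
            // Read the stream before waiting, so a full pipe buffer
            // cannot block the child process.
            string output = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            return output;
        }
    }

    // Hypothetical wrapper: assumes lynx.exe is on the PATH.
    public static string HtmlFileToText(string htmlPath)
    {
        return RunAndCapture("lynx.exe", "-dump \"" + htmlPath + "\"");
    }
}
```

Reading standard output to the end before calling WaitForExit is a deliberate ordering choice here; waiting first can hang if the child produces more output than the pipe buffer holds.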
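What you are looking for is a text-mode DOM renderer that outputs text, much like Lynx or other text browsers... This is much harder to do than you would expect.
Just a note about the HtmlAgilityPack for posterity. The project contains an example of converting HTML to text which, as the OP noted, does not handle whitespace at all the way anyone writing HTML would expect. There are full text-rendering solutions out there, mentioned by others on this question, and this is not one of them (it cannot even handle tables in its current form), but it is lightweight and fast, which is all I wanted for creating a simple text version of HTML emails.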
// Small but important modification to class http://htmlagilitypack.codeplex.com/sourcecontrol/changeset/view/62772#52179
using System;
using System.IO;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

public static class HtmlToText
{
    public static string Convert(string path)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.Load(path);
        return ConvertDoc(doc);
    }

    public static string ConvertHtml(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
        return ConvertDoc(doc);
    }

    public static string ConvertDoc(HtmlDocument doc)
    {
        using (StringWriter sw = new StringWriter())
        {
            ConvertTo(doc.DocumentNode, sw);
            sw.Flush();
            return sw.ToString();
        }
    }

    internal static void ConvertContentTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
    {
        foreach (HtmlNode subnode in node.ChildNodes)
        {
            ConvertTo(subnode, outText, textInfo);
        }
    }

    public static void ConvertTo(HtmlNode node, TextWriter outText)
    {
        ConvertTo(node, outText, new PreceedingDomTextInfo(false));
    }

    internal static void ConvertTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo)
    {
        string html;
        switch (node.NodeType)
        {
            case HtmlNodeType.Comment:
                // don't output comments
                break;

            case HtmlNodeType.Document:
                ConvertContentTo(node, outText, textInfo);
                break;

            case HtmlNodeType.Text:
                // script and style must not be output
                string parentName = node.ParentNode.Name;
                if ((parentName == "script") || (parentName == "style"))
                {
                    break;
                }
                // get text
                html = ((HtmlTextNode)node).Text;
                // is it in fact a special closing node output as text?
                if (HtmlNode.IsOverlappedClosingElement(html))
                {
                    break;
                }
                // check the text is meaningful and not a bunch of whitespaces
                if (html.Length == 0)
                {
                    break;
                }
                if (!textInfo.WritePrecedingWhiteSpace || textInfo.LastCharWasSpace)
                {
                    html = html.TrimStart();
                    if (html.Length == 0)
                    {
                        break;
                    }
                    textInfo.IsFirstTextOfDocWritten.Value = textInfo.WritePrecedingWhiteSpace = true;
                }
                outText.Write(HtmlEntity.DeEntitize(Regex.Replace(html.TrimEnd(), @"\s{2,}", " ")));
                if (textInfo.LastCharWasSpace = char.IsWhiteSpace(html[html.Length - 1]))
                {
                    outText.Write(' ');
                }
                break;

            case HtmlNodeType.Element:
                string endElementString = null;
                bool isInline;
                bool skip = false;
                int listIndex = 0;
                switch (node.Name)
                {
                    case "nav":
                        skip = true;
                        isInline = false;
                        break;
                    case "body":
                    case "section":
                    case "article":
                    case "aside":
                    case "h1":
                    case "h2":
                    case "header":
                    case "footer":
                    case "address":
                    case "main":
                    case "div":
                    case "p": // stylistic - adjust as you tend to use
                        if (textInfo.IsFirstTextOfDocWritten)
                        {
                            outText.Write("\r\n");
                        }
                        endElementString = "\r\n";
                        isInline = false;
                        break;
                    case "br":
                        outText.Write("\r\n");
                        skip = true;
                        textInfo.WritePrecedingWhiteSpace = false;
                        isInline = true;
                        break;
                    case "a":
                        if (node.Attributes.Contains("href"))
                        {
                            string href = node.Attributes["href"].Value.Trim();
                            if (node.InnerText.IndexOf(href, StringComparison.InvariantCultureIgnoreCase) == -1)
                            {
                                endElementString = "<" + href + ">";
                            }
                        }
                        isInline = true;
                        break;
                    case "li":
                        if (textInfo.ListIndex > 0)
                        {
                            outText.Write("\r\n{0}.\t", textInfo.ListIndex++);
                        }
                        else
                        {
                            // using '*' as bullet char, with tab after, but whatever you want eg "\t->", if utf-8 0x2022
                            outText.Write("\r\n*\t");
                        }
                        isInline = false;
                        break;
                    case "ol":
                        listIndex = 1;
                        goto case "ul";
                    case "ul":
                        // not handling nested lists any differently at this stage - that is getting close to rendering problems
                        endElementString = "\r\n";
                        isInline = false;
                        break;
                    case "img": // inline-block in reality
                        if (node.Attributes.Contains("alt"))
                        {
                            outText.Write('[' + node.Attributes["alt"].Value);
                            endElementString = "]";
                        }
                        if (node.Attributes.Contains("src"))
                        {
                            outText.Write('<' + node.Attributes["src"].Value + '>');
                        }
                        isInline = true;
                        break;
                    default:
                        isInline = true;
                        break;
                }
                if (!skip && node.HasChildNodes)
                {
                    ConvertContentTo(node, outText, isInline ? textInfo : new PreceedingDomTextInfo(textInfo.IsFirstTextOfDocWritten) { ListIndex = listIndex });
                }
                if (endElementString != null)
                {
                    outText.Write(endElementString);
                }
                break;
        }
    }
}

internal class PreceedingDomTextInfo
{
    public PreceedingDomTextInfo(BoolWrapper isFirstTextOfDocWritten)
    {
        IsFirstTextOfDocWritten = isFirstTextOfDocWritten;
    }

    public bool WritePrecedingWhiteSpace { get; set; }
    public bool LastCharWasSpace { get; set; }
    public readonly BoolWrapper IsFirstTextOfDocWritten;
    public int ListIndex { get; set; }
}

internal class BoolWrapper
{
    public BoolWrapper() { }

    public bool Value { get; set; }

    public static implicit operator bool(BoolWrapper boolWrapper)
    {
        return boolWrapper.Value;
    }

    public static implicit operator BoolWrapper(bool boolWrapper)
    {
        return new BoolWrapper { Value = boolWrapper };
    }
}
As an example, the following HTML code...
<!DOCTYPE HTML>
<html>
    <head>
    </head>
    <body>
        <header>
            Whatever Inc.
        </header>
        <main>
            <p>
                Thanks for your enquiry. As this is the 1<sup>st</sup> time you have contacted us, we would like to clarify a few things:
            </p>
            <ol>
                <li>
                    Please confirm this is your email by replying.
                </li>
                <li>
                    Then perform this step.
                </li>
            </ol>
            <p>
                Please solve this <img alt="complex equation" src="http://upload.wikimedia.org/wikipedia/commons/8/8d/First_Equation_Ever.png"/>. Then, in any order, could you please:
            </p>
            <ul>
                <li>
                    a point.
                </li>
                <li>
                    another point, with a <a href="http://en.wikipedia.org/wiki/Hyperlink">hyperlink</a>.
                </li>
            </ul>
            <p>
                Sincerely,
            </p>
            <p>
                The whatever.com team
            </p>
        </main>
        <footer>
            Ph: 000 000 000<br/>
            mail: whatever st
        </footer>
    </body>
</html>
...will be transformed into:
Whatever Inc.
Thanks for your enquiry. As this is the 1st time you have contacted us, we would like to clarify a few things:
1. Please confirm this is your email by replying.
2. Then perform this step.
Please solve this [complex equation<http://upload.wikimedia.org/wikipedia/commons/8/8d/First_Equation_Ever.png>]. Then, in any order, could you please:
* a point.
* another point, with a hyperlink<http://en.wikipedia.org/wiki/Hyperlink>.
Sincerely,
The whatever.com team
Ph: 000 000 000
mail: whatever st
...as opposed to:
Whatever Inc.
Thanks for your enquiry. As this is the 1st time you have contacted us, we would like to clarify a few things:
Please confirm this is your email by replying.
Then perform this step.
Please solve this . Then, in any order, could you please:
a point.
another point, with a hyperlink.
Sincerely,
The whatever.com team
Ph: 000 000 000
mail: whatever st
You could use this:
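// Requires System.Text.RegularExpressions (Regex) and System.Web (HttpUtility).
public static string StripHTML(string HTMLText, bool decode = true)
{
    // Remove anything that looks like a tag: '<', one or more non-'>' characters, '>'.
    Regex reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
    var stripped = reg.Replace(HTMLText, "");
    // Optionally turn entities such as &amp; back into plain characters.
    return decode ? HttpUtility.HtmlDecode(stripped) : stripped;
}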
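Update
Thanks for the comments; I have updated this function to improve it.
I have it from a reliable source that, if you are doing HTML parsing in .NET, you should take another look at the HTML Agility Pack.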
http://www.codeplex.com/htmlagilitypack
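Some examples on SO:
HTML Agility Pack - Parsing tables
Because I wanted conversion to plain text with line feeds and bullets, I found this nice solution on CodeProject, which covers many conversion use cases:
Convert HTML to Plain Text
Yes, it looks big, but it works fine.
Assuming you have well-formed HTML, you could also try an XSL transform.
Here is an example: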
using System;
using System.IO;
using System.Xml.Linq;
using System.Xml.XPath;
using System.Xml.Xsl;

class Html2TextExample
{
    public static string Html2Text(XDocument source)
    {
        var writer = new StringWriter();
        Html2Text(source, writer);
        return writer.ToString();
    }

    public static void Html2Text(XDocument source, TextWriter output)
    {
        Transformer.Transform(source.CreateReader(), null, output);
    }

    public static XslCompiledTransform _transformer;

    public static XslCompiledTransform Transformer
    {
        get
        {
            if (_transformer == null)
            {
                _transformer = new XslCompiledTransform();
                var xsl = XDocument.Parse(@"<?xml version='1.0'?><xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"" exclude-result-prefixes=""xsl""><xsl:output method=""html"" indent=""yes"" version=""4.0"" omit-xml-declaration=""yes"" encoding=""UTF-8"" /><xsl:template match=""/""><xsl:value-of select=""."" /></xsl:template></xsl:stylesheet>");
                _transformer.Load(xsl.CreateNavigator());
            }
            return _transformer;
        }
    }

    static void Main(string[] args)
    {
        var html = XDocument.Parse("<html><body><div>Hello world!</div></body></html>");
        var text = Html2Text(html);
        Console.WriteLine(text);
    }
}
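The easiest approach would probably be tag stripping combined with replacing the layout elements with some text characters, such as dashes for list elements (li) and line breaks for br and p. It shouldn't be too hard to extend this to tables.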
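A minimal sketch of that idea, using only regular expressions from the .NET base class library. The class and method names (SimpleHtmlToText, Convert) are made up for illustration, the choice of "- " as the list marker is arbitrary, and regexes like these will not cope with every real-world page; tables and nested lists are not handled.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class SimpleHtmlToText
{
    public static string Convert(string html)
    {
        // Drop script and style elements together with their content.
        string text = Regex.Replace(html, @"<(script|style)\b[^>]*>.*?</\1\s*>", "",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Whitespace in the HTML source is not layout, so collapse it first.
        text = Regex.Replace(text, @"\s+", " ");

        // <br> and the end of a paragraph become line breaks.
        text = Regex.Replace(text, @"<br\s*/?>|</p\s*>", "\n", RegexOptions.IgnoreCase);

        // Each list item starts on its own line with a dash.
        text = Regex.Replace(text, @"<li\b[^>]*>", "\n- ", RegexOptions.IgnoreCase);

        // Remove every remaining tag.
        text = Regex.Replace(text, @"<[^>]+>", "");

        // Decode entities only after the tags are gone, so that an encoded
        // "&lt;b&gt;" survives as literal text instead of being stripped.
        text = WebUtility.HtmlDecode(text);

        // Remove spaces left hanging around the inserted line breaks.
        text = Regex.Replace(text, @" *\n *", "\n");
        return text.Trim();
    }
}
```

For example, SimpleHtmlToText.Convert("<p>Hello<br/>world</p><ul><li>one</li><li>two &amp; three</li></ul>") yields "Hello", "world", an empty line, "- one" and "- two & three", each on its own line.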
I had some decoding issues with HtmlAgility and I didn't want to invest time investigating them.
Instead I used this utility from the Microsoft Team Foundation API:
var text = HtmlFilter.ConvertToPlainText(htmlContent);
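I don't know C#, but there is a fairly small and easy-to-read Python html2txt script here: http://www.aaronsw.com/2002/html2text/
Another post suggests the HTML Agility Pack:
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't have to understand XPATH or XSLT to use it, don't worry). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant of "real world" malformed HTML. The object model is very similar to the one System.Xml offers, but for HTML documents (or streams).
I have used Detagger in the past. It does a good job of formatting the HTML as text and is more than just a tag remover.
I recently blogged about a solution that uses a Markdown XSLT file to transform the HTML source. The HTML source will of course need to be valid XML first.
Try the easy and practical way: just call StripHTML(WebBrowserControl_name);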
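public string StripHTML(WebBrowser webp)
{
    try
    {
        // Assumption: a Windows Forms WebBrowser and a reference to the mshtml interop
        // assembly; the original snippet used 'doc' without declaring it.
        IHTMLDocument2 doc = webp.Document.DomDocument as IHTMLDocument2;
        doc.execCommand("SelectAll", true, null);
        IHTMLSelectionObject currentSelection = doc.selection;
        if (currentSelection != null)
        {
            IHTMLTxtRange range = currentSelection.createRange() as IHTMLTxtRange;
            if (range != null)
            {
                currentSelection.empty();
                return range.text;
            }
        }
    }
    catch (Exception ep)
    {
        //MessageBox.Show(ep.Message);
    }
    return "";
}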
In Genexus you can do it with a regular expression:
&pattern = '<[^>]+>'
&TSTRPNOT = &TSTRPNOT.ReplaceRegEx(&pattern, "")
You can use the WebBrowser control to render your HTML content in memory. After the LoadCompleted event has fired...
IHTMLDocument2 htmlDoc = (IHTMLDocument2)webBrowser.Document;
string innerHTML = htmlDoc.body.innerHTML;
string innerText = htmlDoc.body.innerText;
If you are using .NET Framework 4.5 you can use System.Net.WebUtility.HtmlDecode(), which takes an HTML-encoded string and returns a decoded string.
Documented on MSDN at: http://msdn.microsoft.com/en-us/library/system.net.webutility.htmldecode(v=vs.110).aspx
You can use this in a Windows Store app as well.
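A short illustration of what that call does and does not do. The wrapper name EntityDecoding.Decode is made up for this sketch; WebUtility.HtmlDecode itself is the framework method mentioned above. Note that it only decodes character entities and leaves any tags in the string untouched, so on its own it is not a tag stripper.

```csharp
using System;
using System.Net;

public static class EntityDecoding
{
    // Thin wrapper around the framework call, kept only to make the example testable.
    public static string Decode(string encoded)
    {
        return WebUtility.HtmlDecode(encoded);
    }
}
```

With this, EntityDecoding.Decode("Tom &amp; Jerry &lt;3") returns "Tom & Jerry <3", while EntityDecoding.Decode("<b>bold</b>") comes back unchanged.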
Here is another solution for converting HTML to text or RTF in C#:
SautinSoft.HtmlToRtf h = new SautinSoft.HtmlToRtf();
h.OutputFormat = HtmlToRtf.eOutputFormat.TextUnicode;
string text = h.ConvertString(htmlString);
This library is not free; it is a commercial product, and it is my own product.