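| An important step in a search engine is extracting the meaningful content from web pages. Put simply, this means stripping the HTML markup from the HTML text and keeping what we would see when opening the document in a browser such as IE (images are not considered here).

The markup in an HTML text can be divided into comments, script, style, and all other tags, and each kind is removed in turn:

1. Remove comments. The regular expression is:

    // non-greedy match, so comments that contain hyphens or line breaks are removed too
    output = Regex.Replace(input, @"<!--.*?-->", string.Empty, RegexOptions.Singleline);

2. Remove script. The regular expressions are:

    output = Regex.Replace(input, @"<script[^>]*?>.*?</script>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
    output = Regex.Replace(output, @"<noscript[^>]*?>.*?</noscript>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);

3. Remove style. The regular expression is:

    output = Regex.Replace(input, @"<style[^>]*?>.*?</style>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);

4. Remove all other HTML tags and decode the common entities:

    result = result.Replace("<br>", "\r\n");
    result = Regex.Replace(result, @"<[\s\S]*?>", string.Empty, RegexOptions.IgnoreCase);
    // decode entities only after the tags are gone, so a decoded "<" or ">" is not mistaken for markup
    result = result.Replace("&nbsp;", " ");
    result = result.Replace("&quot;", "\"");
    result = result.Replace("&lt;", "<");
    result = result.Replace("&gt;", ">");
    result = result.Replace("&amp;", "&");

As the code above shows, I use the RegexOptions.Singleline option. It matters because it lets "." (the dot) match newline characters. Without it, the regular expressions above fail to remove the markup in most cases, because tags and their content usually span several lines.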
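To make the effect of RegexOptions.Singleline concrete, here is a minimal sketch that applies the script-removal pattern from step 2 with and without the option. The class name `SinglelineDemo` and the helper `StripScript` are hypothetical names chosen for this illustration; only `Regex.Replace` and `RegexOptions` come from the original code.

```csharp
using System;
using System.Text.RegularExpressions;

static class SinglelineDemo
{
    // Hypothetical helper for illustration: removes <script> blocks
    // using the step 2 pattern and whatever options the caller passes.
    public static string StripScript(string html, RegexOptions options)
    {
        return Regex.Replace(html, @"<script[^>]*?>.*?</script>", string.Empty, options);
    }
}
```

For the input `"<p>a</p><script>\nvar x = 1;\n</script><p>b</p>"`, calling `StripScript` with only `RegexOptions.IgnoreCase` returns the input unchanged, because "." stops at the first line break inside the script body. Adding `RegexOptions.Singleline` returns `"<p>a</p><p>b</p>"`.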
HTML has grown into a fairly complex syntax, and the steps above cover only the most important kinds of markup. I will add more tag-removal regular expressions during the development of Rost WebSpider.

The following C# class extracts the meaningful content from an HTML string:

    using System.Text.RegularExpressions;

    class HtmlExtract
    {
        #region private attributes
        private string _strHtml;
        #endregion

        #region public methods
        public HtmlExtract(string inStrHtml)
        {
            _strHtml = inStrHtml;
        }

        // Runs every removal step in order and returns the remaining text.
        public string ExtractText()
        {
            string result = _strHtml;
            result = RemoveComment(result);
            result = RemoveScript(result);
            result = RemoveStyle(result);
            result = RemoveTags(result);
            return result.Trim();
        }
        #endregion

        #region private methods
        private string RemoveComment(string input)
        {
            // non-greedy match, so comments that contain hyphens or line breaks are removed too
            return Regex.Replace(input, @"<!--.*?-->", string.Empty, RegexOptions.Singleline);
        }

        private string RemoveStyle(string input)
        {
            // remove all style blocks together with their content
            return Regex.Replace(input, @"<style[^>]*?>.*?</style>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
        }

        private string RemoveScript(string input)
        {
            // remove script and noscript blocks together with their content
            string result = input;
            result = Regex.Replace(result, @"<script[^>]*?>.*?</script>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
            result = Regex.Replace(result, @"<noscript[^>]*?>.*?</noscript>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
            return result;
        }

        private string RemoveTags(string input)
        {
            string result = input;
            result = result.Replace("<br>", "\r\n");
            result = Regex.Replace(result, @"<[\s\S]*?>", string.Empty, RegexOptions.IgnoreCase);
            // decode entities only after the tags are gone, so a decoded "<" or ">" is not mistaken for markup
            result = result.Replace("&nbsp;", " ");
            result = result.Replace("&quot;", "\"");
            result = result.Replace("&lt;", "<");
            result = result.Replace("&gt;", ">");
            result = result.Replace("&amp;", "&");
            return result;
        }
        #endregion
    }
|
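The class above decodes only five entities by hand. As a possible extension, and not something the original code does, the sketch below swaps that chain of `Replace` calls for `System.Net.WebUtility.HtmlDecode`, which is part of the .NET base class library. The class name `EntityDecodeSketch` and the method `TagsToText` are hypothetical, and the more permissive `<br>` pattern is an assumption of this sketch.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

static class EntityDecodeSketch
{
    // Hypothetical helper for illustration: turns <br> into a line break,
    // strips the remaining tags, then decodes entities in one call.
    public static string TagsToText(string html)
    {
        // assumption: also accept <BR>, <br/> and <br /> as line breaks
        string result = Regex.Replace(html, @"<br\s*/?>", "\r\n", RegexOptions.IgnoreCase);
        // same catch-all tag pattern as in the original RemoveTags step
        result = Regex.Replace(result, @"<[\s\S]*?>", string.Empty);
        // decode after stripping, so a decoded "<" is never treated as a tag
        return WebUtility.HtmlDecode(result);
    }
}
```

Decoding after the tags are stripped keeps text such as `1 &lt; 2` intact. Note that `HtmlDecode` turns `&nbsp;` into a non-breaking space (U+00A0), not the ordinary space that the hand-written replacement produces.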