C#: Extracting the meaningful content of web pages with regular expressions

2010-05-05 21:27

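An important step in a search engine is extracting the meaningful content from web pages. Put simply, this means stripping the HTML markup out of the HTML text and keeping only what we would see when opening the document in a browser such as IE (images are not considered here).
The markup in an HTML text falls into four groups: comments, script, style, and all other tags. Each group is removed in turn: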
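1. Remove comments. The regex is:
// Assumed pattern: the original one is missing from the source text.
output = Regex.Replace(input, @"<!--[\s\S]*?-->", string.Empty, RegexOptions.IgnoreCase);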
2. Remove script. The regexes are:
output = Regex.Replace(input, @"<script[^>]*?>.*?</script>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
output2 = Regex.Replace(output, @"<noscript[^>]*?>.*?</noscript>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
3. Remove style. The regex is:
output = Regex.Replace(input, @"<style[^>]*?>.*?</style>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
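4. Remove the remaining HTML tags, then decode the common entities:
result = result.Replace("<br>", "\r\n");
result = Regex.Replace(result, @"<[\s\S]*?>", string.Empty, RegexOptions.IgnoreCase);
// Decode entities only after the tags are gone, so that a decoded "<" or ">" is not mistaken for markup.
result = result.Replace("&nbsp;", " ");
result = result.Replace("&quot;", "\"");
result = result.Replace("&lt;", "<");
result = result.Replace("&gt;", ">");
// "&amp;" goes last so that text such as "&amp;lt;" is not decoded twice.
result = result.Replace("&amp;", "&");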
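Step 4 handles only a handful of entities by hand. As a minimal sketch of an alternative, the tags can be stripped first and the entities decoded afterwards with System.Net.WebUtility.HtmlDecode, which is part of .NET. The class name TagStripSketch, the method name StripTags, and the wider <br\s*/?> pattern are assumptions made for this illustration and are not part of the original code.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

static class TagStripSketch
{
    // Hypothetical helper: strips tags, then decodes entities.
    public static string StripTags(string html)
    {
        // Turn <br>, <br/> and <br /> into line breaks while the tags are still present.
        string text = Regex.Replace(html, @"<br\s*/?>", "\r\n", RegexOptions.IgnoreCase);
        // Remove every remaining tag.
        text = Regex.Replace(text, @"<[\s\S]*?>", string.Empty);
        // Decode entities last, so a decoded "<" or ">" cannot be mistaken for markup.
        return WebUtility.HtmlDecode(text);
    }

    static void Main()
    {
        Console.WriteLine(StripTags("<p>1 &lt; 2 &amp;&amp; 3 &gt; 2<br>done</p>"));
    }
}
```

If the entities were decoded before the tags were stripped, the text "1 < 2 && 3 > 2" would lose everything between the "<" and the ">". Note that HtmlDecode turns &nbsp; into a non-breaking space (U+00A0) rather than a plain space, so a further Replace may be wanted if ordinary spaces are required.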
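As you can see in the code above, I use the RegexOptions.Singleline option. This option is important: it makes "." (the dot) match newline characters as well. Without it, a pattern such as <script[^>]*?>.*?</script> cannot match a block that spans several lines, so in most cases the regular expressions listed above fail to remove the markup.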
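The effect of RegexOptions.Singleline can be checked with a small experiment. The following is only a sketch; the class name SinglelineDemo, the helper StripScript, and the sample HTML are made up for the illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class SinglelineDemo
{
    // Removes <script> blocks; the options are passed in so both modes can be compared.
    public static string StripScript(string html, RegexOptions options)
    {
        return Regex.Replace(html, @"<script[^>]*?>.*?</script>", string.Empty, options);
    }

    static void Main()
    {
        // A script block that spans several lines.
        string html = "<p>before</p><script>\nvar x = 1;\n</script><p>after</p>";

        // Without Singleline, "." stops at '\n', so the pattern never reaches </script>.
        string unchanged = StripScript(html, RegexOptions.IgnoreCase);
        Console.WriteLine(unchanged == html);   // True: nothing was removed

        // With Singleline, "." also matches '\n' and the whole block is removed.
        string stripped = StripScript(html, RegexOptions.IgnoreCase | RegexOptions.Singleline);
        Console.WriteLine(stripped);            // <p>before</p><p>after</p>
    }
}
```

The same behaviour can be had without the option by writing [\s\S] instead of ".", which is what the final tag-stripping pattern does.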
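HTML syntax has grown quite complex over the years, and the list above covers only the most common kinds of markup. I will add more tag-removal regexes as the development of Rost WebSpider goes on.
Below is a C# class that extracts the meaningful content from an HTML string: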
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

class HtmlExtract
{
    #region private attributes
    private string _strHtml;
    #endregion

    #region public methods
    public HtmlExtract(string inStrHtml)
    {
        _strHtml = inStrHtml;
    }

    // Returns the visible text of the HTML passed to the constructor.
    public string ExtractText()
    {
        string result = _strHtml;
        result = RemoveComment(result);
        result = RemoveScript(result);
        result = RemoveStyle(result);
        result = RemoveTags(result);
        return result.Trim();
    }
    #endregion

    #region private methods
    private string RemoveComment(string input)
    {
        string result = input;
        // Remove comments. Assumed pattern: the original one is missing from the source text.
        result = Regex.Replace(result, @"<!--[\s\S]*?-->", string.Empty, RegexOptions.IgnoreCase);
        return result;
    }

    private string RemoveStyle(string input)
    {
        string result = input;
        // Remove all style blocks.
        result = Regex.Replace(result, @"<style[^>]*?>.*?</style>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return result;
    }

    private string RemoveScript(string input)
    {
        string result = input;
        // Remove script blocks and their noscript fallbacks.
        result = Regex.Replace(result, @"<script[^>]*?>.*?</script>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
        result = Regex.Replace(result, @"<noscript[^>]*?>.*?</noscript>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return result;
    }

    private string RemoveTags(string input)
    {
        string result = input;
        result = result.Replace("<br>", "\r\n");
        result = Regex.Replace(result, @"<[\s\S]*?>", string.Empty, RegexOptions.IgnoreCase);
        // Decode entities only after the tags are gone, so that a decoded "<" or ">" is not mistaken for markup.
        result = result.Replace("&nbsp;", " ");
        result = result.Replace("&quot;", "\"");
        result = result.Replace("&lt;", "<");
        result = result.Replace("&gt;", ">");
        // "&amp;" goes last so that text such as "&amp;lt;" is not decoded twice.
        result = result.Replace("&amp;", "&");
        return result;
    }
    #endregion
}
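
To try the whole pipeline on a small page, the four steps can be condensed into a single stand-alone function. This is a sketch rather than the class above: the names ExtractSketch and ExtractText, the comment pattern, and the sample page are assumptions made for the illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class ExtractSketch
{
    // Condensed version of the four steps: comments, script, style, other tags.
    public static string ExtractText(string html)
    {
        RegexOptions opts = RegexOptions.IgnoreCase | RegexOptions.Singleline;
        string text = Regex.Replace(html, @"<!--[\s\S]*?-->", string.Empty);
        text = Regex.Replace(text, @"<script[^>]*?>.*?</script>", string.Empty, opts);
        text = Regex.Replace(text, @"<noscript[^>]*?>.*?</noscript>", string.Empty, opts);
        text = Regex.Replace(text, @"<style[^>]*?>.*?</style>", string.Empty, opts);
        text = text.Replace("<br>", "\r\n");
        text = Regex.Replace(text, @"<[\s\S]*?>", string.Empty);
        // Entities are decoded after the tags are removed; "&amp;" goes last.
        text = text.Replace("&nbsp;", " ").Replace("&quot;", "\"")
                   .Replace("&lt;", "<").Replace("&gt;", ">")
                   .Replace("&amp;", "&");
        return text.Trim();
    }

    static void Main()
    {
        string page =
            "<html><head><style>\np { color: red; }\n</style>" +
            "<script>\nalert('hi');\n</script></head>" +
            "<body><!-- banner --><p>Hello &amp; welcome</p></body></html>";
        Console.WriteLine(ExtractText(page));   // Hello & welcome
    }
}
```

The style and script contents never reach the output because those blocks are removed as a whole before the generic tag pattern runs. If they were left to the generic pattern, only the tags would disappear and the CSS and JavaScript text would remain.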
