Nutch version 0.8 Installation Guide
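1. Requirements
1.1 Java 1.4 or later. Linux is the recommended operating system, with a JVM from either Sun or IBM. Remember to set the environment variable NUTCH_JAVA_HOME to the root of your JVM installation. For example, I installed JDK 1.5 in c:\jdk1.5, so my setting is NUTCH_JAVA_HOME=c:\jdk1.5 (this is how it is set on win32).
1.2 On the server side, Apache's Tomcat 4.x or later is recommended.
1.3 To install Nutch on win32, install cygwin to provide a Linux-style shell.
1.4 Installing Nutch takes on the order of a gigabyte of disk space, a high-speed connection, and about an hour of time.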
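The NUTCH_JAVA_HOME requirement in 1.1 can be checked before going further. The following is a minimal sketch, assuming a Linux or cygwin shell and assuming the JVM keeps its launcher at bin/java under the installation root; the function name check_nutch_java_home is made up for illustration and is not part of Nutch.

```shell
#!/bin/sh
# Hypothetical helper, not part of Nutch.
# Return 0 when NUTCH_JAVA_HOME points at a directory containing an
# executable bin/java; otherwise print a hint to stderr and return 1.
check_nutch_java_home() {
    if [ -z "${NUTCH_JAVA_HOME:-}" ]; then
        echo "NUTCH_JAVA_HOME is not set" >&2
        return 1
    fi
    # Assumption: the JVM launcher lives at bin/java under the install root.
    if [ ! -x "$NUTCH_JAVA_HOME/bin/java" ]; then
        echo "no executable found at $NUTCH_JAVA_HOME/bin/java" >&2
        return 1
    fi
    return 0
}
```

Calling check_nutch_java_home after exporting the variable gives a quick yes or no before running any bin/nutch command.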
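2. Getting Started
2.1 First, you need a copy of the Nutch source. You can download a Nutch release from http://lucene.apache.org/nutch/release/ and unpack the downloaded archive. Alternatively, get the latest source through subversion and build Nutch with Ant.
2.2 Once that is done, you can check whether the installation works with the following command.
In the Nutch directory, type  bin/nutch
If it prints documentation for the Nutch command script, congratulations: you have taken an important step toward success.
2.3 Now we can prepare to crawl material for our search engine. There are two approaches to crawling:
2.3.1 Intranet crawling, with the crawl command
2.3.2 Whole-web crawling. Besides the crawl command above, this uses lower-level commands that give more control, such as inject, generate, fetch and updatedb.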
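3. Intranet Crawling (this test did not pass)
Intranet crawling is suited to web sites with on the order of a million pages.
3.1 Intranet: Configuration
To configure intranet crawling you must do the following:
3.1.1 In the Nutch directory, create a directory named urls that holds a plain text file of root URLs. For example, to crawl the nutch site you might create a text file named nutch containing just the nutch home page. All other Nutch pages will be reached from this page. The urls/nutch file would then contain: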
http://lucene.apache.org/nutch/
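Step 3.1.1 can be done by hand in an editor, or scripted. A small sketch, assuming it is run from the Nutch directory; make_seed_dir is a hypothetical helper name, and the directory argument defaults to the urls directory used in the text.

```shell
#!/bin/sh
# Hypothetical helper, not part of Nutch.
# Create <dir>/nutch containing the single seed URL from step 3.1.1.
make_seed_dir() {
    dir=${1:-urls}
    mkdir -p "$dir" || return 1
    # One root URL per line; here there is only the nutch home page.
    printf '%s\n' 'http://lucene.apache.org/nutch/' > "$dir/nutch"
}
```

Running make_seed_dir with no argument writes urls/nutch in the current directory.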
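3.1.2 Next, edit conf/crawl-urlfilter.txt in the nutch directory and replace MY.DOMAIN.NAME with the domain you want to crawl. For example, to limit the crawl to the apache.org domain, replace MY.DOMAIN.NAME in that file with apache.org. After the replacement the line reads: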
+^http://([a-z0-9]*\.)*apache.org/
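This line includes any URL in the apache.org domain.
3.2 Intranet: Running the Crawl
Once configured, running the crawl is simple: just use the crawl command. It accepts the following options:
-dir  dir names the directory in which the crawled data is stored
-threads threads sets the number of threads that run in parallel
-depth depth gives the link depth, counted from the root page, to crawl
-topN topN sets the maximum number of pages retrieved at each level of depth
For example, a typical command is: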
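To get a feel for which URLs the filter line accepts, the pattern can be tried out with grep. This is only an approximation: Nutch evaluates the filter with Java regular expressions, while the sketch below uses POSIX extended regular expressions, which agree for this simple pattern. The function name url_is_included is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper, not part of Nutch.
# Return 0 if the URL matches the pattern from crawl-urlfilter.txt.
# The leading "+" in the filter file marks an include rule and is not
# part of the regular expression itself.
url_is_included() {
    printf '%s\n' "$1" | grep -Eq '^http://([a-z0-9]*\.)*apache.org/'
}

for url in http://lucene.apache.org/nutch/ http://apache.org/ http://www.example.com/; do
    if url_is_included "$url"; then
        echo "include $url"
    else
        echo "skip    $url"
    fi
done
```

The first two URLs are included and the third is skipped, since its host does not end in apache.org.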
bin/nutch crawl urls -dir crawl -depth 3 -topN 50
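Once the command finishes, skip ahead to the searching section (see 5).
4. Whole-web Crawling
Whole-web crawling is designed to handle very large crawls, which may take several weeks to complete and require multiple machines.
4.1 Download http://rdf.dmoz.org/rdf/content.rdf.u8.gz and decompress it with: gunzip content.rdf.u8.gz
4.2 Create a directory: mkdir dmoz
4.3 Select roughly one out of every 5000 URL records and store the selection in the urls file: bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 > dmoz/urls
4.4 Initialize the crawldb: bin/nutch inject crawl/crawldb dmoz
4.5 Generate a fetchlist from the crawldb: bin/nutch generate crawl/crawldb crawl/segments
4.6 The fetchlist is placed in a newly created segment directory, which is named after the time it was created. We save the name of this segment in the shell variable s1: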
s1=`ls -d crawl/segments/2* | tail -1`
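echo $s1 prints something like crawl/segments/2006*******, where the asterisks stand for the digits of the month, day and time, e.g. 20060703150028
4.7 Fetch this segment: bin/nutch fetch $s1
4.8 When that completes, update the database with the results: bin/nutch updatedb crawl/crawldb $s1
4.9 The database now has entries for the pages referenced by the initial set. Next, fetch a new segment of 1000 pages: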
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s2=`ls -d crawl/segments/2* | tail -1`
echo $s2
bin/nutch fetch $s2
bin/nutch updatedb crawl/crawldb $s2
4.10 Let's fetch one more round:
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s3=`ls -d crawl/segments/2* | tail -1`
echo $s3
bin/nutch fetch $s3
bin/nutch updatedb crawl/crawldb $s3
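Steps 4.9 and 4.10 repeat the same generate, fetch and updatedb sequence, so the round can be wrapped in a function. The following is a sketch under assumptions: it is run from the Nutch directory, the NUTCH variable and the crawl_round name are made up for illustration, and the only bin/nutch invocations used are the ones already shown above.

```shell
#!/bin/sh
# Path to the nutch launcher. NUTCH is a hypothetical override for
# illustration, not a variable that Nutch itself reads.
NUTCH=${NUTCH:-bin/nutch}

# Run one generate/fetch/updatedb round against crawl/crawldb.
# $1 is the -topN value, as in steps 4.9 and 4.10.
crawl_round() {
    topn=$1
    "$NUTCH" generate crawl/crawldb crawl/segments -topN "$topn" || return 1
    # Segment names are timestamps, so the newest one sorts last.
    segment=$(ls -d crawl/segments/2* | tail -1)
    echo "fetching $segment"
    "$NUTCH" fetch "$segment" || return 1
    "$NUTCH" updatedb crawl/crawldb "$segment" || return 1
}
```

With this in place, the two rounds above become crawl_round 1000 followed by another crawl_round 1000. The function stops at the first command that fails, which avoids updating the crawldb from a segment that was never fetched.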
4.11 Before indexing, invert the links:
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
4.12 Build the index with the index command: bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
5. Searching
5.1 Remove the ROOT directory (a .war file placed under webapps is unpacked automatically): rm -rf ~/local/tomcat/webapps/ROOT*
5.2 Copy the war file: cp nutch*.war ~/local/tomcat/webapps/ROOT.war
5.3 Edit the nutch-site.xml file under tomcat/webapps/ROOT/WEB-INF/classes as follows:
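Before moving on to searching, it can help to confirm that the crawl directory holds the four paths named in the commands above (crawldb, linkdb, segments, indexes). A minimal sketch, assuming each of them ends up as a directory under crawl; check_crawl_dir is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper, not part of Nutch.
# Return 0 if every expected subdirectory exists under the crawl directory,
# otherwise list the missing ones on stderr and return 1.
check_crawl_dir() {
    crawl=${1:-crawl}
    missing=0
    for sub in crawldb linkdb segments indexes; do
        if [ ! -d "$crawl/$sub" ]; then
            echo "missing: $crawl/$sub" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A missing indexes entry, for example, would suggest that step 4.12 has not run or did not finish.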
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
   <name>searcher.dir</name>
   <value>/home/crawl/nutch-0.8-dev/crawl</value> <!-- the crawl directory that holds the index -->
</property>
</configuration>

P.S. The instructions above are missing one step:
3.1.2
Edit the file conf/nutch-site.xml, insert at least the following properties into it, and fill in proper values for them:

<property>
<name>http.agent.name</name>
<value></value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.

NOTE: You should also check other related properties:

    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version

and set their values appropriately.

</description>
</property>

<property>
<name>http.agent.description</name>
<value></value>
<description>Further description of our bot- this text is used in
the User-Agent header. It appears in parenthesis after the agent name.
</description>
</property>

<property>
<name>http.agent.url</name>
<value></value>
<description>A URL to advertise in the User-Agent header. This will
   appear in parenthesis after the agent name. Custom dictates that this
   should be a URL of a page explaining the purpose and behavior of this
   crawler.
</description>
</property>

<property>
<name>http.agent.email</name>
<value></value>
<description>An email address to advertise in the HTTP 'From' request
   header and User-Agent header. A good practice is to mangle this
   address (e.g. 'info at example dot com') to avoid spamming.
</description>
</property>
Only then does it work. Otherwise all you get is NullPointerException
and nothing is crawled.
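Since an empty http.agent.name is what breaks the crawl, a quick check of conf/nutch-site.xml before crawling may save a failed run. This is a rough sketch, not a real XML parser: it assumes the name and value elements each sit on their own line, with the value line directly after the name line, as in the listing above. The function name agent_name_is_set is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper, not part of Nutch.
# Return 0 if the http.agent.name property in the given file has a
# non-empty value, 1 otherwise (including when the property is absent).
agent_name_is_set() {
    # Print the line that follows the http.agent.name <name> line,
    # then strip the <value> tags and surrounding whitespace.
    value=$(sed -n '/<name>http\.agent\.name<\/name>/{n;p;}' "$1" |
        sed -e 's/.*<value>//' -e 's/<\/value>.*//' \
            -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//')
    [ -n "$value" ]
}
```

For example, agent_name_is_set conf/nutch-site.xml || echo "set http.agent.name first" prints the reminder while the value is still empty.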
