<property>
<name>http.max.delays</name>
<value>20</value>
<description>The number of times a thread will delay when trying to
fetch a page. Each time it finds that a host is busy, it will wait
fetcher.server.delay. After http.max.delays attempts, it will give
up on the page for now.</description>
</property>
How many times a fetcher thread will wait on a busy host before giving up on the page; each wait lasts fetcher.server.delay seconds, so the effective patience depends on network conditions. If the crawler keeps finding that the server is busy while it runs, the wait time is governed by fetcher.server.delay, so under poor network conditions it is better to set fetcher.server.delay a little higher as well; http.timeout is also related to network conditions.
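As a minimal sketch (assuming the usual pattern of overriding these values in nutch-site.xml; the numbers are illustrative, not recommendations), a more patient configuration for a slow network might look like this:
<property>
<name>http.max.delays</name>
<value>100</value>
</property>
<property>
<name>fetcher.server.delay</name>
<value>10.0</value>
</property>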
<property>
<name>http.content.limit</name>
<value>65536</value>
<description>The length limit for downloaded content, in bytes.
If this value is nonnegative (>=0), content longer than it will be truncated;
otherwise, no truncation at all.
</description>
</property>
This setting limits the length of the document content the crawler fetches. The original value is 65536, meaning a fetched document is truncated at roughly 64KB and anything beyond that is discarded. Search engines that crawl specific kinds of content, such as XML documents, need to change this setting.
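Per the description above, any negative value disables truncation entirely. A sketch of such an override in nutch-site.xml:
<property>
<name>http.content.limit</name>
<value>-1</value>
</property>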
<property>
<name>db.default.fetch.interval</name>
<value>30</value>
<description>The default number of days between re-fetches of a page.
</description>
</property>
This is useful when developing applications that need periodic automatic recrawls; it sets how many days pass before a page is fetched again.
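For example, a weekly recrawl (7 is an illustrative value) could be configured in nutch-site.xml like so:
<property>
<name>db.default.fetch.interval</name>
<value>7</value>
</property>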
<property>
<name>fetcher.server.delay</name>
<value>5.0</value>
<description>The number of seconds the fetcher will delay between
successive requests to the same server.</description>
</property>
<property>
<name>fetcher.threads.fetch</name>
<value>10</value>
<description>The number of FetcherThreads the fetcher should use.
This also determines the maximum number of requests that are
made at once (each FetcherThread handles one connection).</description>
</property>
<property>
<name>fetcher.threads.per.host</name>
<value>1</value>
<description>This number is the maximum number of threads that
should be allowed to access a host at one time.</description>
</property>
<property>
<name>fetcher.verbose</name>
<value>false</value>
<description>If true, fetcher will log more verbosely.</description>
</property>
These are the fetcher-thread settings; their names make them self-explanatory.
<property>
<name>parser.threads.parse</name>
<value>10</value>
<description>Number of ParserThreads ParseSegment should use.</description>
</property>
The number of threads used to parse fetched documents. It corresponds to the fetcher threads: the fetcher's main processing classes use synchronization in many places, so keeping this setting consistent with the number of fetcher threads is good for processing.
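A sketch of keeping the two counts in step in nutch-site.xml (20 is an illustrative number; size it to your bandwidth and CPU):
<property>
<name>fetcher.threads.fetch</name>
<value>20</value>
</property>
<property>
<name>parser.threads.parse</name>
<value>20</value>
</property>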
<property>
<name>fs.default.name</name>
<value>local</value>
<description>The name of the default file system. Either the
literal string "local" or a host:port for NDFS.</description>
</property>
Configuration for the distributed file system. The default value, local, means the local file system is used; a host:port value means the NDFS distributed file system is used. The file system address here is the name server, i.e. the host and port of the node started with bin/nutch namenode xxxx.
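A sketch of pointing Nutch at an NDFS name node (host and port here are placeholders for wherever your namenode is listening):
<property>
<name>fs.default.name</name>
<value>namenode.example.com:9000</value>
</property>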
<property>
<name>ndfs.name.dir</name>
<value>/tmp/nutch/ndfs/name</value>
<description>Determines where on the local filesystem the NDFS name node
should store the name table.</description>
</property>
The directory where the distributed file system's namenode stores the name table. The namenode uses this setting; a path passed as an argument when starting the namenode or datanode also takes effect.
<property>
<name>ndfs.data.dir</name>
<value>/tmp/nutch/ndfs/data</value>
<description>Determines where on the local filesystem an NDFS data node
should store its blocks.</description>
</property>
The directory where a distributed file system datanode stores its blocks. The datanode uses this setting; a path passed as an argument when starting the datanode also takes effect.
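To move both directories off /tmp, which many systems wipe on reboot, an override might look like this (paths are placeholders):
<property>
<name>ndfs.name.dir</name>
<value>/data/nutch/ndfs/name</value>
</property>
<property>
<name>ndfs.data.dir</name>
<value>/data/nutch/ndfs/data</value>
</property>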
<property>
<name>indexer.max.tokens</name>
<value>10000</value>
<description>
The maximum number of tokens that will be indexed for a single field
in a document. This limits the amount of memory required for
indexing, so that collections with very large files will not crash
the indexing process by running out of memory.
Note that this effectively truncates large documents, excluding
from the index tokens that occur further in the document. If you
know your source documents are large, be sure to set this value
high enough to accommodate the expected size. If you set it to
Integer.MAX_VALUE, then the only limit is your memory, but you
should anticipate an OutOfMemoryError.
</description>
</property>
This setting caps each field of a document at 10000 tokens at indexing time. With the default unigram analyzer that amounts to a 10000-character limit per document; with a different, non-unigram Chinese analyzer, a single field of a single document can cover more than 10000 characters, which has an impact on memory.
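If you know your documents run long (per the description above), a sketch raising the cap, with an illustrative value:
<property>
<name>indexer.max.tokens</name>
<value>100000</value>
</property>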
<property>
<name>indexer.mergeFactor</name>
<value>200</value>
<description>The factor that determines the frequency of Lucene segment
merges. This must not be less than 2; higher values increase indexing
speed but lead to increased RAM usage, and increase the number of
open file handles (which may lead to "Too many open files" errors).
NOTE: the "segments" here have nothing to do with Nutch segments, they
are a low-level data unit used by Lucene.
</description>
</property>
The merge factor, used while building the index: with the value of 200, Lucene merges segments in batches of 200 before writing the result back to storage. Higher values speed up indexing but cost more RAM and open file handles.
<property>
<name>indexer.minMergeDocs</name>
<value>50</value>
<description>This number determines the minimum number of Lucene
Documents buffered in memory between Lucene segment merges. Larger
values increase indexing speed and increase RAM usage.
</description>
</property>
This setting has a huge impact on memory. It sets the minimum number of documents merged in one go while building the index. Setting it too small hurts indexing speed, and when a very large number of documents must be indexed it can also trigger "Too many open files" errors, at which point this value needs adjusting. Experiments suggest 1000 gives fairly fast indexing; when I raised it to 10000, peak memory use during indexing hit 1.8GB at an indexing speed of 25 pages/sec, with some slowdown across repeated indexing runs. It does, however, greatly improve query response time, so if memory allows, a larger value is better.
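Based on the experiment above, a middle-ground override might be (1000 is the value from that experiment; adjust to your available RAM):
<property>
<name>indexer.minMergeDocs</name>
<value>1000</value>
</property>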
<property>
<name>indexer.maxMergeDocs</name>
<value>50</value>
<description>This number determines the maximum number of Lucene
Documents to be merged into a new Lucene segment. Larger values
increase indexing speed and reduce the number of Lucene segments,
which reduces the number of open file handles; however, this also
increases RAM usage during indexing.
</description>
</property>
This one does not seem to need changing, since the default is Integer.MAX_VALUE and nothing will ever exceed it.
<property>
<name>searcher.summary.context</name>
<value>5</value>
<description>
The number of context terms to display preceding and following
matching terms in a hit summary.
</description>
</property>
This one is quite useful; it was covered in an earlier article.
<property>
<name>searcher.summary.length</name>
<value>20</value>
<description>
The total number of terms to display in a hit summary.
</description>
</property>
Also covered in an earlier article.
<property>
<name>plugin.folders</name>
<value>plugins</value>
<description>Directories where nutch plugins are located. Each
element may be a relative or absolute path. If absolute, it is used
as is. If relative, it is searched for on the classpath.</description>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|rss)|index-more|query-(basic|site|url)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins.
</description>
</property>
These two settings configure the plugin system: plugin.folders specifies the path plugins are loaded from, and plugin.includes lists the plugins to load. Plugins will get a dedicated introduction later.
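For example, to also parse PDF files you could extend the list in nutch-site.xml; a sketch assuming a parse-pdf plugin is present in your plugins directory:
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(text|html|rss|pdf)|index-more|query-(basic|site|url)</value>
</property>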
<property>
<name>parser.character.encoding.default</name>
<value>windows-1252</value>
<description>The character encoding to fall back to when no other information
is available</description>
</property>
The default encoding used when parsing documents. windows-1252 seems to be a fairly rarely used encoding; I am not very familiar with it.
<property>
<name>parser.html.impl</name>
<value>neko</value>
<description>HTML Parser implementation. Currently the following keywords
are recognized: "neko" uses NekoHTML, "tagsoup" uses TagSoup.
</description>
</property>
Specifies the parser used for HTML documents. NekoHTML is quite powerful; a later article will cover Neko in detail, including parsing HTML to text and handling HTML fragments.
<property>
<name>extension.clustering.hits-to-cluster</name>
<value>100</value>
<description>Number of snippets retrieved for the clustering extension
if clustering extension is available and user requested results
to be clustered.</description>
</property>
Clustering; applications that need to cluster search results may use this.
<property>
<name>extension.ontology.extension-name</name>
<value></value>
<description>Use the specified online ontology extension. If empty,
the first available extension will be used. The "name" here refers to an "id"
attribute of the "implementation" element in the plugin descriptor XML
file.</description>
</property>
Artificial intelligence. I will get into this feature step by step in my future development, and will introduce it once I have gained some relevant experience.
<property>
<name>query.url.boost</name>
<value>4.0</value>
<description> Used as a boost for url field in Lucene query.
</description>
</property>
<property>
<name>query.anchor.boost</name>
<value>2.0</value>
<description> Used as a boost for anchor field in Lucene query.
</description>
</property>
<property>
<name>query.title.boost</name>
<value>1.5</value>
<description> Used as a boost for title field in Lucene query.
</description>
</property>
<property>
<name>query.host.boost</name>
<value>2.0</value>
<description> Used as a boost for host field in Lucene query.
</description>
</property>
<property>
<name>query.phrase.boost</name>
<value>1.0</value>
<description> Used as a boost for phrase in Lucene query.
Multiplied by the boost of the field the phrase is matched in.
</description>
</property>
The boost factors above feed into search-result scoring; result ranking will get a dedicated introduction later. These settings are not of much use for vertical search.
<property>
<name>lang.analyze.max.length</name>
<value>2048</value>
<description> The maximum bytes of data to uses to indentify
the language (0 means full content analysis).
The larger is this value, the better is the analysis, but the
slowest it is.
</description>
</property>
Related to language identification; it comes into play during analysis, though I have not used this setting myself.
A few more important settings belong in nutch-site.xml.
<property>
<name>searcher.dir</name>
<value>C:\</value>
</property>
This works in one of two ways. If the directory it points to contains a search-servers.txt file, that file is handled first: every line matching the format
hostname port
identifies a distributed search server, and query requests are sent to that server. If there is no such file, Nutch looks for a segments directory, which holds the local index.
If neither is found, it reports an error.
search-servers.txt
Its contents are simple, for example:
127.0.0.1 9999
Note, however, that the server on port 9999 is a search server, started with the command bin/nutch server 9999.
This looks much like starting the namenode; when I first encountered it I assumed it was the namenode address and was confused for quite a while.
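Putting the pieces together, a sketch of a distributed-search front end (path, host, and port are placeholders): in nutch-site.xml,
<property>
<name>searcher.dir</name>
<value>/data/nutch/search</value>
</property>
with /data/nutch/search/search-servers.txt containing:
127.0.0.1 9999
and the search server on that machine started with bin/nutch server 9999 as described above.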
The namenode and the search server do not integrate well: there is no built-in interface for the search server to read files directly from the namenode, so you have to develop one yourself. If anyone knows of a way, or an existing program, to go directly from the namenode to the search server, please let me know, since I need it; if nothing turns up I will have to write it myself.
My current method of getting data from the namenode to the search server is fairly crude and not worth recommending, so I will not describe it.


