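1. Introduction
Heritrix 3.x differs substantially from Heritrix 1.x: it introduces an entirely new configuration model and changes the extension interfaces. Combined with the scarcity of documentation, this has left many Heritrix developers confused. Earlier articles covered configuring, deploying, and running Heritrix; this article walks through an example of extending Extractor in Heritrix 3.x.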
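2. Configuration
The Heritrix 3.x WebUI has changed: instead of the old selection-based WebUI, the configuration file is now edited directly online. For a custom Extractor to take part in a Heritrix crawl, first modify the configuration file to add the custom Extractor to Heritrix's Processor chain. The FETCH CHAIN section of the modified configuration file is shown below:
2.1 Configuration file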
<!-- FETCH CHAIN -->
<!-- processors declared as named beans -->
<bean id="preselector" class="org.archive.crawler.prefetch.Preselector">
  <!-- ... -->
</bean>
<bean id="preconditions" class="org.archive.crawler.prefetch.PreconditionEnforcer">
  <!-- ... -->
</bean>
<bean id="fetchDns" class="org.archive.modules.fetcher.FetchDNS">
  <!-- ... -->
</bean>
<bean id="fetchHttp" class="org.archive.modules.fetcher.FetchHTTP">
  <!-- ... -->
</bean>
<bean id="extractorHttp" class="org.archive.modules.extractor.ExtractorHTTP">
</bean>
<!-- ===== custom Extractor: begin ===== -->
<bean id="SohuNewsExtractor" class="my.SohuNewsExtractor">
</bean>
<!-- ===== custom Extractor: end ===== -->
<bean id="extractorHtml" class="org.archive.modules.extractor.ExtractorHTML">
  <!-- ... -->
</bean>
<bean id="extractorCss" class="org.archive.modules.extractor.ExtractorCSS">
</bean>
<bean id="extractorJs" class="org.archive.modules.extractor.ExtractorJS">
</bean>
<bean id="extractorSwf" class="org.archive.modules.extractor.ExtractorSWF">
</bean>
<!-- assembled into ordered FetchChain bean -->
<bean id="fetchProcessors" class="org.archive.modules.FetchChain">
  <property name="processors">
    <list>
      <!-- recheck scope, if so enabled... -->
      <ref bean="preselector"/>
      <!-- ...then verify or trigger prerequisite URIs fetched, allow crawling... -->
      <ref bean="preconditions"/>
      <!-- ...fetch if DNS URI... -->
      <ref bean="fetchDns"/>
      <!-- ...fetch if HTTP URI... -->
      <ref bean="fetchHttp"/>
      <!-- ...extract outlinks from HTTP headers... -->
      <ref bean="extractorHttp"/>
      <!-- ===== custom Extractor: begin ===== -->
      <!-- ...extract outlinks from HTTP content... -->
      <ref bean="SohuNewsExtractor"/>
      <!-- ===== custom Extractor: end ===== -->
      <!-- ...extract outlinks from HTML content... -->
      <ref bean="extractorHtml"/>
      <!-- ...extract outlinks from CSS content... -->
      <ref bean="extractorCss"/>
      <!-- ...extract outlinks from Javascript content... -->
      <ref bean="extractorJs"/>
      <!-- ...extract outlinks from Flash content... -->
      <ref bean="extractorSwf"/>
    </list>
  </property>
</bean>
2.2 Adding the bean and registering it in the processor list
<bean id="extractorHttp" class="org.archive.modules.extractor.ExtractorHTTP">
</bean>
<!-- ===== custom Extractor: begin ===== -->
<bean id="SohuNewsExtractor" class="my.SohuNewsExtractor">
</bean>
<!-- ===== custom Extractor: end ===== -->
...
<!-- ===== custom Extractor: begin ===== -->
<!-- ...extract outlinks from HTTP content... -->
<ref bean="SohuNewsExtractor"/>
<!-- ===== custom Extractor: end ===== -->
With these two additions in place, the custom Extractor is scheduled as part of the Processor chain.
3. Code notes
3.1 The Extractor base class
The Extractor base class has changed; a subclass now has to override an additional method:
@Override
protected boolean shouldProcess(CrawlURI uri) {
    // TODO Auto-generated method stub
    return false;
}
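If this method is not given a real implementation (the auto-generated stub always returns false), the custom Extractor's void extract(CrawlURI uri) method will never be invoked.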
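The gating described here can be modeled in a few lines. The following is a minimal, self-contained sketch and not Heritrix source code: SketchProcessor, StubExtractor, NewsExtractor, and the news.example.com host are all hypothetical names, and plain strings stand in for CrawlURI. It only illustrates the assumption that the framework calls extract solely when shouldProcess returns true.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the framework's processor base class (NOT Heritrix code).
abstract class SketchProcessor {
    // Template method: the extraction hook runs only if shouldProcess allows it.
    final void process(String uri) {
        if (shouldProcess(uri)) {
            extract(uri);
        }
    }

    protected abstract boolean shouldProcess(String uri);

    protected abstract void extract(String uri);
}

// Mirrors the auto-generated stub: shouldProcess always returns false.
class StubExtractor extends SketchProcessor {
    final List<String> seen = new ArrayList<>();

    @Override
    protected boolean shouldProcess(String uri) {
        return false;
    }

    @Override
    protected void extract(String uri) {
        seen.add(uri);
    }
}

// Accepts only URIs under a hypothetical news host.
class NewsExtractor extends SketchProcessor {
    final List<String> seen = new ArrayList<>();

    @Override
    protected boolean shouldProcess(String uri) {
        return uri.startsWith("http://news.example.com/");
    }

    @Override
    protected void extract(String uri) {
        seen.add(uri);
    }
}

class GatingDemo {
    public static void main(String[] args) {
        StubExtractor stub = new StubExtractor();
        NewsExtractor news = new NewsExtractor();
        for (String uri : new String[] {
                "http://news.example.com/a.html",
                "http://other.example.com/b.html"}) {
            stub.process(uri);
            news.process(uri);
        }
        System.out.println("stub extracted: " + stub.seen);
        System.out.println("news extracted: " + news.seen);
    }
}
```

In a real extractor the same idea applies: shouldProcess would typically inspect the CrawlURI (for example its URI string or content type) and return true only for pages the extractor is meant to handle.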
3.2 Constructor
In 1.x the constructor looked like this:
public Extractor(String name, String description) {
    super(name, description);
    // TODO Auto-generated constructor stub
}
In 3.x the constructor parameters have been removed, and the default (no-argument) constructor is used.
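One plausible reason for this change is that the bean definition in the configuration file names only a class and supplies no constructor arguments, so the container has to create the object through its no-argument constructor. The sketch below imitates that with plain Java reflection; it is an illustration under that assumption, not the container's actual code, and NoArgExtractor, LegacyStyleExtractor, and BeanInstantiationDemo are hypothetical names.

```java
import java.lang.reflect.Constructor;

// 3.x style: only a no-argument constructor.
class NoArgExtractor {
    NoArgExtractor() {
    }
}

// 1.x style: requires a name and a description, so there is no no-arg constructor.
class LegacyStyleExtractor {
    LegacyStyleExtractor(String name, String description) {
    }
}

class BeanInstantiationDemo {
    // Creates an instance the way a class-only bean definition implies:
    // look up the no-argument constructor and invoke it.
    static Object instantiate(Class<?> type) throws ReflectiveOperationException {
        Constructor<?> ctor = type.getDeclaredConstructor(); // NoSuchMethodException if absent
        ctor.setAccessible(true);
        return ctor.newInstance();
    }

    // Returns true if the class can be created without constructor arguments.
    static boolean canInstantiate(Class<?> type) {
        try {
            instantiate(type);
            return true;
        } catch (ReflectiveOperationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("NoArgExtractor: " + canInstantiate(NoArgExtractor.class));
        System.out.println("LegacyStyleExtractor: " + canInstantiate(LegacyStyleExtractor.class));
    }
}
```

A custom Extractor that keeps a 1.x-style two-argument constructor and declares no no-argument one would therefore be expected to fail when the bean is created, unless constructor arguments are declared in the configuration.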
4. Open questions
protected void extract(CrawlURI curi)
{
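    // 1. What processing should be done here?
    // 2. How can the subsequent download behavior be controlled so that only the desired content is fetched?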
}