This was written for the 0.7 branch. For an example using the 0.8 code, see this page
The Example
Consider this as a plugin example: we want to be able to recommend specific web pages for given search terms. For this example we'll assume we're indexing this site. As you may have noticed, there are a number of pages that talk about plugins. If someone searches for the term "plugin", we want to recommend that they start at the PluginCentral page, while still returning all the normal hits in the expected ranking. We'll separate the search results page into a section of recommendations followed by a section with the normal search results.
First, go through your site and add meta tags to the pages, listing the terms each page should be recommended for. The tags look something like this:
<meta name="recommended" content="plugins" />
In order to do this we need to write a plugin that extends 3 different extension points. We need to extend the HtmlParseFilter in order to get the recommended terms out of the meta tags. The IndexingFilter will need to be extended to add a recommended field to the index. The QueryFilter needs to be extended to add the ability to search against the new field in the index.
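The way a recommended term travels through the three extensions can be sketched without Nutch at all. The following is a toy model only: the class, its methods, and the use of plain maps in place of parse metadata and index documents are assumptions made for illustration and are not part of the Nutch API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the parse -> index -> query flow; none of these types are Nutch classes. */
final class RecommendedFlowSketch {

    /** Parse step: copy the "recommended" meta tag (if any) into the page's metadata. */
    static Map<String, String> parse(Map<String, String> metaTags) {
        Map<String, String> metadata = new LinkedHashMap<>();
        String term = metaTags.get("recommended");
        if (term != null) {
            metadata.put("Recommended", term);
        }
        return metadata;
    }

    /** Index step: turn the parse metadata into a "recommended" field on the indexed document. */
    static Map<String, String> index(String url, Map<String, String> metadata) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("url", url);
        String term = metadata.get("Recommended");
        if (term != null) {
            doc.put("recommended", term);
        }
        return doc;
    }

    /** Query step: return the URLs whose "recommended" field equals the search term. */
    static List<String> recommendationsFor(String query, List<Map<String, String>> docs) {
        List<String> urls = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            if (query.equals(doc.get("recommended"))) {
                urls.add(doc.get("url"));
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = new ArrayList<>();
        // One page carries the meta tag, the other does not.
        docs.add(index("PluginCentral", parse(Map.of("recommended", "plugins"))));
        docs.add(index("FrontPage", parse(Map.of())));
        System.out.println(recommendationsFor("plugins", docs)); // prints [PluginCentral]
    }
}
```

In the real plugin each of these steps lives in its own class, and Nutch calls them at parse time, index time, and search time respectively.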
Setup
Start by 
Use the source code for the plugins distributed with Nutch as a reference. They're in [YourCheckoutDir]/src/plugin.
For the example we're going to assume that this plugin is something we want to contribute back to the Nutch community, so we're going to use the directory/package structure of "org/apache/nutch". If you're writing a plugin solely for the use of your organization you'd want to replace that with something like "org/my_organization/nutch".
Required Files
You're going to need to create a directory inside of the plugin directory with the name of your plugin ('recommended' in this case) and inside that directory you need the following:
- A plugin.xml file that tells Nutch about your plugin.
- A build.xml file that tells ant how to build your plugin.
- The source code of your plugin in the directory structure recommended/src/java/org/apache/nutch/parse/recommended/[Source_Here].
Plugin.xml
Your plugin.xml file should look like this:
<?xml version="1.0" encoding="UTF-8"?>
<plugin
   id="recommended"
   name="Recommended Parser/Filter"
   version="0.0.1"
   provider-name="nutch.org">

   <runtime>
      <!-- As defined in build.xml this plugin will end up bundled as recommended.jar -->
      <library name="recommended.jar">
         <export name="*"/>
      </library>
   </runtime>

   <!-- The RecommendedParser extends the HtmlParseFilter to grab the contents of
        any recommended meta tags -->
   <extension id="org.apache.nutch.parse.recommended.recommendedfilter"
              name="Recommended Parser"
              point="org.apache.nutch.parse.HtmlParseFilter">
      <implementation id="RecommendedParser"
                      class="org.apache.nutch.parse.recommended.RecommendedParser"/>
   </extension>

   <!-- The RecommendedIndexer extends the IndexingFilter in order to add the contents
        of the recommended meta tags (as found by the RecommendedParser) to the lucene
        index. -->
   <extension id="org.apache.nutch.parse.recommended.recommendedindexer"
              name="Recommended identifier filter"
              point="org.apache.nutch.indexer.IndexingFilter">
      <implementation id="RecommendedIndexer"
                      class="org.apache.nutch.parse.recommended.RecommendedIndexer"/>
   </extension>

   <!-- The RecommendedQueryFilter gets called when you perform a search. It runs a
        search for the user's query against the recommended fields. In order to get
        this added to the list of filters that gets run by default, you have to use
        "fields=DEFAULT". -->
   <extension id="org.apache.nutch.parse.recommended.recommendedSearcher"
              name="Recommended Search Query Filter"
              point="org.apache.nutch.searcher.QueryFilter">
      <implementation id="RecommendedQueryFilter"
                      class="org.apache.nutch.parse.recommended.RecommendedQueryFilter"
                      fields="DEFAULT"/>
   </extension>

</plugin>
Build.xml
In its simplest form:
<?xml version="1.0"?>
<project name="recommended" default="jar">
   <import file="../build-plugin.xml"/>
</project>
The HTML Parser Extension
This is the source code for the HTML Parser extension. It tries to grab the contents of the recommended meta tag and add them to the document being parsed.
package org.apache.nutch.parse.recommended;

// JDK imports
import java.util.Enumeration;
import java.util.Properties;
import java.util.logging.Logger;

// DOM import (needed for the DocumentFragment parameter of filter)
import org.w3c.dom.DocumentFragment;

// Nutch imports
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.LogFormatter;

public class RecommendedParser implements HtmlParseFilter {

  private static final Logger LOG =
    LogFormatter.getLogger(RecommendedParser.class.getName());

  /** The Recommended meta data attribute name */
  public static final String META_RECOMMENDED_NAME = "Recommended";

  /**
   * Scan the HTML document looking for a recommended meta tag.
   */
  public Parse filter(Content content, Parse parse, HTMLMetaTags metaTags, DocumentFragment doc) {
    // Trying to find the document's recommended term
    String recommendation = null;

    Properties generalMetaTags = metaTags.getGeneralTags();

    for (Enumeration tagNames = generalMetaTags.propertyNames(); tagNames.hasMoreElements(); ) {
      if (tagNames.nextElement().equals("recommended")) {
        recommendation = generalMetaTags.getProperty("recommended");
        LOG.info("Found a Recommendation for " + recommendation);
      }
    }

    if (recommendation == null) {
      LOG.info("No Recommendation");
    } else {
      LOG.info("Adding Recommendation for " + recommendation);
      // Store the term in the parse metadata so the indexing filter can pick it up
      parse.getData().getMetadata().put(META_RECOMMENDED_NAME, recommendation);
    }

    return parse;
  }
}
The Indexer Extension
The following is the code for the Indexing Filter extension. If the document being indexed had a recommended meta tag, this extension adds a Lucene text field called "recommended" to the index, holding the content of that meta tag.
package org.apache.nutch.parse.recommended;

// JDK import
import java.util.logging.Logger;

// Nutch imports
import org.apache.nutch.util.LogFormatter;
import org.apache.nutch.fetcher.FetcherOutput;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.parse.Parse;

// Lucene imports
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Document;

public class RecommendedIndexer implements IndexingFilter {

  public static final Logger LOG =
    LogFormatter.getLogger(RecommendedIndexer.class.getName());

  public RecommendedIndexer() {
  }

  public Document filter(Document doc, Parse parse, FetcherOutput fo)
    throws IndexingException {

    String recommendation = parse.getData().get("Recommended");

    if (recommendation != null) {
      Field recommendedField =
        new Field("recommended", recommendation, Field.Store.YES, Field.Index.UN_TOKENIZED);
      recommendedField.setBoost(5.0f);
      doc.add(recommendedField);
      LOG.info("Added " + recommendation + " to the recommended Field");
    }

    return doc;
  }
}
The QueryFilter
The QueryFilter gets called when the user does a search. We're bumping up the boost for the recommended field in order to increase its influence on the search results.
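To get a feel for what a boost does, here is a toy ranking sketch. It is deliberately not Lucene's scoring formula: the per-field weights, the substring matching, and the class itself are assumptions chosen only to show that a match in a heavily weighted field lifts a page above pages that match in ordinary fields alone.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Toy illustration of field boosts; this is not Lucene's scoring formula. */
final class BoostSketch {

    /** Hypothetical per-field weights; 5.0 mirrors the boost used for the recommended field. */
    static final Map<String, Double> BOOSTS = Map.of("content", 1.0, "recommended", 5.0);

    /** Sum the boost of every weighted field whose value contains the query term. */
    static double score(String term, Map<String, String> doc) {
        double total = 0.0;
        for (Map.Entry<String, Double> boost : BOOSTS.entrySet()) {
            String value = doc.get(boost.getKey());
            if (value != null && value.contains(term)) {
                total += boost.getValue();
            }
        }
        return total;
    }

    /** Return the documents' URLs ordered from highest to lowest toy score. */
    static List<String> rank(String term, List<Map<String, String>> docs) {
        List<Map<String, String>> sorted = new ArrayList<>(docs);
        sorted.sort(Comparator.comparingDouble((Map<String, String> d) -> score(term, d)).reversed());
        List<String> urls = new ArrayList<>();
        for (Map<String, String> doc : sorted) {
            urls.add(doc.get("url"));
        }
        return urls;
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = List.of(
            Map.of("url", "WritingPluginExample", "content", "how to write a plugin"),
            Map.of("url", "PluginCentral", "content", "plugin overview", "recommended", "plugin"));
        // The second page matches in both fields (1.0 + 5.0), so it is listed first.
        System.out.println(rank("plugin", docs)); // prints [PluginCentral, WritingPluginExample]
    }
}
```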
package org.apache.nutch.parse.recommended;

import org.apache.nutch.searcher.FieldQueryFilter;

import java.util.logging.Logger;
import org.apache.nutch.util.LogFormatter;

public class RecommendedQueryFilter extends FieldQueryFilter {

  // Log under this class's own name rather than the parser's
  private static final Logger LOG =
    LogFormatter.getLogger(RecommendedQueryFilter.class.getName());

  public RecommendedQueryFilter() {
    // Search the "recommended" field with a boost of 5
    super("recommended", 5f);
    LOG.info("Added a recommended query");
  }
}
Getting Nutch to Use Your Plugin
In order to get Nutch to use your plugin, you need to edit your conf/nutch-site.xml file and add in a block like this:
<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>
You'll want to edit the regular expression so that it includes the name of your plugin.
<value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|recommended</value>
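A quick way to sanity-check the edited expression is to try it against a few plugin directory names. The sketch below assumes that the whole name has to match the expression; the class and method names are made up for this example and are not part of Nutch.

```java
import java.util.regex.Pattern;

/** Standalone check of a plugin.includes value; not part of Nutch. */
final class PluginIncludesCheck {

    /** The edited expression, with the plugin's directory name appended. */
    static final String INCLUDES =
        "nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)"
        + "|index-basic|query-(basic|site|url)|recommended";

    /** True when the entire plugin name matches the expression (assumed matching rule). */
    static boolean isIncluded(String pluginName) {
        return Pattern.matches(INCLUDES, pluginName);
    }

    public static void main(String[] args) {
        System.out.println(isIncluded("recommended")); // prints true
        System.out.println(isIncluded("parse-html"));  // prints true
        System.out.println(isIncluded("parse-pdf"));   // prints false
        System.out.println(isIncluded("reccomended")); // prints false: a misspelled name is excluded
    }
}
```

The last case is worth noting: the name in the expression has to be spelled exactly like the plugin's directory, otherwise the plugin is silently left out.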
Getting Ant to Compile Your Plugin
In order for ant to compile and deploy your plugin you need to edit the src/plugin/build.xml file (NOT the build.xml in the root of your checkout directory). You'll see a number of lines that look like
<ant dir="[plugin-name]" target="deploy" />
Edit this block to add a line for your plugin before the </target> tag.
<ant dir="recommended" target="deploy" />
Running 'ant' in the root of your checkout directory should get everything compiled and jarred up. The next time you run a crawl your parser and index filter should get used.
You'll need to run 'ant war' to compile a new ROOT.war file. Once you've deployed that, your query filter should get used when searches are performed.
See also: HowToContribute, PluginCentral

