1. Modifying the query analyzer
The file to modify is net.nutch.analysis.NutchAnalysis.java. It is generated automatically from NutchAnalysis.jj by JavaCC, but we can edit it by hand to support Chinese.
package net.nutch.analysis;

import net.nutch.searcher.Query;
import java.io.*;

/** The JavaCC-generated Nutch lexical analyzer and query parser, hand-edited for Chinese. */
public class NutchAnalysis {

  private String queryString;

  /** Parse a query string into a Query. */
  public static Query parseQuery(String queryString) throws IOException {
    NutchAnalysis parser = new NutchAnalysis();
    parser.queryString = queryString;
    return parser.parse();
  }

  /** For debugging. */
  public static void main(String[] args) throws Exception {
    String sentence = "廈門大學藝術教育學院副院長李未明教授長期從事音樂教學,";
    StringReader input = new StringReader(sentence);
    BufferedReader in = new BufferedReader(input);
    //while (true) {
    System.out.print("Query: ");
    String line = in.readLine();
    System.out.println(parseQuery(line));
    //}
  }

  /** Parse a query. */
  final public Query parse() throws IOException {
    Query query = new Query();
    StringReader input = new StringReader(queryString);
    // Segment the query text with the Chinese tokenizer instead of the JavaCC grammar.
    org.apache.lucene.analysis.TokenStream tokenizer = new seg.result.CnTokenizer(input);
    // Just a demo: each token becomes a required one-term phrase whose field is the token type.
    for (org.apache.lucene.analysis.Token t = tokenizer.next(); t != null; t = tokenizer.next()) {
      String[] array = {t.termText()};
      query.addRequiredPhrase(array, t.type());
    }
    return query;
  }
}
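
To make the loop in parse() easier to follow, here is a minimal, self-contained sketch of what it does. It is not the Nutch or Lucene API: Tok and SketchQuery are hypothetical stand-ins for Lucene's Token and Nutch's Query, and the field:term output format is an assumption modeled on the test output in the next section. The two sample terms are written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in sketch only: mimics the shape of the loop in parse(), not the real classes. */
class PosFieldSketch {

    /** Hypothetical token: a term plus the type (part-of-speech tag) set by the segmenter. */
    static final class Tok {
        final String term;
        final String type;

        Tok(String term, String type) {
            this.term = term;
            this.type = type;
        }
    }

    /** Hypothetical query: collects required clauses and prints them as field:term. */
    static final class SketchQuery {
        private final List<String> clauses = new ArrayList<>();

        void addRequiredPhrase(String[] terms, String field) {
            clauses.add(field + ":" + String.join(" ", terms));
        }

        @Override
        public String toString() {
            return String.join(" ", clauses);
        }
    }

    /** Mirrors the loop in parse(): one required single-term clause per token. */
    static SketchQuery toQuery(List<Tok> tokens) {
        SketchQuery query = new SketchQuery();
        for (Tok t : tokens) {
            query.addRequiredPhrase(new String[] {t.term}, t.type);
        }
        return query;
    }

    public static void main(String[] args) {
        List<Tok> tokens = new ArrayList<>();
        tokens.add(new Tok("\u5EC8\u9580", "ns")); // "Xiamen", tagged ns
        tokens.add(new Tok("\u5927\u5B78", "n"));  // "university", tagged n
        System.out.println("Query: " + toQuery(tokens));
    }
}
```

Because the token type is passed where a field name is expected, the resulting query is mainly useful for checking that segmentation works, which matches the "just a demo" comment in parse().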
2. Testing
Run the following on the command line:
>java "-Ddic.dir=D:/SSeg/Dic" -classpath D:\lucenne\lucene-1.4-final.jar;D:\SSeg\lib\seg.jar;D:\SSeg\lib\nutch.jar net.nutch.analysis.NutchAnalysis
The returned query object prints as:
Query: ns:廈門 n:大學 n:藝術 vn:教育 n:學院 b:副 n:院長 nr:李 nr:未明 n:教授 d:長期 v:從事 n:音樂 vn:教學 w:,
On a Unix command line the invocation differs slightly:
$java "-Ddic.dir=/home/nutch/Dic" -cp /home/nutch/lib/lucene-1.4-final.jar:/home/nutch/lib/seg.jar:/home/nutch/lib/nutch.jar net.nutch.analysis.NutchAnalysis
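
The two command lines differ only in the classpath separator and in the paths themselves. As a rough sketch, the following shows both pieces from the Java side: File.pathSeparator is ";" on Windows and ":" on Unix, and a -Ddic.dir=... option is visible to the program as the dic.dir system property. LaunchSettingsSketch and its methods are hypothetical names; how seg.jar actually reads dic.dir is not shown here and is assumed to be via System.getProperty.

```java
import java.io.File;

/** Hypothetical helper illustrating the platform-dependent parts of the launch command. */
class LaunchSettingsSketch {

    /** Joins classpath entries with the given separator (";" on Windows, ":" on Unix). */
    static String joinClasspath(String separator, String... entries) {
        return String.join(separator, entries);
    }

    /** Returns the value passed as -Ddic.dir=..., or the fallback when it is unset. */
    static String dictionaryDir(String fallback) {
        return System.getProperty("dic.dir", fallback);
    }

    public static void main(String[] args) {
        // File.pathSeparator picks the right separator for the current platform.
        String cp = joinClasspath(File.pathSeparator,
                "lucene-1.4-final.jar", "seg.jar", "nutch.jar");
        System.out.println("classpath: " + cp);
        System.out.println("dic.dir:   " + dictionaryDir("(not set)"));
    }
}
```

Running it with java -Ddic.dir=/home/nutch/Dic LaunchSettingsSketch should echo that directory back, which is a quick way to confirm the option is reaching the JVM before debugging the segmenter itself.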