A WebLucene full-text search case study: www.bgo.cn

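I have been using WebLucene, one of Che Dong's projects (introduced at http://www.chedong.com/tech/weblucene.html), for a while now, and I am writing this up at a friend's request.
To see the result first: browse www.bgo.cn, then run a search at weblucene.bgo.cn and see what you think.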

Step 1: Setting up the WebLucene environment

1. System environment:
      redhat9.0
      j2sdk-1_4_2_08-linux-i586-rpm.bin      ant1.6.2
      tomcat5.0.30
      The environment variables for the JDK, ant and tomcat are already set in /etc/profile; ant and tomcat are installed under /home.

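2. Compiling the WebLucene source

I used JavaCC 3.2, which can be downloaded from https://javacc.dev.java.net/servlets/ProjectDocumentList . Unpack it into /home and rename the directory to javacc.

Download WebLucene from http://sourceforge.net/projects/weblucene/
Unpack weblucene.tar; I put it in /opt. (Have a look at /opt/weblucene/BUILD.txt, or skip it for now and come back to it if the configuration described below does not work.)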

Set the following in build.properties in the root directory:
          jsdk_jar=/home/tomcat/common/lib/servlet-api.jar
          javacc.home = /home/javacc/bin
          javacc.zip.dir = ${javacc.home}/lib
          javacc.zip = ${javacc.zip.dir}/javacc.jar

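   The WebLucene version I downloaded needs the following preparation before ant will build it:
       cd /opt/weblucene/webapp/WEB-INF/
       mkdir build
       mkdir classes
       cp -r src build
   Alternatively, you can modify build.xml instead.
   Then go back to the WebLucene root directory, /opt/weblucene, and run:
      /home/ant/bin/ant build

   If it prints
      BUILD SUCCESSFUL
      Total time: **seconds
   the build succeeded.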

3. Building the index

As a trial, index dump/blog.xml.
    Go to the classes directory:
        cp ../lib/lucene.jar ../classes
        jar xvf lucene.jar
I wrote an index.sh (on Windows, write an equivalent .bat file):
       export LIB=/opt/weblucene/webapp/WEB-INF/lib
       export CLASSPATH=$LIB/../classes:$LIB/lucene.jar:$LIB/xercesImpl.jar:$LIB/log4j.jar:$LIB/java-getopt.jar:$LIB/jdom.jar:$LIB/xalan.jar:
       cd /opt/weblucene/webapp/WEB-INF/classes
       java IndexRunner -i /opt/weblucene/dump/blog.xml -o /opt/weblucene/webapp/WEB-INF/var/blog

Run ./index.sh

    If it prints
       [main] INFO IndexRunner - Great! Indexing OK
    the index was created successfully.

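4. Setting up the server
Point Tomcat's default web application path at /opt/weblucene/webapp/

Create a file named appname.conf in WEB-INF/conf; I left mine empty.

Start tomcat.

Open a browser and go to http://<your IP address>/search.html

   Once the page loads, enter a word (preferably one that appears in blog.xml) and you should see some results.

    Note: with a JDK older than 1.4.2 or a Tomcat older than 5, copy weblucene/webapp/WEB-INF/lib/xalan.jar into ${tomcat}/common/endorsed.
    Also: on Windows it runs just as well, as long as the corresponding paths are set correctly. Feel free to email me to discuss.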

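Step 2: Indexing your own site. Che Dong wrote this part in PHP (the code ships with WebLucene); I will use Java.

1. Read http://www.chedong.com/tech/weblucene.html first. One step in building the index requires you to generate an XML file from your own site's data.

     So how is that file generated?
Let us first look at the format of blog.xml. Here is a sample: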
     <?xml version="1.0" encoding="GB2312"?>
       <Table>
        <Record id="1">
<Field name="Url">http://www.javaws.com/snipsnap/space/2003-12-14#SnipSnap@</Field>
<Field name="Title">SnipSnap@(Jetty+Apache2+mod_jk2+Mysql+Tomcat4.1) Vol.2</Field>
<Field name="Author">_~j.h.S.e.A.3.D.o.hkjh^^.S.A.D.~_</Field>
<Field name="Content">設置好了Apache。 </Field>
<Field name="PubTime">2003-12-20 21:50:20</Field>
<Index name="FullIndex">Title,Content,Author</Index>
        </Record>

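Table: the top-level element that wraps all the records.
Record: the unit of indexing; put simply, one record for each page to be indexed. I have not fully worked out the id attribute; it is probably meant to be used together with Field name="PubTime".
Field name="Url": the address of the page to link to; dynamic and static pages both work.
Field name="Title": every page on bgo has a title, so this is easy to supply. The field is optional; the title from the HTML head would also do.
Field name="Author": self-explanatory.
Field name="Content": the full text of the page; bgo has a matching database column for each of these fields.
Field name="PubTime": the time the page was generated. WebLucene can sort results by time, so for a result to rank near the top its time has to be later,
and its Record id should be correspondingly larger. bgo stores a publish time, so sorting by that column in ascending order is enough.
<Index name="FullIndex">Title,Content,Author</Index>: this line lists the fields to build the full-text index from. Url and PubTime do not need to be included. Leave out whichever of Title, Content and Author you did not supply; an empty element also works.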
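The ordering described for PubTime and Record id can be sketched as follows. This is only an illustration: the Item class and the numberByPublishTime method are hypothetical names, not part of WebLucene or of bgo's code. The sketch sorts the records oldest-first and numbers them from 1, so the most recently published record ends up with the largest id.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Date;
import java.util.List;

// Hypothetical stand-in for a site's record bean; only the publish time matters here.
final class Item {
    final String title;
    final Date publishTime;
    int recordId; // assigned after sorting

    Item(String title, Date publishTime) {
        this.title = title;
        this.publishTime = publishTime;
    }
}

final class RecordOrder {
    // Returns a copy sorted by ascending publish time, with ids 1..n assigned in that order.
    static List<Item> numberByPublishTime(List<Item> items) {
        List<Item> sorted = new ArrayList<Item>(items);
        Collections.sort(sorted, new Comparator<Item>() {
            public int compare(Item a, Item b) {
                return a.publishTime.compareTo(b.publishTime);
            }
        });
        for (int i = 0; i < sorted.size(); i++) {
            sorted.get(i).recordId = i + 1;
        }
        return sorted;
    }
}
```

If the rows already come out of the database in ascending publish-time order, the sort is redundant and only the numbering loop is needed.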

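That completes the analysis; I hope it is clear.

The hardest part of generating the XML, for me, was filtering out characters that XML cannot accept, which takes some knowledge of Java regular expressions. Otherwise, each Field element corresponds to a column that bgo already has, which is quite convenient.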
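One way to do that filtering without regular expressions is to test each character against the XML 1.0 definition of a legal character: tab, line feed, carriage return, and the ranges U+0020 to U+D7FF, U+E000 to U+FFFD and U+10000 to U+10FFFF. The following is a minimal sketch under that assumption; XmlChars, isLegal and strip are names made up for illustration and are not part of WebLucene.

```java
final class XmlChars {
    // True for code points allowed by the XML 1.0 "Char" production.
    static boolean isLegal(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns the input with every illegal code point dropped.
    static String strip(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (isLegal(cp)) {
                out.appendCodePoint(cp);
            }
            // Advance by one or two chars depending on the code point
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

This only removes characters; markup characters such as & and < still have to be escaped separately.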
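bgo builds its index straight from the database: the rows are fetched in publish-time order, each one is loaded into a bean named AdInfo, and the beans are put into a List. That part is not open-sourced; the code below works directly on the List.
The code is as follows: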

import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.text.SimpleDateFormat;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Date: 2005-8-30
 * Time: 9:37:49
 * @author Tao J
 */
public class SnatchAdInfo {

    public static void main(String[] args) {

        // Path of the XML file to write
        String path = args[0];

        List result = null;
        // Loading the AdInfo beans from the database is omitted here

        try {
            // Write in the encoding that the XML declaration announces
            PrintWriter pw = new PrintWriter(new BufferedWriter(
                    new OutputStreamWriter(new FileOutputStream(path), "GBK")));

            pw.println("<?xml version=\"1.0\" encoding=\"GBK\"?>");
            pw.println("<Table>");

            for (int i = 0; i < result.size(); i++) {

                // Record ids start at 1
                int j = i + 1;

                AdInfo info = (AdInfo) result.get(i);
                pw.println("<Record id=\"" + j + "\">");
                pw.println("<Field name=\"Url\">http://www.bgo.cn/"
                        + new SimpleDateFormat("yyyyMMdd").format(info.getPublishTime()) + "/"
                        + new SimpleDateFormat("HHmmss").format(info.getPublishTime()) + info.getId() + ".html"
                        + "</Field>");

                // bgo titles sometimes begin with three dots, which have to be filtered out
                String tmp = "\\.\\.\\.";
                pw.println("<Field name=\"Title\">" + convert(info.getTitle().replaceFirst(tmp, "")) + "</Field>");
                if (null == info.getLinkman()) {
                    pw.println("<Field name=\"Author\"></Field>");
                } else {
                    // Escape "&" so that the author name stays well-formed XML
                    pw.println("<Field name=\"Author\">" + info.getLinkman().replaceAll("&", "&amp;") + "</Field>");
                }
                pw.println("<Field name=\"Content\">" + convert(info.getContent()) + "</Field>");
                pw.println("<Field name=\"PubTime\">" + info.getPublishTime() + "</Field>");
                pw.println("<Index name=\"FullIndex\">Title,Content,Author</Index>");
                pw.println("</Record>");
            }

            pw.println("</Table>");

            pw.close();

        } catch (IOException e) {
            e.printStackTrace();
        }

    }

    /**
     * Strips HTML markup and characters that XML does not allow.
     * @param context the raw text of a field
     * @return the filtered text
     */
    protected final static String convert(String context) {
        try {

            // Strip HTML tags; this is still far from complete
            context = replace(context, "<[\\w]{1,5}[\\s]+([\\w]*=[\"]?[^\"]*[\"]?[\\s]*)+[\\w]*>", "");
            context = replace(context, "</[\\w]{1,6}>", "");
            context = replace(context, "<[\\w]{1,6}>", "");

            context = replace(context, "&nbsp;", "");
            // Escape "&" first so that the entities added next are not escaped twice
            context = replace(context, "&", "&amp;");
            context = replace(context, "<", "&lt;");
            context = replace(context, ">", "&gt;");
            context = replace(context, "[\\s]", " ");

            // Remove the control characters that XML 1.0 does not allow
            // (tab, line feed and carriage return are legal and are already spaces by now)
            context = replace(context, "[\\x00-\\x08\\x0b\\x0c\\x0e-\\x1f]", "");

        } catch (IndexOutOfBoundsException e) {
            System.out.println(e.toString());
        }
        return context;
    }

    protected final static String replace(String master, String regx, String repl) {
        // Compile the regular expression and replace every match in master
        Pattern p = Pattern.compile(regx);
        Matcher m = p.matcher(master);
        return m.replaceAll(repl);
    }
}
Here args[0] is the path of the XML file to write.
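Before handing the generated file to IndexRunner, it may be worth checking that it is well-formed XML, since an XML parser stops at the first illegal character or unescaped markup. A small sketch using the standard JAXP parser from the JDK; DumpCheck and countRecords are hypothetical names, not part of WebLucene:

```java
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class DumpCheck {
    // Parses the stream and returns the number of Record elements.
    // Throws an exception if the XML is not well-formed.
    static int countRecords(InputStream in) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(in);
        return doc.getElementsByTagName("Record").getLength();
    }
}
```

Calling DumpCheck.countRecords(new java.io.FileInputStream(path)) on the generated file should return the size of the List that was written out; an exception points at the line and column of the first problem.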

Next, edit index.sh:
       change the line java IndexRunner -i /opt/weblucene/dump/blog.xml -o /opt/weblucene/webapp/WEB-INF/var/blog
       to java IndexRunner -i <path to your XML file> -o /opt/weblucene/webapp/WEB-INF/var/blog

        Run ./index.sh again
        and everything should be OK.

Step 3: Adding to the index. To be continued.

