Lucene
Doug Cutting
cutting@apache.org
November 24 2004
University of Pisa
Prelude
- my background...
- please interrupt with questions
- blog this talk now so that we can search for it later
(using a Lucene-based blog search engine, of course)
- In this course, Paolo and Antonio have presented many techniques.
- I present real software that uses many of these techniques.
Lucene is
Lucene Architecture
[draw on whiteboard for reference throughout talk]
Lucene API
- org.apache.lucene.document
- org.apache.lucene.analysis
- org.apache.lucene.index
- org.apache.lucene.search
Package: org.apache.lucene.document
- A Document is a sequence of Fields.
- A Field is a <name, value> pair.
- name is the name of the field, e.g., title, body, subject, date, etc.
- value is text.
- Field values may be stored, indexed, or analyzed (and, now, term-vectored).
Example
public Document makeDocument(File f) throws FileNotFoundException {
  Document doc = new Document();
  doc.add(new Field("path", f.getPath(), Store.YES, Index.TOKENIZED));
  doc.add(new Field("modified",
                    DateTools.timeToString(f.lastModified(), Resolution.MINUTE),
                    Store.YES, Index.UN_TOKENIZED));
  // Reader implies Store.NO and Index.TOKENIZED
  doc.add(new Field("contents", new FileReader(f)));
  return doc;
}
Example (continued)
| field    | stored | indexed | analyzed |
| path     | yes    | yes     | yes      |
| modified | yes    | yes     | no       |
| content  | no     | yes     | yes      |
Package: org.apache.lucene.analysis
- An Analyzer is a TokenStream factory.
- A TokenStream is an iterator over Tokens.
- input is a character iterator (Reader)
- A Token is a tuple <text, type, start, length, positionIncrement>
- text (e.g., “pisa”).
- type (e.g., “word”, “sent”, “para”).
- start & length offsets, in characters (e.g., <5,4>)
- positionIncrement (normally 1)
- standard TokenStream implementations are:
  - Tokenizers, which divide characters into tokens
  - TokenFilters, e.g., stop lists, stemmers, etc.
Example
public class ItalianAnalyzer extends Analyzer {
  private Set stopWords =
    StopFilter.makeStopSet(new String[] {"il", "la", "in"});

  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new WhitespaceTokenizer(reader);
    result = new LowerCaseFilter(result);
    result = new StopFilter(result, stopWords);
    result = new SnowballFilter(result, "Italian");
    return result;
  }
}
Package: org.apache.lucene.index
- Term is <fieldName, text>
- index maps Term → <df, <docNum, <position>* >*>
- e.g., “content:pisa” → <2, <2, <14>>, <4, <2, 9>>>
- new: term vectors!
Example
IndexWriter writer = new IndexWriter("index", new ItalianAnalyzer());
File[] files = directory.listFiles();
for (int i = 0; i < files.length; i++) {
  writer.addDocument(makeDocument(files[i]));
}
writer.close();
Some Inverted Index Strategies
- batch-based: use file-sorting algorithms (textbook)
+ fastest to build
+ fastest to search
- slow to update
- b-tree based: update in place (http://lucene.sf.net/papers/sigir90.ps)
+ fast to search
- update/build does not scale
- complex implementation
- segment based: lots of small indexes (Verity)
+ fast to build
+ fast to update
- slower to search
Lucene's Index Algorithm
- two basic algorithms:
- make an index for a single document
- merge a set of indices
- incremental algorithm:
- maintain a stack of segment indices
- create index for each incoming document
- push new indexes onto the stack
- let b=10 be the merge factor; M=∞
for (size = 1; size < M; size *= b) {
  if (there are b indexes with size docs on top of the stack) {
    pop them off the stack;
    merge them into a single index;
    push the merged index onto the stack;
  } else {
    break;
  }
}
- optimization: single-doc indexes kept in RAM, saves system calls
- notes:
- on average, b*log_b(N)/2 indexes
- N=1M, b=2 gives just 20 indexes
- fast to update and not too slow to search
- batch indexing w/ M=∞, merge all at end
- equivalent to external merge sort, optimal
- segment indexing w/ M<∞
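A minimal Java simulation of this merge policy (an illustrative sketch, not Lucene's actual implementation): segment sizes are kept on a stack, and whenever b segments of equal size accumulate on top, they are merged into one segment b times larger.

import java.util.ArrayDeque;
import java.util.Deque;

public class MergeSimulation {
  static final int B = 10;                                 // merge factor b
  static Deque<Integer> stack = new ArrayDeque<Integer>(); // segment sizes, top of stack first

  static void addDocument() {
    stack.push(1);                              // index the new document as a one-doc segment
    for (int size = 1; ; size *= B) {
      if (!topHasEqualSegments(size)) break;    // fewer than b segments of this size on top
      int merged = 0;
      for (int i = 0; i < B; i++) merged += stack.pop();
      stack.push(merged);                       // replace them with one merged segment
    }
  }

  static boolean topHasEqualSegments(int size) {
    int count = 0;
    for (int s : stack) {                       // walks the stack from the top down
      if (s != size || ++count == B) break;
    }
    return count == B;
  }

  public static void main(String[] args) {
    for (int i = 0; i < 1000; i++) addDocument();
    System.out.println(stack);                  // prints [1000]: fully merged after 1,000 docs
  }
}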
Indexing Diagram
- b = 3
- 11 documents indexed
- stack has four indexes
- grayed indexes have been deleted
- 5 merges have occurred
Index Compression
For keys in the Term -> ... map, use the technique from Paolo's slides
(in Lucene: prefix compression of the sorted term dictionary).
For values in the Term -> ... map, use the technique from Paolo's slides
(in Lucene: delta-encoded doc numbers and positions, written as VInts).
VInt Encoding Example
| Value  | First byte | Second byte | Third byte |
| 0      | 00000000   |             |            |
| 1      | 00000001   |             |            |
| 2      | 00000010   |             |            |
| ...    |            |             |            |
| 127    | 01111111   |             |            |
| 128    | 10000000   | 00000001    |            |
| 129    | 10000001   | 00000001    |            |
| 130    | 10000010   | 00000001    |            |
| ...    |            |             |            |
| 16,383 | 11111111   | 01111111    |            |
| 16,384 | 10000000   | 10000000    | 00000001   |
| 16,385 | 10000001   | 10000000    | 00000001   |
| ...    |            |             |            |
This provides compression while still being efficient to decode.
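A minimal sketch of the VInt coding shown above: seven payload bits per byte, low-order bytes first, with the high bit set on every byte except the last. (This matches the byte layout in the table; the classes here are just for illustration, not Lucene's own I/O classes.)

import java.io.ByteArrayOutputStream;

public class VIntSketch {
  static void writeVInt(ByteArrayOutputStream out, int value) {
    while ((value & ~0x7F) != 0) {       // more than 7 bits remain
      out.write((value & 0x7F) | 0x80);  // emit 7 bits, set the continuation bit
      value >>>= 7;
    }
    out.write(value);                    // final byte has its high bit clear
  }

  static int readVInt(byte[] in, int start) {
    int value = 0;
    for (int i = start, shift = 0; ; i++, shift += 7) {
      byte b = in[i];
      value |= (b & 0x7F) << shift;      // accumulate 7 payload bits
      if ((b & 0x80) == 0) return value; // high bit clear: last byte
    }
  }
}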
Package: org.apache.lucene.search
- primitive queries:
  - TermQuery: match docs containing a Term
  - PhraseQuery: match docs w/ sequence of Terms
  - BooleanQuery: match docs matching other queries,
    e.g., +path:pisa +content:"Doug Cutting" -path:nutch
- new: SpanQuery
- derived queries:
  - PrefixQuery, WildcardQuery, etc.
Example
Query pisa = new TermQuery(new Term("content", "pisa"));
Query babel = new TermQuery(new Term("content", "babel"));
PhraseQuery leaningTower = new PhraseQuery();
leaningTower.add(new Term("content", "leaning"));
leaningTower.add(new Term("content", "tower"));
BooleanQuery query = new BooleanQuery();
query.add(leaningTower, Occur.MUST);
query.add(pisa, Occur.SHOULD);
query.add(babel, Occur.MUST_NOT);
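To run the query built above, hand it to an IndexSearcher. A short sketch, assuming the Lucene 1.4-era Hits API, plus the "index" directory and stored "path" field from the earlier examples:

IndexSearcher searcher = new IndexSearcher("index");
Hits hits = searcher.search(query);
for (int i = 0; i < hits.length(); i++) {
  // "path" was stored at indexing time, so it can be read back from each hit
  System.out.println(hits.doc(i).get("path") + " score=" + hits.score(i));
}
searcher.close();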
Search Algorithms
From Paolo's slides:
Lucene's Disjunctive Search Algorithm
- described in http://lucene.sf.net/papers/riao97.ps
- since all postings must be processed
- goal is to minimize per-posting computation
- merges postings through a fixed-size array of accumulator buckets
- performs boolean logic with bit masks
- scales well with large queries
[ draw a diagram to illustrate? ]
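A rough sketch of the idea (illustration only, not Lucene's code): postings are merged, one window of documents at a time, into a fixed-size array of score accumulators, with a per-bucket bit mask recording which query terms matched. Postings is a hypothetical stand-in for a term's scored posting list.

import java.util.Arrays;
import java.util.List;

public class BucketScorerSketch {
  static final int WINDOW = 1 << 11;        // number of accumulator buckets

  // Hypothetical stand-in for one term's postings, sorted by document number.
  static class Postings {
    int[] docs; float[] scores; int pos;
    boolean exhausted() { return pos >= docs.length; }
  }

  // requiredMask has one bit per term that MUST match (0 for a pure disjunction).
  static void score(List<Postings> terms, int requiredMask, int maxDoc) {
    float[] score = new float[WINDOW];
    int[] matched = new int[WINDOW];
    for (int base = 0; base < maxDoc; base += WINDOW) {   // one window of docs at a time
      Arrays.fill(score, 0f);
      Arrays.fill(matched, 0);
      for (int t = 0; t < terms.size(); t++) {
        Postings p = terms.get(t);
        while (!p.exhausted() && p.docs[p.pos] < base + WINDOW) {
          int bucket = p.docs[p.pos] - base;              // cheap per-posting work
          score[bucket] += p.scores[p.pos];
          matched[bucket] |= 1 << t;                      // boolean logic via bit masks
          p.pos++;
        }
      }
      for (int i = 0; i < WINDOW; i++) {                  // harvest hits from this window
        if (matched[i] != 0 && (matched[i] & requiredMask) == requiredMask) {
          collect(base + i, score[i]);
        }
      }
    }
  }

  static void collect(int doc, float score) { /* e.g., keep the top N in a heap */ }
}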
Lucene's Conjunctive Search Algorithm
From Paolo's slides:
Algorithm
- use a linked list of pointers to doc lists
- initially sorted by doc
- loop:
  - if all pointers are at the same doc, record a hit
  - skip the first pointer to-or-past the last pointer's doc, then move it to the end of the list
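A minimal sketch of this loop (illustration only; Cursor is a hypothetical stand-in for a postings iterator with a skip-to operation). After recording a hit, the sketch also advances past it so the loop makes progress:

import java.util.Deque;

public class ConjunctionSketch {
  interface Cursor {
    int doc();                      // current document number
    boolean skipTo(int target);     // advance to the first doc >= target; false when exhausted
  }

  // cursors: one per query term, each positioned on its first doc, sorted by doc().
  static void intersect(Deque<Cursor> cursors) {
    while (true) {
      Cursor first = cursors.peekFirst();   // smallest current doc
      Cursor last = cursors.peekLast();     // largest current doc
      if (first.doc() == last.doc()) {      // all cursors agree: a hit
        hit(first.doc());
        if (!first.skipTo(first.doc() + 1)) return;   // advance past the hit
      } else if (!first.skipTo(last.doc())) {         // skip first to-or-past last
        return;
      }
      cursors.addLast(cursors.removeFirst());         // move it to the end of the list
    }
  }

  static void hit(int doc) { System.out.println("hit: " + doc); }
}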
Scoring
From Paolo's slides:
It is very much like Lucene's Similarity.
Lucene's Phrase Scoring
- approximate the phrase's IDF with the sum of its terms' IDFs
- compute the actual tf of the phrase in each document
- slop: slight mismatches are penalized, based on edit distance
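Roughly, as a sketch (not Lucene's exact implementation; the square-root tf and the 1/(distance+1) slop factor follow Lucene's default Similarity):

public class PhraseScoreSketch {
  // termIdfs: one idf value per term in the phrase;
  // matchDistances: the edit distance of each (possibly sloppy) phrase match in one document.
  static float phraseScore(float[] termIdfs, int[] matchDistances) {
    float idf = 0f;
    for (float termIdf : termIdfs) {
      idf += termIdf;                       // approximate the phrase idf by the sum of term idfs
    }
    float freq = 0f;
    for (int distance : matchDistances) {
      freq += 1f / (distance + 1);          // sloppier matches contribute less "frequency"
    }
    return (float) Math.sqrt(freq) * idf;   // then score the phrase much like a single term
  }
}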
Thanks!
And there's lots more to Lucene.
Check out
http://jakarta.apache.org/lucene/.
Finally, search for this talk on
Technorati.