Lucene
Doug Cutting
cutting@apache.org
November 24 2004
University of Pisa
Prelude
- my background...
- please interrupt with questions
- blog this talk now so that we can search for it later
(using a Lucene-based blog search engine, of course)
- In this course, Paolo and Antonio have presented many techniques.
- I present real software that uses many of these techniques.
Lucene is
Lucene Architecture
[draw on whiteboard for reference throughout talk]
Lucene API
- org.apache.lucene.document
- org.apache.lucene.analysis
- org.apache.lucene.index
- org.apache.lucene.search
Package: org.apache.lucene.document
- A Document is a sequence of Fields.
- A Field is a <name, value> pair.
- name is the name of the field, e.g., title, body, subject, date, etc.
- value is text.
- Field values may be stored, indexed, or analyzed (and, now, term-vectored).
Example
public Document makeDocument(File f) throws FileNotFoundException {
  Document doc = new Document();
  doc.add(new Field("path", f.getPath(), Store.YES, Index.TOKENIZED));
  doc.add(new Field("modified",
                    DateTools.timeToString(f.lastModified(), Resolution.MINUTE),
                    Store.YES, Index.UN_TOKENIZED));
  // Reader implies Store.NO and Index.TOKENIZED
  doc.add(new Field("contents", new FileReader(f)));
  return doc;
}
Example (continued)
| field    | stored | indexed | analyzed |
| path     | yes    | yes     | yes      |
| modified | yes    | yes     | no       |
| content  | no     | yes     | yes      |
Package: org.apache.lucene.analysis
- An Analyzer is a TokenStream factory.
- A TokenStream is an iterator over Tokens.
- input is a character iterator (Reader)
- A Token is a tuple <text, type, start, length, positionIncrement>
- text (e.g., “pisa”).
- type (e.g., “word”, “sent”, “para”).
- start & length offsets, in characters (e.g., <5,4>)
- positionIncrement (normally 1)
- standard TokenStream implementations are:
  - Tokenizers, which divide characters into tokens
  - TokenFilters, e.g., stop lists, stemmers, etc.
Example
public class ItalianAnalyzer extends Analyzer {
  private Set stopWords =
    StopFilter.makeStopSet(new String[] {"il", "la", "in"});

  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new WhitespaceTokenizer(reader);
    result = new LowerCaseFilter(result);
    result = new StopFilter(result, stopWords);
    result = new SnowballFilter(result, "Italian");
    return result;
  }
}
Package: org.apache.lucene.index
- Term is <fieldName, text>
- index maps Term → <df, <docNum, <position>* >*>
- e.g., “content:pisa” → <2, <2, <14>>, <4, <2, 9>>>
- new: term vectors!
Example
IndexWriter writer = new IndexWriter("index", new ItalianAnalyzer());
File[] files = directory.listFiles();
for (int i = 0; i < files.length; i++) {
  writer.addDocument(makeDocument(files[i]));
}
writer.close();
Some Inverted Index Strategies
- batch-based: use file-sorting algorithms (textbook)
+ fastest to build
+ fastest to search
- slow to update
- b-tree based: update in place (http://lucene.sf.net/papers/sigir90.ps)
+ fast to search
- update/build does not scale
- complex implementation
- segment based: lots of small indexes (Verity)
+ fast to build
+ fast to update
- slower to search
Lucene's Index Algorithm
- two basic algorithms:
- make an index for a single document
- merge a set of indices
- incremental algorithm:
- maintain a stack of segment indices
- create index for each incoming document
- push new indexes onto the stack
- let b=10 be the merge factor; M=∞
for (size = 1; size < M; size *= b) {
  if (there are b indexes with size docs on top of the stack) {
    pop them off the stack;
    merge them into a single index;
    push the merged index onto the stack;
  } else {
    break;
  }
}
- optimization: single-doc indexes kept in RAM, saves system calls
- notes:
- on average, b*log_b(N)/2 indexes
- N=1M, b=2 gives just 20 indexes
- fast to update and not too slow to search
- batch indexing w/ M=∞, merge all at end
- equivalent to external merge sort, optimal
- segment indexing w/ M<∞
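A minimal Java simulation of this merge policy (an illustrative sketch, not Lucene's actual implementation): segment sizes are kept on a stack, and whenever b segments of equal size accumulate on top, they are merged into one segment b times larger.

import java.util.ArrayDeque;
import java.util.Deque;

public class MergeSimulation {
  static final int B = 10;                                 // merge factor b
  static Deque<Integer> stack = new ArrayDeque<Integer>(); // segment sizes, top of stack first

  static void addDocument() {
    stack.push(1);                              // index the new document as a one-doc segment
    for (int size = 1; ; size *= B) {
      if (!topHasEqualSegments(size)) break;    // fewer than b segments of this size on top
      int merged = 0;
      for (int i = 0; i < B; i++) merged += stack.pop();
      stack.push(merged);                       // replace them with one merged segment
    }
  }

  static boolean topHasEqualSegments(int size) {
    int count = 0;
    for (int s : stack) {                       // walks the stack from the top down
      if (s != size || ++count == B) break;
    }
    return count == B;
  }

  public static void main(String[] args) {
    for (int i = 0; i < 1000; i++) addDocument();
    System.out.println(stack);                  // prints [1000]: fully merged after 1,000 docs
  }
}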
Indexing Diagram
- b = 3
- 11 documents indexed
- stack has four indexes
- grayed indexes have been deleted
- 5 merges have occurred
Index Compression
For keys in the Term -> ... map, use the technique from Paolo's slides
(in Lucene: prefix compression of the sorted term dictionary).
For values in the Term -> ... map, use the technique from Paolo's slides
(in Lucene: delta-encoded doc numbers and positions, written as VInts).
VInt Encoding Example
| Value  | First byte | Second byte | Third byte |
| 0      | 00000000   |             |            |
| 1      | 00000001   |             |            |
| 2      | 00000010   |             |            |
| ...    |            |             |            |
| 127    | 01111111   |             |            |
| 128    | 10000000   | 00000001    |            |
| 129    | 10000001   | 00000001    |            |
| 130    | 10000010   | 00000001    |            |
| ...    |            |             |            |
| 16,383 | 11111111   | 01111111    |            |
| 16,384 | 10000000   | 10000000    | 00000001   |
| 16,385 | 10000001   | 10000000    | 00000001   |
| ...    |            |             |            |
This provides compression while still being efficient to decode.
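A minimal sketch of the VInt coding shown above: seven payload bits per byte, low-order bytes first, with the high bit set on every byte except the last. (This matches the byte layout in the table; the classes here are just for illustration, not Lucene's own I/O classes.)

import java.io.ByteArrayOutputStream;

public class VIntSketch {
  static void writeVInt(ByteArrayOutputStream out, int value) {
    while ((value & ~0x7F) != 0) {       // more than 7 bits remain
      out.write((value & 0x7F) | 0x80);  // emit 7 bits, set the continuation bit
      value >>>= 7;
    }
    out.write(value);                    // final byte has its high bit clear
  }

  static int readVInt(byte[] in, int start) {
    int value = 0;
    for (int i = start, shift = 0; ; i++, shift += 7) {
      byte b = in[i];
      value |= (b & 0x7F) << shift;      // accumulate 7 payload bits
      if ((b & 0x80) == 0) return value; // high bit clear: last byte
    }
  }
}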
Package: org.apache.lucene.search
- primitive queries:
  - TermQuery: match docs containing a Term
  - PhraseQuery: match docs w/ sequence of Terms
  - BooleanQuery: match docs matching other queries,
    e.g., +path:pisa +content:"Doug Cutting" -path:nutch
- new: SpanQuery
- derived queries:
  - PrefixQuery, WildcardQuery, etc.
Example
Query pisa = new TermQuery(new Term("content", "pisa"));
Query babel = new TermQuery(new Term("content", "babel"));
PhraseQuery leaningTower = new PhraseQuery();
leaningTower.add(new Term("content", "leaning"));
leaningTower.add(new Term("content", "tower"));
BooleanQuery query = new BooleanQuery();
query.add(leaningTower, Occur.MUST);
query.add(pisa, Occur.SHOULD);
query.add(babel, Occur.MUST_NOT);
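To run the query built above, hand it to an IndexSearcher. A short sketch, assuming the Lucene 1.4-era Hits API, plus the "index" directory and stored "path" field from the earlier examples:

IndexSearcher searcher = new IndexSearcher("index");
Hits hits = searcher.search(query);
for (int i = 0; i < hits.length(); i++) {
  // "path" was stored at indexing time, so it can be read back from each hit
  System.out.println(hits.doc(i).get("path") + " score=" + hits.score(i));
}
searcher.close();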
Search Algorithms
From Paolo's slides:
Lucene's Disjunctive Search Algorithm
- described in http://lucene.sf.net/papers/riao97.ps
- since all postings must be processed
- goal is to minimize per-posting computation
- merges postings through a fixed-size array of accumulator buckets
- performs boolean logic with bit masks
- scales well with large queries
[ draw a diagram to illustrate? ]
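A rough sketch of the idea (illustration only, not Lucene's code): postings are merged, one window of documents at a time, into a fixed-size array of score accumulators, with a per-bucket bit mask recording which query terms matched. Postings is a hypothetical stand-in for a term's scored posting list.

import java.util.Arrays;
import java.util.List;

public class BucketScorerSketch {
  static final int WINDOW = 1 << 11;        // number of accumulator buckets

  // Hypothetical stand-in for one term's postings, sorted by document number.
  static class Postings {
    int[] docs; float[] scores; int pos;
    boolean exhausted() { return pos >= docs.length; }
  }

  // requiredMask has one bit per term that MUST match (0 for a pure disjunction).
  static void score(List<Postings> terms, int requiredMask, int maxDoc) {
    float[] score = new float[WINDOW];
    int[] matched = new int[WINDOW];
    for (int base = 0; base < maxDoc; base += WINDOW) {   // one window of docs at a time
      Arrays.fill(score, 0f);
      Arrays.fill(matched, 0);
      for (int t = 0; t < terms.size(); t++) {
        Postings p = terms.get(t);
        while (!p.exhausted() && p.docs[p.pos] < base + WINDOW) {
          int bucket = p.docs[p.pos] - base;              // cheap per-posting work
          score[bucket] += p.scores[p.pos];
          matched[bucket] |= 1 << t;                      // boolean logic via bit masks
          p.pos++;
        }
      }
      for (int i = 0; i < WINDOW; i++) {                  // harvest hits from this window
        if (matched[i] != 0 && (matched[i] & requiredMask) == requiredMask) {
          collect(base + i, score[i]);
        }
      }
    }
  }

  static void collect(int doc, float score) { /* e.g., keep the top N in a heap */ }
}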
Lucene's Conjunctive Search Algorithm
From Paolo's slides:
Algorithm
- use a linked list of pointers to doc lists
- initially sorted by doc
- loop:
  - if all pointers are at the same doc, record a hit
  - skip the first pointer to-or-past the last pointer's doc, then move it to the end of the list
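A minimal sketch of this loop (illustration only; Cursor is a hypothetical stand-in for a postings iterator with a skip-to operation). After recording a hit, the sketch also advances past it so the loop makes progress:

import java.util.Deque;

public class ConjunctionSketch {
  interface Cursor {
    int doc();                      // current document number
    boolean skipTo(int target);     // advance to the first doc >= target; false when exhausted
  }

  // cursors: one per query term, each positioned on its first doc, sorted by doc().
  static void intersect(Deque<Cursor> cursors) {
    while (true) {
      Cursor first = cursors.peekFirst();   // smallest current doc
      Cursor last = cursors.peekLast();     // largest current doc
      if (first.doc() == last.doc()) {      // all cursors agree: a hit
        hit(first.doc());
        if (!first.skipTo(first.doc() + 1)) return;   // advance past the hit
      } else if (!first.skipTo(last.doc())) {         // skip first to-or-past last
        return;
      }
      cursors.addLast(cursors.removeFirst());         // move it to the end of the list
    }
  }

  static void hit(int doc) { System.out.println("hit: " + doc); }
}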
Scoring
From Paolo's slides:
It is very much like Lucene's Similarity.
Lucene's Phrase Scoring
- approximate the phrase's IDF with the sum of its terms' IDFs
- compute the actual tf of the phrase in each document
- slop: slight mismatches are penalized, based on edit distance
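Roughly, as a sketch (not Lucene's exact implementation; the square-root tf and the 1/(distance+1) slop factor follow Lucene's default Similarity):

public class PhraseScoreSketch {
  // termIdfs: one idf value per term in the phrase;
  // matchDistances: the edit distance of each (possibly sloppy) phrase match in one document.
  static float phraseScore(float[] termIdfs, int[] matchDistances) {
    float idf = 0f;
    for (float termIdf : termIdfs) {
      idf += termIdf;                       // approximate the phrase idf by the sum of term idfs
    }
    float freq = 0f;
    for (int distance : matchDistances) {
      freq += 1f / (distance + 1);          // sloppier matches contribute less "frequency"
    }
    return (float) Math.sqrt(freq) * idf;   // then score the phrase much like a single term
  }
}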
Thanks!
And there's lots more to Lucene.
Check out
http://jakarta.apache.org/lucene/.
Finally, search for this talk on
Technorati.