
Lucene lecture at Pisa

Lucene



Doug Cutting
cutting@apache.org

November 24 2004
University of Pisa


Prelude

  • my background...
  • please interrupt with questions
  • blog this talk now so that we can search for it later
    (using a Lucene-based blog search engine, of course)
  • In this course, Paolo and Antonio have presented many techniques.
  • I present real software that uses many of these techniques.

Lucene is


Lucene Architecture


[draw on whiteboard for reference throughout talk]

Lucene API

  • org.apache.lucene.document
  • org.apache.lucene.analysis
  • org.apache.lucene.index
  • org.apache.lucene.search


Package: org.apache.lucene.document

  • A Document is a sequence of Fields.
  • A Field is a <name, value> pair.
    • name is the name of the field, e.g., title, body, subject, date, etc.
    • value is text.
  • Field values may be stored, indexed or analyzed (and, now, vectored).

Example

public Document makeDocument(File f) throws FileNotFoundException {
  Document doc = new Document();
  doc.add(new Field("path", f.getPath(), Store.YES, Index.TOKENIZED));

  doc.add(new Field("modified",
                    DateTools.timeToString(f.lastModified(), Resolution.MINUTE),
                    Store.YES, Index.UN_TOKENIZED));

  // Reader implies Store.NO and Index.TOKENIZED
  doc.add(new Field("contents", new FileReader(f)));

  return doc;
}

Example (continued)

field      stored   indexed   analyzed
path       yes      yes       yes
modified   yes      yes       no
contents   no       yes       yes


Package: org.apache.lucene.analysis

  • An Analyzer is a TokenStream factory.
  • A TokenStream is an iterator over Tokens.
    • input is a character iterator (Reader)
  • A Token is a tuple <text, type, start, length, positionIncrement>
    • text (e.g., “pisa”).
    • type (e.g., “word”, “sent”, “para”).
    • start & length offsets, in characters (e.g., <5,4>)
    • positionIncrement (normally 1)
  • standard TokenStream implementations are
    • Tokenizers, which divide characters into tokens and
    • TokenFilters, e.g., stop lists, stemmers, etc.

Example

public class ItalianAnalyzer extends Analyzer {
  private Set stopWords =
    StopFilter.makeStopSet(new String[] {"il", "la", "in"});

  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new WhitespaceTokenizer(reader);
    result = new LowerCaseFilter(result);
    result = new StopFilter(result, stopWords);
    result = new SnowballFilter(result, "Italian");
    return result;
  }
}

Package: org.apache.lucene.index

  • Term is <fieldName, text>
  • index maps Term → <df, <docNum, <position>* >*>
  • e.g., “content:pisa” → <2, <2, <14>>, <4, <2, 9>>>
  • new: term vectors!

Example

IndexWriter writer = new IndexWriter("index", new ItalianAnalyzer());
File[] files = directory.listFiles();
for (int i = 0; i < files.length; i++) {
  writer.addDocument(makeDocument(files[i]));
}
writer.close();
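
To make the Term → <df, <docNum, <position>*>*> mapping concrete, here is a minimal sketch using the classic IndexReader/TermPositions API of this Lucene era (the index path and term are illustrative) that prints the postings for one term:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermPositions;

public class PostingsDump {
  public static void main(String[] args) throws IOException {
    IndexReader reader = IndexReader.open("index");            // illustrative path
    Term term = new Term("contents", "pisa");
    System.out.println("df = " + reader.docFreq(term));
    TermPositions tp = reader.termPositions(term);
    while (tp.next()) {                                        // one entry per matching doc
      System.out.print("  doc=" + tp.doc() + " tf=" + tp.freq() + " positions=");
      for (int i = 0; i < tp.freq(); i++) {
        System.out.print(tp.nextPosition() + " ");
      }
      System.out.println();
    }
    tp.close();
    reader.close();
  }
}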

Some Inverted Index Strategies

  1. batch-based: use file-sorting algorithms (textbook)
      + fastest to build
      + fastest to search
      - slow to update
  2. b-tree based: update in place (http://lucene.sf.net/papers/sigir90.ps)
      + fast to search
      - update/build does not scale
      - complex implementation
  3. segment based: lots of small indexes (Verity)
      + fast to build
      + fast to update
      - slower to search

Lucene's Index Algorithm

  • two basic algorithms:
    1. make an index for a single document
    2. merge a set of indices
  • incremental algorithm:
    • maintain a stack of segment indices
    • create index for each incoming document
    • push new indexes onto the stack
    • let b=10 be the merge factor; M=∞ (a Java sketch of this loop follows the list)
      for (size = 1; size < M; size *= b) {
        if (there are b indexes with size docs on top of the stack) {
         pop them off the stack;
         merge them into a single index;
         push the merged index onto the stack;
       } else {
         break;
       }
      }
  • optimization: single-doc indexes kept in RAM, saves system calls
  • notes:
    • average b*logb(N)/2 indexes
      • N=1M, b=2 gives just 20 indexes
      • fast to update and not too slow to search
    • batch indexing  w/ M=∞, merge all at end
      • equivalent to external merge sort, optimal
    • segment indexing w/ M<∞
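
A minimal, self-contained Java sketch of the merge loop above, tracking only segment sizes on a stack (the class and variable names are illustrative, not Lucene internals):

import java.util.ArrayList;
import java.util.List;

public class MergeSketch {
  static final int B = 10;                                  // merge factor b
  static final long M = Long.MAX_VALUE;                     // M = ∞ (no merge bound)
  static final List<Long> stack = new ArrayList<Long>();    // segment sizes, in docs

  // index one incoming document: push a 1-doc segment, then merge while possible
  static void addDocument() {
    stack.add(1L);
    for (long size = 1; size < M; size *= B) {
      int n = stack.size();
      if (n < B) break;
      boolean mergeable = true;                             // are the top b segments all this size?
      for (int i = n - B; i < n; i++) {
        if (stack.get(i) != size) { mergeable = false; break; }
      }
      if (!mergeable) break;
      for (int i = 0; i < B; i++) stack.remove(stack.size() - 1);   // pop the b segments
      stack.add(size * B);                                  // push the merged segment
    }
  }

  public static void main(String[] args) {
    for (int i = 0; i < 1000; i++) addDocument();
    System.out.println(stack);                              // prints [1000]: ten 1s → a 10, ten 10s → a 100, ...
  }
}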

Indexing Diagram

  • b = 3
  • 11 documents indexed
  • stack has four indexes
  • grayed indexes have been deleted
  • 5 merges have occurred


Index Compression

For keys in Term -> ... map, use technique from Paolo's slides:


For values in Term -> ... map, use technique from Paolo's slides:



VInt Encoding Example

Value     First byte   Second byte   Third byte
0         00000000
1         00000001
2         00000010
...
127       01111111
128       10000000     00000001
129       10000001     00000001
130       10000010     00000001
...
16,383    11111111     01111111
16,384    10000000     10000000     00000001
16,385    10000001     10000000     00000001
...




This provides compression while still being efficient to decode.
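
A minimal Java sketch of this variable-byte scheme, matching the table above: the low seven bits of each byte carry data (least-significant bits first) and the high bit means another byte follows. The helper names are illustrative, not Lucene's own I/O API.

import java.io.ByteArrayOutputStream;

public class VIntSketch {
  // encode a non-negative int, 7 data bits per byte, low-order bits first
  static byte[] writeVInt(int value) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    while ((value & ~0x7F) != 0) {       // more than 7 bits remain
      out.write((value & 0x7F) | 0x80);  // set the continuation bit
      value >>>= 7;
    }
    out.write(value);                    // final byte has a clear high bit
    return out.toByteArray();
  }

  // decode: accumulate 7 bits per byte until a byte with a clear high bit
  static int readVInt(byte[] bytes) {
    int value = 0, shift = 0;
    for (byte b : bytes) {
      value |= (b & 0x7F) << shift;
      shift += 7;
      if ((b & 0x80) == 0) break;
    }
    return value;
  }

  public static void main(String[] args) {
    int[] samples = {0, 1, 127, 128, 16383, 16384, 16385};
    for (int v : samples) {
      byte[] enc = writeVInt(v);
      System.out.println(v + " -> " + enc.length + " byte(s), decodes to " + readVInt(enc));
    }
  }
}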


Package: org.apache.lucene.search

  • primitive queries:
    • TermQuery: match docs containing a Term
    • PhraseQuery: match docs w/ sequence of Terms
    • BooleanQuery: match docs matching other queries.
      e.g., +path:pisa +content:“Doug Cutting” -path:nutch
  • new: SpansQuery
  • derived queries:
    • PrefixQuery, WildcardQuery, etc.

Example

Query pisa = new TermQuery(new Term("content", "pisa"));
Query babel = new TermQuery(new Term("content", "babel"));

PhraseQuery leaningTower = new PhraseQuery();
leaningTower.add(new Term("content", "leaning"));
leaningTower.add(new Term("content", "tower"));

BooleanQuery query = new BooleanQuery();
query.add(leaningTower, Occur.MUST);
query.add(pisa, Occur.SHOULD);
query.add(babel, Occur.MUST_NOT);
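
To run the query built above, a minimal sketch with the Hits-based search API of this Lucene era (the index directory name is illustrative):

IndexSearcher searcher = new IndexSearcher("index");
Hits hits = searcher.search(query);
for (int i = 0; i < hits.length(); i++) {
  // print the score and the stored "path" field of each matching document
  System.out.println(hits.score(i) + "\t" + hits.doc(i).get("path"));
}
searcher.close();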

Search Algorithms

From Paolo's slides:


Lucene's Disjunctive Search Algorithm

  • described in http://lucene.sf.net/papers/riao97.ps
  • since all postings must be processed
    • goal is to minimize per-posting computation
  • merges postings through a fixed-size array of accumulator buckets
  • performs boolean logic with bit masks
  • scales well with large queries

[ draw a diagram to illustrate? ]
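
Not Lucene's actual code, but a self-contained sketch of the accumulator-bucket idea: each term's postings are merged into an array of per-document score buckets, so the per-posting work is a single add. (The real scorer uses a fixed-size window of buckets and bit masks for required/prohibited clauses; here the array simply spans all documents for brevity.)

public class DisjunctionSketch {
  // merge each term's postings into per-document score buckets
  static float[] score(int maxDoc, int[][] postings, float[] weights) {
    float[] buckets = new float[maxDoc];       // one accumulator per document
    for (int t = 0; t < postings.length; t++) {
      for (int doc : postings[t]) {
        buckets[doc] += weights[t];            // per-posting work: one add
      }
    }
    return buckets;                            // non-zero buckets are candidate hits
  }

  public static void main(String[] args) {
    int[][] postings = { {0, 3, 7}, {3, 5} };  // doc numbers for two query terms
    float[] weights  = { 1.0f, 2.0f };         // per-term query weights
    float[] buckets  = score(8, postings, weights);
    for (int d = 0; d < buckets.length; d++) {
      if (buckets[d] > 0) System.out.println("doc " + d + " score " + buckets[d]);
    }
  }
}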

Lucene's Conjunctive Search Algorithm

From Paolo's slides:


Algorithm
  • use linked list of pointers to doc list
  • initially sorted by doc
  • loop
    • if all are at same doc, record hit
    • skip first to-or-past last and move to end of list
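
A minimal Java sketch of that loop (an array of cursors rather than a linked list; the postings below are illustrative): record a hit when all cursors sit on the same document, otherwise skip the cursor that is furthest behind to or past the one furthest ahead.

import java.util.ArrayList;
import java.util.List;

public class ConjunctionSketch {
  static List<Integer> intersect(int[][] postings) {
    int n = postings.length;
    int[] pos = new int[n];                        // cursor into each sorted postings list
    List<Integer> hits = new ArrayList<Integer>();
    while (true) {
      int lo = 0, hi = 0;                          // cursors with smallest / largest current doc
      for (int i = 0; i < n; i++) {
        if (pos[i] >= postings[i].length) return hits;   // one list exhausted: done
        if (postings[i][pos[i]] < postings[lo][pos[lo]]) lo = i;
        if (postings[i][pos[i]] > postings[hi][pos[hi]]) hi = i;
      }
      int target = postings[hi][pos[hi]];
      if (postings[lo][pos[lo]] == target) {       // all cursors at the same doc: a hit
        hits.add(target);
        pos[lo]++;
      } else {
        while (pos[lo] < postings[lo].length && postings[lo][pos[lo]] < target) {
          pos[lo]++;                               // skip to or past the last cursor's doc
        }
      }
    }
  }

  public static void main(String[] args) {
    int[][] postings = { {1, 3, 5, 9}, {3, 5, 6, 9}, {2, 3, 9} };
    System.out.println(intersect(postings));       // prints [3, 9]
  }
}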

Scoring

From Paolo's slides:

This is very much like Lucene's Similarity.

Lucene's Phrase Scoring

  • approximate phrase IDF with sum of terms
  • compute actual tf of phrase
  • slop penalizes slight mismatches by edit-distance
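
As a small example of the slop setting (PhraseQuery.setSlop is the real API; the field and terms are illustrative): with slop 2 the query also matches transposed or slightly separated terms, at a score penalty that grows with the edit distance.

PhraseQuery tower = new PhraseQuery();
tower.add(new Term("content", "leaning"));
tower.add(new Term("content", "tower"));
tower.setSlop(2);   // exact phrase scores highest; looser matches are penalized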

Thanks!

And there's lots more to Lucene.
Check out http://jakarta.apache.org/lucene/.

Finally, search for this talk on Technorati.

