Enhancing search engines with an XML WordNet server system

Date: 2007-02-09 Contact: linuxmine#gmail.com


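In the previous installment of this column, Uche Ogbuji introduced the WordNet natural-language database and showed how to represent its database nodes as XML and serve those XML documents over the Web. This article shows how to convert that XML to an RDF representation and how to use the WordNet XML server to improve search-engine technology.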

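In the previous two installments, "Querying WordNet as XML" and "Thinking XML: Serving up WordNet as XML," I presented code for XML-based processing of the WordNet project. WordNet represents an important line of research that runs parallel to the central theme of this column: Thinking XML discusses the semantics of XML, while WordNet offers a rough model of the semantics of natural language itself. I say "rough" without any slight intended, because thousands of years of language philosophy show how hard (perhaps even impossible) it is to build a truly rigorous model of natural language. The most widely used system for modeling Web-based systems (including XML) is RDF. I therefore round off the series by presenting an RDF representation of the XML WordNet system discussed so far. I also show how to use the WordNet XML representation and server to improve a search engine.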

RDF from the XML

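The discussion of the XML representation in the first article of this series covers all the WordNet information content you need in order to create an RDF/XML format. Start with Listing 1, an example synset (synonym set) for the word "selection".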

Listing 1. The first serialized synset for the word "selection"

    <?xml version="1.0" encoding="UTF-8"?>
    <noun xml:id="152253">
      <gloss>the act of choosing or selecting; "your choice of colors was
      unfortunate"; "you can take your pick"</gloss>
      <word-form>choice</word-form>
      <word-form>selection</word-form>
      <word-form>option</word-form>
      <word-form>pick</word-form>
      <hypernym part-of-speech="noun" target="32816"/>
      <frames part-of-speech="verb" target="653781"/>
      <frames part-of-speech="verb" target="656613"/>
      <frames part-of-speech="verb" target="652154"/>
      <hyponym part-of-speech="noun" target="152613"/>
      <hyponym part-of-speech="noun" target="152749"/>
      <hyponym part-of-speech="noun" target="152898"/>
      <hyponym part-of-speech="noun" target="153642"/>
      <hyponym part-of-speech="noun" target="154057"/>
      <hyponym part-of-speech="noun" target="170871"/>
      <hyponym part-of-speech="noun" target="173378"/>
    </noun>

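Now comes the hardest part: deciding which RDF schema to adopt. There has been a great deal of activity around RDF representations of WordNet. The most recent official work is the W3C's preliminary draft "Wordnet in RDFS and OWL," published in mid-2004. That work is far from complete, however, and other people and organizations, including the Chilean researcher Alvaro Graves, are already working to improve on it. (See Resources for more on these efforts.) I decided to build a lightweight RDF/XML representation that is compatible with the W3C work but leaves out its murkier parts. In this format, the equivalent of Listing 1 looks like Listing 2.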

Listing 2. RDF/XML representation of Listing 1

    <?xml version="1.0" encoding="UTF-8"?>
    <rdf:RDF xmlns:wn="http://uche.ogbuji.net/tech/rdf/wordnet/"
             xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <wn:SynSet rdf:about="noun/152253">
        <wn:glossaryEntry>
          the act of choosing or selecting; "your choice of colors was unfortunate";
          "you can take your pick"
        </wn:glossaryEntry>
        <wn:wordForm>choice</wn:wordForm>
        <wn:wordForm>selection</wn:wordForm>
        <wn:wordForm>option</wn:wordForm>
        <wn:wordForm>pick</wn:wordForm>
        <wn:hypernym rdf:resource="noun/32816"/>
        <wn:frames rdf:resource="verb/653781"/>
        <wn:frames rdf:resource="verb/656613"/>
        <wn:frames rdf:resource="verb/652154"/>
        <wn:hyponym rdf:resource="noun/152613"/>
        <wn:hyponym rdf:resource="noun/152749"/>
        <wn:hyponym rdf:resource="noun/152898"/>
        <wn:hyponym rdf:resource="noun/153642"/>
        <wn:hyponym rdf:resource="noun/154057"/>
        <wn:hyponym rdf:resource="noun/170871"/>
        <wn:hyponym rdf:resource="noun/173378"/>
      </wn:SynSet>
    </rdf:RDF>

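The W3C has not yet established a namespace for WordNet in RDF, so for now I have chosen http://uche.ogbuji.net/tech/rdf/wordnet/. I name the relationships that link synsets after the corresponding WordNet pointers (hypernym, frames, and so on). The W3C working group seems to lean toward specialized property names (such as hypernymOf), which I don't think is a good idea, because the schema would have to be revised every time WordNet introduces a new pointer-based relationship.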

Performing the transform

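I have long been interested in determining how far XML itself can serve as the source format for an RDF model, without all the cumbersome RDF/XML syntax. An earlier installment (see Resources) covered this point. The most workable approach I have found is to use XSLT to transform the XML into RDF/XML and then import the result into an RDF model. Ideally, with a tool such as the 4Suite repository (see Resources), you need not think about the intermediate RDF/XML format at all. You have already seen the RDF/XML format to be used for WordNet (Listing 2), so all that is needed is an XSLT that transforms the XML into that RDF/XML format. Listing 3 is such a transform.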

Listing 3. XSLT to transform WordNet XML to the RDF/XML format

    <?xml version="1.0" encoding="utf-8"?>
    <xsl:transform version="1.0"
      xmlns:xsl = "http://www.w3.org/1999/XSL/Transform"
      xmlns:rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:wn = "http://uche.ogbuji.net/tech/rdf/wordnet/"
      xml:base = "http://uche.ogbuji.net/tech/rdf/wordnet/"
    >
      <xsl:output indent="yes"/>

      <xsl:template match="/">
        <rdf:RDF>
          <xsl:apply-templates/>
        </rdf:RDF>
      </xsl:template>

      <xsl:template match="noun|verb|adjective|adverb">
        <wn:SynSet rdf:about="{name()}/{@xml:id}">
          <xsl:apply-templates/>
        </wn:SynSet>
      </xsl:template>

      <xsl:template match="gloss">
        <wn:glossaryEntry><xsl:copy-of select="node()"/></wn:glossaryEntry>
      </xsl:template>

      <xsl:template match="word-form">
        <wn:wordForm><xsl:value-of select="."/></wn:wordForm>
      </xsl:template>

      <xsl:template match="*">
        <xsl:element namespace="http://uche.ogbuji.net/tech/rdf/wordnet/"
                     name="wn:{name()}">
          <xsl:attribute namespace="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                         name="rdf:resource">
            <xsl:value-of select="concat(@part-of-speech, '/', @target)"/>
          </xsl:attribute>
        </xsl:element>
      </xsl:template>

    </xsl:transform>
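For readers without an XSLT processor at hand, the same mapping can be sketched with the Python standard library. This is only an illustration of the rules in Listing 3, not part of the original column: the function name synset_to_rdf and the abbreviated SAMPLE document are assumptions made for this example, and gloss content is copied as plain text only.

```python
import xml.etree.ElementTree as ET

WN = "http://uche.ogbuji.net/tech/rdf/wordnet/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Ask ElementTree to serialize with the same prefixes as Listing 2
ET.register_namespace("wn", WN)
ET.register_namespace("rdf", RDF)

# Abbreviated version of Listing 1, for illustration only
SAMPLE = """<noun xml:id="152253">
  <gloss>the act of choosing or selecting</gloss>
  <word-form>choice</word-form>
  <word-form>selection</word-form>
  <hypernym part-of-speech="noun" target="32816"/>
  <hyponym part-of-speech="noun" target="152613"/>
</noun>"""


def synset_to_rdf(synset_xml):
    """Map one serialized synset to an rdf:RDF element (hypothetical helper)."""
    src = ET.fromstring(synset_xml)
    root = ET.Element("{%s}RDF" % RDF)
    # rdf:about is "<part of speech>/<id>", as in the XSLT
    about = "%s/%s" % (src.tag, src.get("{%s}id" % XML_NS))
    synset = ET.SubElement(root, "{%s}SynSet" % WN, {"{%s}about" % RDF: about})
    for child in src:
        if child.tag == "gloss":
            ET.SubElement(synset, "{%s}glossaryEntry" % WN).text = child.text
        elif child.tag == "word-form":
            ET.SubElement(synset, "{%s}wordForm" % WN).text = child.text
        else:
            # Every other element is treated as a WordNet pointer
            target = "%s/%s" % (child.get("part-of-speech"), child.get("target"))
            ET.SubElement(synset, "{%s}%s" % (WN, child.tag),
                          {"{%s}resource" % RDF: target})
    return root


if __name__ == "__main__":
    print(ET.tostring(synset_to_rdf(SAMPLE), encoding="unicode"))
```

As in the XSLT, the catch-all branch means that a new pointer type needs no code change.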

Giving search a boost

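In a much earlier article, I showed how to use an RDF-based WordNet database to add some natural-language capability to an application-specific search engine. I built all of WordNet's synsets into the RDF representation shown in Listing 2 and then ran a similar monolithic query against the resulting database. This time, however, I take a different approach: recursively querying XML from the WordNet server developed in the previous installment.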

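The problem to solve is this: given a search term, you want to widen the search to certain closely related words. If a user searches for the word "selection", the code should also return results containing words such as "vote", "choice", and "ballot". This means querying the WordNet server starting from the search term and then recursively following the pointers to related words. This article restricts the pointers to hyponyms, which, roughly speaking, lead to other words that express more specific concepts related to the search term.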
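The recursion just described can be sketched without any server by standing in a small in-memory table for WordNet. Everything in this toy example is an assumption for illustration: the synset IDs, the grouping of words into synsets, and the names SYNSETS, SYNSETS_BY_WORD, and expand_hyponyms do not come from the real database.

```python
# Toy stand-in for WordNet: hypothetical IDs and groupings, for illustration only
SYNSETS = {
    "noun/1": {"word_forms": ["choice", "selection", "option", "pick"],
               "hyponyms": ["noun/2", "noun/3"]},
    "noun/2": {"word_forms": ["vote", "ballot", "voting", "balloting"],
               "hyponyms": ["noun/4"]},
    "noun/3": {"word_forms": ["sortition", "casting lots", "drawing lots"],
               "hyponyms": []},
    "noun/4": {"word_forms": ["secret ballot"], "hyponyms": []},
}

# Index from a word to the synsets that contain it
SYNSETS_BY_WORD = {}
for synset_id, synset in SYNSETS.items():
    for form in synset["word_forms"]:
        SYNSETS_BY_WORD.setdefault(form, []).append(synset_id)


def expand_hyponyms(word):
    """Return the word forms reachable from `word` by following hyponym pointers."""
    seen = set()    # synset IDs already visited, so shared hyponyms are walked once
    terms = set()

    def visit(synset_id):
        if synset_id in seen:
            return
        seen.add(synset_id)
        synset = SYNSETS[synset_id]
        terms.update(synset["word_forms"])
        for target in synset["hyponyms"]:
            visit(target)

    for synset_id in SYNSETS_BY_WORD.get(word, ()):
        visit(synset_id)
    return terms


if __name__ == "__main__":
    print(sorted(expand_hyponyms("selection")))
```

Starting from "ballot" instead of "selection" yields only the more specific terms below it, which is the narrowing behavior that hyponym pointers provide.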

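This calls for a refinement of the code from the previous installment: specifically, the WordNet server must be able to return the raw XML representation of a synset, rather than only complete word-form results. Listing 4 is a specialization of the CherryPy Web server shown in Listing 1 of the previous installment. It can return raw synset XML from URLs based on WordNet pointers, such as http://localhost:8080/raw/pointer/noun/5955443.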
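On the client side, that URL scheme and the raw XML it returns can be handled with a few small helpers. The following is a hedged sketch in current Python using only the standard library; the helper names (word_url, pointer_url, parse_synsets) are assumptions for this example, and the /raw/ word URL follows the way the server in this article mounts its handlers.

```python
from urllib.parse import quote
import xml.etree.ElementTree as ET

BASEURI = "http://localhost:8080/"  # adjust to wherever the server is deployed


def word_url(word, base=BASEURI):
    """URL for the raw synsets of a word; spaces and similar characters are escaped."""
    return base + "raw/" + quote(word)


def pointer_url(pos, target, base=BASEURI):
    """URL for the raw synset that a WordNet pointer refers to."""
    return "%sraw/pointer/%s/%s" % (base, pos, target)


def parse_synsets(xml_text):
    """Return (word forms, hyponym pointers) found anywhere in a raw XML response."""
    root = ET.fromstring(xml_text)
    word_forms = [e.text for e in root.iter("word-form")]
    pointers = [(e.get("part-of-speech"), e.get("target"))
                for e in root.iter("hyponym")]
    return word_forms, pointers


# Abbreviated synset in the Listing 1 format, for illustration only
SAMPLE = """<noun xml:id="152253">
  <word-form>choice</word-form>
  <word-form>selection</word-form>
  <hyponym part-of-speech="noun" target="152613"/>
  <hyponym part-of-speech="noun" target="152749"/>
</noun>"""

if __name__ == "__main__":
    print(pointer_url("noun", 5955443))
    print(parse_synsets(SAMPLE))
```

Each (part of speech, target) pair returned by parse_synsets can be passed straight back to pointer_url to fetch the next synset in the chain.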

Listing 4. Specialization of the CherryPy Web server from the previous installment

    import cherrypy
    from picket import Picket, PicketFilter
    from wnxmllib import *

    class root:
        _cpFilterList = [ PicketFilter(defaultStylesheet="viewword.xslt") ]

    class wordform_handler:
        def __init__(self, applyxslt=False):
            self.applyxslt = applyxslt
            return

        @cherrypy.expose
        def default(self, word):
            synsets = serialized_synsets_for_word(word)
            result = ''.join(synsets) #Concatenate strings in result list
            #Wrap up the XML fragments into a full document
            wordxml = '<word-senses text="'+word+'">'+result+'</word-senses>'
            if self.applyxslt:
                picket = Picket()
                picket.document = wordxml
                return picket #apply the XSLT and return the result
            return wordxml

    class pointer_handler:
        def __init__(self, applyxslt=False):
            self.applyxslt = applyxslt
            return

        @cherrypy.expose
        def default(self, pos, target):
            synset = getSynset(pos, int(target))
            synsetxml = serialize_synset(synset)
            if self.applyxslt:
                picket = Picket()
                picket.document = synsetxml
                return picket #apply the XSLT and return the result
            return synsetxml

    cherrypy.root = root()
    cherrypy.root.view = wordform_handler(applyxslt=True)
    cherrypy.root.raw = wordform_handler()
    cherrypy.root.pointer = pointer_handler(applyxslt=True)
    cherrypy.root.raw.pointer = pointer_handler()

    #Disable debugging messages in Web responses
    cherrypy.config.update({'logDebugInfoFilter.on': False})
    cherrypy.server.start()

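Listing 5 is client code that takes a word and returns its hyponym chain as a Python set. It requires that the WordNet XML server be running with the updated code from Listing 4. The code requires 4Suite XML 1.0b3 (see Resources). You can also use other XML/XPath processing libraries by changing the import statements and API calls accordingly.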

Listing 5. Client code that takes a word and returns the hyponym chain in a Python set

    import sets
    import urllib
    from Ft.Xml import Parse

    BASEURI = 'http://localhost:8080/'

    def get_hyponym_chain(word):
        '''
        returns a list with all the hyponyms of a word, the hyponyms
        of those hyponyms, and so on, recursively
        '''
        accumulator = []  #Storage list for the hyponym chain

        def process(xml):
            '''
            extract the hyponym chain from a DOM node.  Common processing for
            word-form and synset XML
            '''
            hyponyms = xml.xpath(u'//hyponym')
            wforms = [e.xpath(u'string()')
                      for e in xml.xpath(u'//word-form')]
            accumulator.extend(wforms)
            for hyponym in hyponyms:
                pos = hyponym.xpath(u'string(@part-of-speech)')
                target = hyponym.xpath(u'string(@target)')
                expand_hyponyms(pos, target, accumulator)
            return

        def expand_hyponyms(pos, target, accumulator):
            '''
            follow a pointer and extract the hyponym chain from the resulting XML
            '''
            synsetxml = Parse(BASEURI + 'raw/pointer/' + pos + '/' + target)
            process(synsetxml)
            return

        #escape any spaces or other problem characters in the word
        word = urllib.quote(word)
        wordxml = Parse(BASEURI + 'raw/' + word)
        process(wordxml)
        return sets.Set(accumulator) #eliminate dupes

    if __name__ == '__main__':
        #If invoked from the command line, get the hyponym chain from the
        #word given in the command-line parameters
        import sys, pprint
        print get_hyponym_chain(' '.join(sys.argv[1:]))

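You may need to edit BASEURI to match your deployment of the WordNet XML server. If you run the program from the command line with the argument "selection", you get the following result, a set of Unicode strings (edited for formatting):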

    Set([u'selection', u'move', u'co-option', u'cut', u'secret ballot',
    u'manoeuvre', u'juke', u'casting', u'survival', u'Haftarah', u'choice',
    u'cutting', u'safe harbor', u'suicide pill', u'shark repellent',
    u'scorched-earth policy', u'designation', u'delegacy', u'casting vote',
    u'press cutting', u'demarche', u'epigraph', u'security', u'balloting',
    u'precaution', u'tactical maneuver', u'fast one', u'split ticket',
    u'clipping', u'stratified sampling', u'artifice', u'excerpt', u'greenmail',
    u'election', u'parking', u'naming', u'write-in', u'extract', u'recognition',
    u'pocket veto', u'survival of the fittest', u'decision', u'track', u'quote',
    u'Haphtarah', u'shtik', u'schtik', u'maneuver', u'analecta', u'veto',
    u'citation', u'analects', u'step', u'representative sampling', u'favorite',
    u'colouration', u'sortition', u'drawing lots', u'pick', u'schtick',
    u'gambit', u'stratagem', u'misquotation', u'cumulative vote', u'sampling',
    u'guard', u'laying on of hands', u'fake', u'vote', u'gimmick',
    u'countermine', u'ploy', u'intention', u'willing', u'call', u'lucky dip',
    u'way', u'footwork', u'option', u'Haphtorah', u'feint', u'determination',
    u'safeguard', u'security measures', u'quotation', u'press clipping',
    u'ruse', u'trick', u'Haftorah', u'porcupine provision', u'shtick',
    u'measure', u'straight ticket', u'twist', u'mimesis', u'appointment',
    u'volition', u'random sampling', u'ordinance', u'ballot', u'poison pill',
    u'pac-man strategy', u'proportional sampling', u'conclusion',
    u'multiple voting', u'golden parachute', u'favourite', u'assignment',
    u'block vote', u'device', u'nomination', u'coloration', u'casting lots',
    u'newspaper clipping', u'co-optation', u'ordination', u'natural selection',
    u'tactical manoeuvre', u'pleasure', u'voting', u'resolution',
    u'countermeasure'])
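A set like this can then be used to broaden a search. The sketch below is one naive way to do so, assuming a search engine that is nothing more than a scan over in-memory strings; the function names and the sample documents are invented for illustration, and a real engine would feed the expanded terms into its own query syntax instead.

```python
import re


def matches(document, terms):
    """True if the document contains any of the terms as a whole word or phrase."""
    text = document.lower()
    return any(re.search(r"\b" + re.escape(term.lower()) + r"\b", text)
               for term in terms)


def expanded_search(documents, terms):
    """Return the documents that match at least one expanded search term."""
    return [doc for doc in documents if matches(doc, terms)]


if __name__ == "__main__":
    # A handful of the terms returned for "selection"
    terms = {"selection", "choice", "vote", "ballot", "secret ballot"}
    # Invented sample documents
    documents = [
        "The committee held a secret ballot on Tuesday.",
        "Rain is expected across the region tomorrow.",
        "Each member casts one vote per motion.",
    ]
    for doc in expanded_search(documents, terms):
        print(doc)
```

Neither matching document contains the word "selection" itself, so both are found only because of the hyponym expansion.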

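One thing to note: when I tried this kind of synonym-driven search against a large RDF database four years ago, I ran into many performance problems. The search for "selection" and its hyponyms took more than two minutes. Retried on a current computer, that approach took about one minute. With the new approach presented in this article, using the WordNet XML server, the search took under a second on the same machine, even though it involves several local HTTP requests. This is because the WordNet XML server takes advantage of the specialized hashing and indexing that the WordNet database uses, rather than querying a general-purpose RDF database. It also takes a divide-and-conquer approach, processing XML in small pieces rather than in one monolithic query. To stress the point once more: RDF is sometimes a very good processing model, but it is not the most workable syntax, nor the best storage form for a general-purpose DBMS.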

Wrap-up

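This is the last of three articles on using the WordNet natural-language semantic database in XML and RDF applications. I hope it has given you some tools for building such applications. Princeton's English-language WordNet project is an enormous public service and continues to make steady progress. Unfortunately, similar work for other languages has been slow to appear. EuroWordNet (see Resources), launched in 2001, is doing similar work for several European languages, but I have not seen comparable activity for East Asian or other language regions. If you know of such efforts, or have other thoughts on this topic, please share them on the Thinking XML discussion forum.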

