Outgoing: MapReduce

Jeffrey Dean and Sanjay Ghemawat of Google have written a paper about a method of processing large data sets they call MapReduce.

Many will be familiar with the functional programming constructs of map and reduce.  Map applies a function to each element of a list to produce a transformed version of the list.  For example, in Python, map(chr, [97,98,99]) transforms a list of three numbers into a list containing the equivalent characters:

>>> map(chr, [97,98,99])
['a', 'b', 'c']

It's as if you executed [chr(97),chr(98),chr(99)].

Reduce takes a function of two arguments and applies it cumulatively to the items of a list, collapsing them to a single value:

>>> import operator
>>> reduce(operator.add, ['a','b','c'])
'abc'

This is the string formed by the operations ('a'+'b')+'c'.  This programming style lends itself naturally to nesting:

>>> reduce(operator.add, map(chr, [97,98,99]))
'abc'

The functional aspects of these operations are similar to Unix filters, where files get piped from one filter to another.  Here's a pipeline that takes a file of MARC21 records, transforms the end-of-record markers to line feeds, selects the records containing the word 'smollet', and then counts them:

cat clinker.marcu | tr '\035' '\n' | grep -iw 'smollet' | wc -l

Comparing this to map/reduce, the cat, tr, and grep commands are similar to map, and the wc command to reduce.
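
To make the analogy concrete, here's a rough Python equivalent of that pipeline.  This is a sketch only: the filename is the one from the example above, and grep -iw matches whole words, which the substring test here only approximates:

records = open('clinker.marcu').read().split('\035')         # cat + tr: split on end-of-record
matches = filter(lambda r: 'smollet' in r.lower(), records)  # grep -i, approximately
print len(matches)                                           # wc -l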

In the Google model, the map function takes a set of key/value pairs and produces a new set of key/value pairs, based on a function supplied by the programmer.  The reduce function collapses all the values for a given key to a single value.  Google has found that a robust implementation of this model, running in a massively parallel environment (thousands of nodes), makes it possible to routinely process huge files in many different ways.  The slides offer a good overview of their work.
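
In outline, the data flow fits on one line.  Here's a minimal single-machine sketch (mapReduce is not Google's API, just a name for this post; the group step, which collects the values for each key, is defined in the full listing at the end):

def mapReduce(myMap, myReduce, pairs):
    return myReduce(group(myMap(pairs)))  # map, group by key, then reduce -- all sequential

Google's contribution is running the map and reduce phases across thousands of machines; the sketch only captures the shape of the computation.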

Here's a more involved example written in Python that closely follows the Google approach:

First we need a list to process as input:

((1,'boy'),(2,'dog'),(3,'cat'),(4,'aardvark'),(5,'cat'))

This is a list of 5 key-value pairs.  You might think of the key as a record number and the string as the record.

Here's our map function.  It takes a list of key-value pairs, such as our input, and for each string containing an 'a' it emits a new pair with the string as the key and the record number as the value:

def myMap(gen): return ((v,k) for k,v in gen if v.find('a') != -1)

For our input list, this returns:

(('cat', 3), ('aardvark', 4), ('cat', 5))

Next this list gets grouped so that all the record numbers for each word are collected together.  You can find the code to do this at the end of the post.  Here's the grouped list it outputs:

(('aardvark', [4]), ('cat', [3, 5]))

This shows that 'aardvark' occurred in record 4, and 'cat' in records 3 and 5.

Here's a reduce function that outputs each word with a count:

def myReduce(gen): return ((k, len(v)) for k,v in gen)

From the grouped results this will generate:

(('aardvark', 1), ('cat', 2))

What Google has done is take the map/reduce paradigm and make it work in parallel, in their environment, on thousands of millions of records.  Our work with our own (somewhat smaller scale) Beowulf cluster made us think we could usefully apply many of their concepts to our own processing of tens of millions of bibliographic records.  (Actually OCLC has more than a thousand million records, but we don't maintain those online yet.)

More on our own work with MapReduce (which we are doing in Python) in a subsequent post.

I imagine there are a number of implementations of MapReduce.  The Nutch project has a Java implementation.

Here is all the code in one spot, including the group function that is run between map and reduce:

input = ((1,'boy'),(2,'dog'),(3,'cat'),(4,'aardvark'),(5,'cat'))

def myMap(gen): return ((v,k) for k,v in gen if v.find('a') != -1)

def myReduce(gen): return ((k, len(v)) for k,v in gen)

def group(gen):                        # accept a sequence of key,value pairs
    sl = sorted(list(gen))             # sort so equal keys end up adjacent
    if not sl: return                  # might be empty
    rkey, rlist = sl[0][0], [sl[0][1]] # current key and its list of values
    for k,v in sl[1:]:                 # process the rest of the sorted list
        if k==rkey:
            rlist.append(v)            # extend the list for this key
        else:
            yield (rkey, rlist)        # output key & list
            rkey, rlist = k, [v]       # start the next key & list
    yield (rkey, rlist)                # output the last key & list

print tuple(myReduce(group(myMap(input))))
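
As an aside, the grouping could also be done with a dictionary rather than a sort.  Here's a minimal alternative sketch (it returns keys in arbitrary order, where the version above yields them in key order):

def groupDict(gen):
    d = {}
    for k,v in gen:
        d.setdefault(k, []).append(v)  # collect all the values for each key
    return d.items()

The Google paper describes sorting by intermediate key between the map and reduce phases, so the sorted version above is closer to their approach.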

--Th & Jenny Toves
