| civil | 3 | 8 | 22 | 56 | 68 | 92 | |
| war | 2 | 8 | 15 | 22 | 68 | 77 | |
| both words | 8 | 22 | 68 |
Arranging the documents this way makes it clear that the words "civil" and "war" appear together in three documents (8, 22, and 68). The list of documents that contain a word is called a "posting list," and finding the documents that contain both words is called "intersecting" the posting lists. (A fast way to intersect two posting lists is to walk down both at the same time. If one list skips from 22 to 68, you can skip ahead to document 68 on the other list as well.)
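For readers who like to see the idea in code, here is a minimal Python sketch of that "walk down both lists at the same time" approach. It assumes each posting list is already sorted by document number; the function name `intersect` is only illustrative, and this is not a description of Google's actual implementation.

```python
def intersect(a, b):
    """Return the document numbers found in both sorted posting lists."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Same document on both lists: it contains both words.
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # List a is behind, so move it forward.
            i += 1
        else:
            # List b is behind, so move it forward.
            j += 1
    return result


# The two posting lists from the table above.
civil = [3, 8, 22, 56, 68, 92]
war = [2, 8, 15, 22, 68, 77]
print(intersect(civil, war))  # [8, 22, 68]
```

Because each step moves forward on at least one list, the walk never looks at a document number twice.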
An exercise for students

Once you see how to intersect two words in an index, it's not hard to do it for three or more words as well. Here's a fun exercise: try to find all the documents below that contain the words "civil" and "war" and "reconstruction."

civil: 1 9 15 19 22 35 38 48 53 55 65 68 73 78 82 88 91 99

war: 15 18 25 29 31 35 37 40 42 46 48 65 75 85 91 96

reconstruction: 35 42 48 64 73 91 95

The answer is at the end of the article.
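If you would rather check your answer with a program, the sketch below shows one possible way to intersect any number of posting lists in Python. It starts from the shortest list and looks each candidate up in the others with a binary search. The function names and the sample lists ("apple," "pie," "recipe") are made up for illustration, so the exercise is not spoiled.

```python
from bisect import bisect_left


def contains(postings, doc_id):
    """Binary search: is doc_id in this sorted posting list?"""
    i = bisect_left(postings, doc_id)
    return i < len(postings) and postings[i] == doc_id


def intersect_all(posting_lists):
    """Return the documents that appear in every sorted posting list."""
    if not posting_lists:
        return []
    # Start with the shortest list: it has the fewest candidates to check.
    ordered = sorted(posting_lists, key=len)
    shortest, rest = ordered[0], ordered[1:]
    return [doc for doc in shortest if all(contains(p, doc) for p in rest)]


# Hypothetical posting lists, not taken from the article.
apple = [1, 4, 7, 10]
pie = [2, 4, 7, 9, 10]
recipe = [4, 10, 12]
print(intersect_all([apple, pie, recipe]))  # [4, 10]
```

Starting from the shortest list is a common shortcut: a document missing from the rarest word's list cannot be in the final answer, so there is no need to examine it.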
Ranking Results
Now we have the set of pages that contain the user's query somewhere, and it's time to rank them in terms of relevance. Google uses many factors in ranking. Of these, the PageRank algorithm might be the best known. PageRank evaluates two things: how many links there are to a web page from other pages, and the quality of the linking sites. With PageRank, five or six high-quality links from websites such as www.cnn.com and www.nytimes.com would be valued much more highly than twice as many links from less reputable or established sites.
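To make the "quantity and quality of links" idea concrete, here is a small Python sketch of the simplified, textbook form of PageRank. It is not Google's production system. The damping value of 0.85, the number of iterations, and the tiny link graph (page names like `news` and `x1`) are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank. `links` maps each page to the pages it links to.

    Every linked-to page must also appear as a key in `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its score evenly among the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no links spreads its score over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical web: "a" has ONE link from the well-linked "news" page,
# while "b" has TWO links from pages that nobody links to.
links = {
    "news": ["a"],
    "a": ["news"],
    "b": ["news"],
    "x1": ["news", "b"],
    "x2": ["news", "b"],
    "x3": ["news"],
}
ranks = pagerank(links)
print(sorted(ranks, key=ranks.get, reverse=True)[:2])  # ['news', 'a']
```

In this toy graph, page `a` ends up well ahead of page `b` even though `b` has twice as many incoming links, because `a`'s single link comes from a page that many others link to.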
But we use many factors besides PageRank. For example, if a document contains the words "civil" and "war" right next to each other, it might be more relevant than a document discussing the Revolutionary War that happens to use the word "civil" somewhere else on the page. Also, if a page includes the words "civil war" in its title, that's a hint that it might be more relevant than a document with the title "19th Century American Clothing." In the same way, if the words "civil war" appear several times throughout the page, that page is more likely to be about the civil war than if the words only appear once.
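The three hints in this paragraph (words next to each other, words in the title, words repeated on the page) can be turned into a toy scoring function. The Python sketch below is only an illustration: the weights 3.0 and 5.0, the function names, and the two sample documents are invented, and real search engines combine many more signals in ways this article does not spell out.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def count_phrase(words, terms):
    """Count how often the query words appear right next to each other."""
    n = len(terms)
    if n == 0:
        return 0
    return sum(1 for i in range(len(words) - n + 1) if words[i:i + n] == terms)


def relevance(query, title, body):
    """Toy relevance score built from three hints (weights are made up)."""
    terms = tokenize(query)
    body_words = tokenize(body)
    score = 0.0
    # Hint 1: how many times the query words appear on the page.
    score += sum(body_words.count(t) for t in terms)
    # Hint 2: the words appear side by side, as a phrase.
    score += 3.0 * count_phrase(body_words, terms)
    # Hint 3: the phrase appears in the title.
    if count_phrase(tokenize(title), terms) > 0:
        score += 5.0
    return score


doc1 = ("The American Civil War",
        "The Civil War began in 1861. The civil war ended in 1865.")
doc2 = ("19th Century American Clothing",
        "Civil servants wore wool coats. The Revolutionary War changed fashion.")
print(relevance("civil war", *doc1))  # 15.0
print(relevance("civil war", *doc2))  # 2.0
```

Both documents contain "civil" and "war," but only the first uses them together and in its title, so it scores much higher.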
An exercise for students

Pretend that you're a search engine. Pick a query like civil war or recycling or whatever you want. Search for the phrase on Google, pick three or four pages from the results, and print them out. On each printout, find the individual words from your query (such as "civil" and "war") and use a highlighter to mark each word with color. Do that for each of the documents that you print out. Now tape those documents on a wall, step back a few feet, and squint your eyes. If you didn't know what the rest of a page said, and could only judge by the colored words, which document do you think would be most relevant? Is there anything that would make a document look more relevant to you? Is it better to have the words be in a large heading or to occur several times in a smaller font? Do you prefer it if the words are at the top or the bottom of the page? How often do the words need to appear? See if you can come up with two or three things you would look for to see if a document matched a query well. This can help students learn to evaluate website relevance the way a search engine would, so that they can better understand why a search engine returns certain results over others.
As a rule, Google tries to find pages that are both reputable and relevant. If two pages appear to have roughly the same amount of information matching a given query, we'll usually try to pick the page that more trusted websites have chosen to link to. Still, we'll often elevate a page with fewer links or lower PageRank if other signals suggest that the page is more relevant. For example, a web page dedicated entirely to the civil war is often more useful than an article that mentions the civil war in passing, even if the article is part of a reputable site such as Time.com.
Once we've made a list of documents and their scores, we take the documents with the highest scores as the best matches. Google does a little bit of extra work to try to show snippets – a few sentences – from each document that highlight the words that a user typed. Then we return the ranked URLs and the snippets to the user as results pages.
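The last two steps, picking the highest-scoring documents and building a snippet, can also be sketched in a few lines of Python. Everything here is an assumption made for illustration: the file names and scores are invented, sentences are split naively on punctuation, and matched words are marked with `**` simply because that is easy to see in plain text.

```python
import heapq
import re


def top_results(scores, k=10):
    """Return the k (url, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


def snippet(text, query, max_sentences=2):
    """Pick sentences that mention a query word and mark those words."""
    terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    if not terms:
        return ""
    # Naive sentence split: break after ".", "!" or "?".
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chosen = [s for s in sentences
              if terms & set(re.findall(r"[a-z0-9]+", s.lower()))]
    pattern = r"\b(" + "|".join(re.escape(t) for t in sorted(terms)) + r")\b"
    joined = " ".join(chosen[:max_sentences])
    return re.sub(pattern, r"**\1**", joined, flags=re.IGNORECASE)


# Hypothetical scores for three documents.
scores = {"a.html": 2.0, "b.html": 15.0, "c.html": 7.5}
print(top_results(scores, k=2))  # [('b.html', 15.0), ('c.html', 7.5)]

page = ("Clothing changed a lot in the 1800s. The Civil War began in 1861. "
        "Many towns were rebuilt afterward.")
print(snippet(page, "civil war"))  # The **Civil** **War** began in 1861.
```

Using `heapq.nlargest` avoids sorting every matching document when only the first page of results is needed.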
As you can see, running a search engine takes a lot of computing resources. For each search that someone types in, over 500 computers may work together to find the best documents, and it all happens in under half a second.
| Did you know? On April 1, 2002, we spoofed our PageRank algorithm by presenting a detailed explanation of " |
[Answer: Only documents 35, 48, and 91 contain all three words "civil" and "war" and "reconstruction."]
Matt Cutts is a software engineer in the quality group at Google. He spends his days trying to help good sites rank where they should and developing techniques that keep deceptive or spammy sites from showing up in Google's search. He also has a web log at http://www.mattcutts.com/blog/ that often discusses webmaster issues.
Other questions? Send us a note at librarian-newsletter@google.com. In every newsletter, we'll try to answer one or two of the most frequently asked questions.