How in the world does Google search billions and billions of webpages and still return results in half a second? And who decides which results are most relevant?
The Early Web
When the web first got started in 1993, there were two ways to find a website: knowing the address, or following a link from another page. While there were pre-web search engines, such as Archie, Veronica, and Jughead, they were used to find and retrieve files whose names the user already knew.
Search engines quickly sprang up. Excite, Yahoo, WebCrawler, Lycos, Infoseek, AltaVista…there was a lot of competition! Unfortunately, limited by the technology of the time, most of these search engines weren't very good. Then, in 1997, along came Google, and a new era began.
Managing the Web
So how does a search engine work? There are essentially two ways to create a list of what’s available on the internet.
The first is for the owners of what we call a web directory to list sites of interest themselves. Yahoo! started out this way; it was originally a listing of the author’s favorite websites. Today, the best known site of this kind is the Open Directory Project, which has nearly five million sites in over a million categories, and close to ninety thousand editors. Anyone can submit a site to be indexed, but it must be approved by an editor in the appropriate categories. While web directories may have a search function, the purpose is simply to list quality sites in each category and allow the user to browse through them, rather than to find the best results for a given query.
The second type of site, which is what we usually mean by the term search engine, is a database created by bots crawling the internet; the engine then attempts to find the pages that best match the user's query. The first full-text search engine, WebCrawler, indexed the entire text of every page starting in 1994. Later search engines, starting with AltaVista in 1995, gained the ability to interpret user queries rather than looking only for an exact match.
Looking at Links
When Google was launched in 1997, it introduced a new concept: ranking a page based on inbound links. Prior to this, results were ordered based solely on on-page factors, specifically matching text and meta tags. The predictable result: keyword stuffing. Unscrupulous website owners would stuff hundreds of keywords into the meta tags and the bottom of their pages for the sole purpose of getting those sites to rank more highly in the search engines.
Google ignored the keyword meta tags entirely. Instead, they looked at whether and how a page was being linked to. Pages with lots of links were considered to be more important, and the anchor text (the words that you click on) was used to determine what the page was about. On-page content was still the most important thing, but links now played a vital role. Google's famous PageRank algorithm assigned an importance to every page on the internet.
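The core idea behind PageRank — a page is important if important pages link to it — can be sketched in a few lines. This is a simplified, illustrative version, not Google's actual implementation; the tiny three-page "web" and the damping factor of 0.85 are assumptions chosen for the example.

```python
# Hypothetical mini-web: each page maps to the pages it links to.
links = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
}

def pagerank(links, damping=0.85, iterations=50):
    """Iteratively distribute each page's rank across its outbound links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start with equal importance
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for page, outlinks in links.items():
            share = rank[page] / len(outlinks)  # split rank among outlinks
            for target in outlinks:
                new[target] += damping * share
        rank = new
    return rank

ranks = pagerank(links)
# "home" has the most inbound link weight, so it ends up with the highest rank
```

Note that the ranks always sum to one: importance is never created, only redistributed along links, which is why a link from a high-ranked page is worth more than one from an obscure page.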
Web spiders, which crawl the web to create the databases used by search engines, are simply programs that request web pages just like regular browsers. Spiders revisit previously crawled pages to find new content, and also follow links (and sitemaps) to find new pages. How often a site is crawled varies; high-traffic sites that are updated frequently, such as newspaper sites, are obviously crawled more often than sites that rarely change.
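The traversal a spider performs is essentially a breadth-first search over links. The sketch below runs over an in-memory stand-in for the web (the URLs and page text are made up for illustration); a real spider would issue HTTP requests and parse HTML at the marked line.

```python
from collections import deque

# Toy "web": each URL maps to (page text, outbound links). Hypothetical data.
web = {
    "a.com":      ("front page", ["a.com/news", "b.com"]),
    "a.com/news": ("daily news", ["a.com"]),
    "b.com":      ("another site", []),
}

def crawl(start):
    """Breadth-first crawl: fetch a page, record it, queue its links."""
    seen, queue, pages = set(), deque([start]), {}
    while queue:
        url = queue.popleft()
        if url in seen:          # never fetch the same page twice
            continue
        seen.add(url)
        text, outlinks = web[url]  # stands in for an HTTP fetch + HTML parse
        pages[url] = text
        queue.extend(outlinks)     # follow links to discover new pages
    return pages

fetched = crawl("a.com")  # discovers all three pages from a single start URL
```

The `seen` set is what keeps a real crawler from looping forever on pages that link to each other, and the queue is where politeness policies (crawl rate, robots.txt) would hook in.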
Returning Search Results
After Google’s spiders crawl the web, they build an index of all the retrieved information. (In the past, updating the index happened once every few months, but now Google crawls the web and updates the index continuously.) Rather than search every document when a query is made, Google has a great number of computers that hold lists of pages containing a given word. The example Google provides is that if someone searches for civil war, they can return every result that appears on both the list for civil and the list for war.
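This word-to-pages structure is known as an inverted index, and the civil war example boils down to a set intersection. Here is a minimal sketch, using three made-up documents; real indexes also store positions, handle stemming, and shard the lists across many machines.

```python
# Hypothetical documents: id -> text.
docs = {
    1: "the civil war began in 1861",
    2: "war and peace",
    3: "civil engineering handbook",
}

# Build the inverted index: word -> set of document ids containing it.
index = {}
for doc_id, text in docs.items():
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

# A query for "civil war" intersects the two posting lists:
results = index["civil"] & index["war"]  # only document 1 contains both words
```

The key property is that answering a query never touches documents that lack a query word: the work is proportional to the lengths of the posting lists, not to the size of the whole collection.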
Although Google may tell you that they have millions and millions of results for your search, they never actually create the full list; instead, they just retrieve the most relevant results. Google uses over a hundred factors to rank each page and tries to return pages that are both relevant and reputable, which is why highly trusted sites like CNN and Wikipedia often show up at the top of the search results, but brand new sites can show up there as well if they have highly targeted information.
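One way to picture how relevance and reputation trade off is as a weighted score. The sketch below is purely illustrative — the URLs, scores, and weights are invented, and Google's real ranking combines hundreds of signals rather than two — but it shows how a highly targeted new page can outrank a trusted but less relevant one.

```python
# Hypothetical candidates: (url, relevance to query, site reputation), 0-1 scale.
candidates = [
    ("trusted-news.com/story",      0.90, 0.80),
    ("brand-new-blog.com/deep-dive", 0.95, 0.20),
    ("big-famous-site.com/other",   0.10, 0.90),
]

def score(relevance, reputation, w_rel=0.7, w_rep=0.3):
    """Combine two signals; the weights here are arbitrary assumptions."""
    return w_rel * relevance + w_rep * reputation

ranked = sorted(candidates, key=lambda c: score(c[1], c[2]), reverse=True)
# The famous but off-topic page lands last: reputation alone can't rescue
# a page that doesn't match the query.
```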
Every major search engine now offers vertical search, where you search only for content of a particular type or from a particular area. For example, Google Scholar searches academic papers, while Google Image Search, surprisingly enough, returns images. Universal Search is what Google calls their new search results, which often offer a mixture of standard search, image, and shopping results.
Google has their own channel on YouTube, where they’ve posted a number of videos about how various things work.
What Comes Next?
Search is a continually evolving field, as the search engines compete to return more relevant results in less time, while also fending off spammers who attempt to rank irrelevant sites simply to make money off of advertising. Google already returns results in half a second, but claims that the new instant search, which displays results as you type, makes finding what you’re looking for even faster.
In the meantime, the major search engines are also experimenting with having humans answer questions; probably the best-known is Yahoo! Answers. How search engines work is constantly changing, but with everything that’s currently available, it’s never been so easy to find exactly what you’re looking for.