A Short History of Search Engines from Archie to Google

Google has become so dominant in the world of search that it is easy to forget that there were search engines before Google. In fact, there were search engines before the web.

And what many people believe was the very first search engine was developed in Canada. It was called Archie, and it was the creation of a McGill University graduate student named Alan Emtage. In 1989, even before the birth of the World Wide Web, he developed a way to search and index public internet-based archives. He called it Archie ("archive" without the "v"). He neglected to tell the people running the university's computer science program what he was up to, but Archie was an instant hit among students and other early adopters of the net.

By 1992, Archie's server was using about half the available bandwidth in eastern Canada. A few thousand devoted users were asking about 50,000 queries a day. In the pre-web world, Archie was an effective, if cumbersome, way of finding documents on the internet. If you typed in the name of the file you were looking for, it could look around the net and find it, though when things got busy on weekday afternoons, it could take Archie several hours to respond to your query.

But as the popularity of the web exploded in the middle of the decade, going from 130 sites in 1993 to more than 600,000 in 1996, the race was on for a search engine nimble enough to keep up with that rate of growth.

Crawl, index, rank

Essentially, search engines do three things. First, they crawl the web using "spiders" or "bots" (short for robots). Early search engines were limited to crawling only the titles of documents, but in 1994, a search engine named WebCrawler became the first whose spider read the entire text of a page. How fast and how often the crawlers can crawl the web is an important factor in determining how useful your search results will be.

The information the spider gathers then goes into an index. The index is the equivalent of the card catalogue in a library, which imposes order onto the chaos of a large number of books and periodicals. And just as the card catalogue is limited to the holdings of that particular library, the index is limited to those pages that the crawler was able to crawl. When you type a query into your search engine, the results are not reflective of the entire web, but only that part of the web contained in the search engine's index. The bigger the index, the better the search results.

Those first two steps happen behind the scenes. The final step is the one we see when we type in a query. There is some kind of interface, some descriptive text, and most importantly, some kind of ranking system that has been imposed by the designers of the search engine. We might not understand why the sites were ranked the way they were, but if the first few listings are relevant to our search, we go away happy. If they are not, we think about trying another search engine.
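To make those three steps a little more concrete, here is a minimal sketch in Python. Everything in it is invented for the illustration: a handful of hard-coded "pages" stands in for a real crawler, and the ranking uses the crude keyword-counting approach the early engines relied on, not any particular engine's actual design.

```python
from collections import defaultdict

# Step 1: crawl. A tiny, hard-coded "web" stands in for a real crawler here;
# the page addresses and contents are made up for the example.
pages = {
    "jaguar-cars.example":   "visit the jaguar dealership for a test drive today",
    "big-cats.example":      "the jaguar is a large cat native to the americas",
    "running-shoes.example": "running shoes running shoes running shoes on sale",
}

# Step 2: index. Build an inverted index that maps each word to the pages
# containing it, along with how often the word appears on each page.
index = defaultdict(dict)
for url, text in pages.items():
    for word in text.split():
        index[word][url] = index[word].get(url, 0) + 1

# Step 3: rank. Early engines leaned heavily on keyword frequency, so a page
# that repeats a query term many times rises to the top of the results.
def search(query):
    scores = defaultdict(int)
    for word in query.lower().split():
        for url, count in index.get(word, {}).items():
            scores[url] += count
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(search("jaguar"))         # both pages score 1: the engine cannot tell the car from the cat
print(search("running shoes"))  # keyword repetition alone pushes this page to the top
```

Even this toy version hints at the trouble described below: the index can find the word "jaguar", but it has no way of knowing which jaguar the user meant, and a page that simply repeats a keyword climbs the ranking.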

Search gets hot

The first search engine to do all three steps effectively was AltaVista, which was born in California in December 1995. Up to that point, search engines had been limited to a single crawler, but no one spider was capable of keeping up with the thousands of new sites that were coming online every week. AltaVista was able to unleash a thousand spiders crawling in parallel formation across the web, allowing it to store millions of documents, virtually the entire web, in its index. By 1997, AltaVista was handling more than twenty million queries a day, was attracting millions of dollars in sponsorship revenue, and was the most popular search engine on the web.

The business model for search was still uncertain, but something big was clearly brewing, and in the heady days of the first dot-com boom, there was lots of Silicon Valley venture capital available to invest in search. By 1999, Lycos, HotBot, Excite and Yahoo (which was still using humans to build its search index) were all battling AltaVista for search engine supremacy.

But even as search was becoming an increasingly vital tool for both users and marketers trying to sell products and services online, the search experience itself was continuing to degrade. Most search engines relied heavily on scanning keywords in the documents they were crawling. But they were not very good at distinguishing the meaning of the words they scanned, or judging the value of the documents in which they found them.

So a query for "jaguar" would have trouble distinguishing between the car and the animal, and search rankings were largely determined by how often the keywords appeared on the site, not whether the site could be trusted as an authoritative source. As a result, users were growing increasingly frustrated by the poor quality of the results their search engines were delivering.

Search goes to market

Marketers were also beginning to understand the enormous potential of search to deliver customers to websites. In 1993, only 1.5% of web servers were on dot-com domains. By 1997, more than 60% were. But web commerce needed search engines that could find your site and then rank it near the top of the results page. Users are notoriously reluctant to scroll down past the first few sites they see on their screen. The need to understand what was happening inside search engines gave birth to a new industry known as search engine optimization (SEO), whose practitioners promised they could help individuals and companies selling on the web climb to the top of the search rankings.

The problem was there wasn't much law enforcement in the early days of the web, no sheriff to make sure everyone was playing by the rules. And many people weren't. Those early search engines were easy to fool. You could put white text on a white background that was invisible to a regular user, but visible to a search engine spider. Then you'd write the words "running shoes" a few hundred times on the page. The spider would see it and give a big boost to your site. But once users got there, they might discover the site was actually about pornography rather than running shoes. And even if it was about shoes, users would have no idea they were the victim of a spammer. Some search engines even allowed marketers to pay their way to a top ranking, without disclosing that information to users.

Larry and Sergey to the rescue

By the end of 1997, the top search engines were crawling as many as 100 million documents, but the web was in danger of collapsing under its own weight. Without a search engine that could yield relevant results for users and consumers, and turn black hats into white hats, the future did not look bright. The extraordinary potential of the world wide web to make vast treasure troves of information available to everyone would never be realized.

Enter Sergey Brin and Larry Page, two Stanford graduate students looking to solve one of computer science's most intriguing challenges. In a ground-breaking paper published in 1998 called "The Anatomy of a Large-Scale Hypertextual Web Search Engine", they began by reviewing the problems confronting existing search engines.

Lists maintained by humans were too subjective, too expensive to build and maintain, and incapable of covering highly esoteric topics, but automated engines that relied on keyword matching returned too many "junk results". A search for "Bill Clinton" on one of the commercial search engines turned up "Bill Clinton Joke of the Day" as its number one result. Three of the four top search engines couldn't even find themselves when their names were typed into their search bars.

In addition to low quality results, Brin and Page believed existing search engines were incapable of protecting users against spammers and the "numerous companies which specialize in manipulating search engines for profit". They were also concerned about search engines that accepted advertising as part of their business model. "We expect that advertising funded search engines will be inherently biased towards the advertisers and away from the needs of the consumers". And finally, they believed existing search engines were not capable of scaling up to answer the hundreds of millions of queries per day that they boldly predicted users would be making by the year 2000.

Clearly a new, more sophisticated, search engine would be needed to save the web from its own success.

Enter the algorithm

To build a better search engine, you need to give it a better brain, and the brain of a search engine, the source of all its extraordinary computational power, is its algorithm.

The story of algorithms goes back to Mohammed ibn-Musa al-Khwarizmi (c. 780–850), a mathematician in the royal court in Baghdad in the ninth century. His signature work, The Compendious Book on Calculation by Completion and Balancing, used systematic rules and geometric arguments to solve mathematical problems, rather than abstract notations. When the book was translated into Latin in the 12th century, al-Khwarizmi became known as Algoritmi; a new field of mathematics called algebra was born, and so too was a new tool for problem-solving called the algorithm.

To understand what an algorithm is, it is not necessary to delve into the realm of higher mathematics. In fact, mathematicians have been arguing about what the word means for the past 200 years. For Arnold Rosenbloom, who teaches computer science at the Mississauga campus of the University of Toronto, an algorithm is simply "a sequence of instructions written in enough detail that you can reasonably expect another person to carry them out to achieve a certain goal".

By that definition, Rosenbloom argues, an apple pie recipe can be considered an algorithm, so too can instructions on how to change a light bulb. As long as the instructions are written down, and are detailed enough that another person would be able to complete the required task, it's an algorithm.

Of course, most algorithms are expressed numerically, and today, most are written for computers. A computer program is simply an algorithm written for a computer rather than a human. The big difference lies in the level of detail required. A "pinch of salt" might be an adequate instruction for a person completing a recipe, but a computer would require more precision.
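The gap in precision is easy to see if the same instruction is written both ways. The little sketch below is purely illustrative; the conversion figure is an assumption chosen for the example, not a culinary standard.

```python
# For a human cook, "add a pinch of salt" is detailed enough.
# A computer needs the same step spelled out exactly.
PINCH_IN_GRAMS = 0.36  # assumed value, chosen only for this example

def salt_to_add(pinches):
    """Return the amount of salt to add, in grams."""
    return round(pinches * PINCH_IN_GRAMS, 2)

print(salt_to_add(1))  # 0.36 grams: the "pinch", made precise enough for a machine
```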

And as computers have become faster and more powerful, and as the costs of storage and bandwidth have plummeted, there is virtually no limit to the size and complexity of computer algorithms. And it's likely that no algorithm in recent times has had as significant an impact on the world as the one developed by Sergey Brin and Larry Page to solve the problems of search.

PageRank

Brin and Page's algorithm was designed to allow web spiders to crawl hundreds of millions of pages, index them and rank them in a way that would answer users' queries with high-quality, relevant results. Their key insight was borrowed from the world of academia.

To a large extent, the value of an academic paper is determined by how often other experts in the field reference it in their own work. In books or journal articles, those references appear in the form of footnotes and citations. But on the web, they appear as links and the text surrounding those links. Brin and Page argued that one important way of measuring the quality of a web page would be to look at how many people were linking back to it, and who those people were. If respected scholars linked to your site, the algorithm would give that more weight (known as "link juice") than if the links came from undergraduates, and would rank your site more highly.

Their new algorithm was called PageRank (after Larry) and it instantly began returning better results than the other commercial search engines. When a user typed "Bill Clinton" into Google, the name they gave to their new search engine, the first result was not a joke of the day site, but whitehouse.gov.

By focusing on links rather than keywords, Brin and Page had shifted search results away from keyword quantity to page quality. And quality, as defined by the authority of the people who were linking to you, was hard to fake, since most of the value of your site in Google's eyes was based on what was happening outside your page, and therefore out of your control.
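The underlying idea fits in a few lines of code. The sketch below is a toy version of link-based scoring in Python, with a hand-made link graph; whitehouse.gov comes from the Clinton example above, while the other page names, the damping factor and the iteration count are assumptions made for the illustration, not a reconstruction of Google's actual system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy link-based ranking: a page's score depends on the scores of the pages linking to it."""
    pages = list(links)
    scores = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_scores = {}
        for page in pages:
            # Each inbound link passes along a share of the linking page's own
            # score, so a link from a well-regarded page counts for more.
            inbound = sum(
                scores[other] / len(links[other])
                for other in pages
                if page in links[other]
            )
            new_scores[page] = (1 - damping) / len(pages) + damping * inbound
        scores = new_scores
    return scores

# A hand-made link graph: each page maps to the pages it links out to.
graph = {
    "whitehouse.gov":        ["news.example"],
    "news.example":          ["whitehouse.gov"],
    "blog.example":          ["whitehouse.gov", "news.example"],
    "clinton-jokes.example": ["whitehouse.gov"],
}

for page, score in sorted(pagerank(graph).items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
# whitehouse.gov, with inbound links from every other page, comes out on top;
# the joke site, which nobody links to, ends up at the bottom.
```

Because the score flows through the link graph, stuffing your own page with keywords does nothing to move it up the list.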

Spammers would have a much harder time having their way with Google than they had with other search engines. "We are optimistic," they wrote at the end of their Stanford paper, "that our centralized web search engine architecture will improve in its ability to cover the pertinent text information over time and that there is a bright future for search."

"A bright future for search?" Let's call that one prediction that definitely came true.
