The beginnings of the search engine


Every search engine started small. For the connoisseurs, Voelspriet looks back at the beginning. Read all about the original technology behind the search engines and the reasons they were founded.
 

How it began

Archie, the grandfather of all search engines, was created in 1990. Read all about the first search engine, Archie, and its successors.

A History Of Search Engines: about Archie, Aliweb, Excite and Yahoo
Search Engine Players, A Brief History: outdated information about AltaVista, Excite, Hotbot and Webcrawler


Sergey Brin of the Computer Science Department, Stanford University, wrote, together with Lawrence Page, a paper titled "The Anatomy of a Large-Scale Hypertextual Web Search Engine", published in 1998. This is the extended version of their study. It describes the goals of Google and goes into detail on measuring the popularity of websites, known in the jargon as "PageRank".

The Anatomy of a Large-Scale Hypertextual Web Search Engine

1996: "I have a web robot which is a Java app."
1997: "BackRub is a 'web crawler' which is designed to traverse the web."
1997: "If your question is not answered here [...] call [] and ask for Larry."
1997: "This is a demo of the Google Search Engine. [...] Number of Web Pages Fetched: 24 million"
1998: "Current Repository Size: ~25 million pages"
1998: "Index contains ~25 million pages (soon to be much bigger)"
1998: Google.com
1998: "Google! BETA"
1999: "Google can make you feel lucky!"
1999: "Here is our main logo full size, created using GIMP. If you want to hack on it, here is the XCF file."
1999: "Google Receives $25 Million in Equity Funding"
 




The ailing AltaVista, long the darling of professional researchers, says more in its AltaVista Site White Paper about Scooter, the automated robot that sniffs out websites. (Scooter still exists.) For a long time this document was no longer available on the internet!

AltaVista Site White Paper

Unlocking the true value and promise of the Internet

Introduction

Traversing the Internet has always been a bit like exploring outer space. One could wander indiscriminately and make many useful discoveries. Just as easily, though, hours (thankfully, not light-years) could pass with nothing of value to show for the effort.

Like its figurative cousin, the cyberspace of the Internet is vast; but unlike the world of Voyager and Magellan, the Internet is finite. And while the Internet remains essentially unstructured, it is possible-with enough sophistication and power-to catalogue the entire realm. To index every word on every page of every Web site. To bring order and meaning to an otherwise unwieldy behemoth.

Digital has proved it can be done-and has done it!

The AltaVista site-launched on December 15, 1995-radically altered the way we view and use the Internet. On the surface, one sees a simple searching interface, not unlike other tools available through a standard Web browser. Behind the scenes, however, the world's most sophisticated "web crawler" software and most powerful computers have compiled (and continue to update) a complete index of the entire Internet. For the first time, it is possible to find and retrieve useful information from across the vast expanse of the Internet in seconds.

AltaVista has changed everything. It is no longer necessary to know the address of a particular home page, only to begin following the trail of hyperlinks to your eventual goal. AltaVista takes you to precisely where you want to be from the start-pointing you to relevant Web pages regardless of where they reside on a particular site. You can then follow the links from there as desired. The painstaking task of classifying every Web page into logical groups is a thing of the past, before it even became a full reality. Today, AltaVista puts the entire contents of the Internet at your fingertips, transforming it into a bona fide business, education, and entertainment resource. 

A higher view of the Internet

What exactly is AltaVista? It is a whole new class of Internet technology, developed in the research laboratories of Digital Equipment Corporation. To better understand AltaVista, let's first look at it from the user's perspective.

 

Finding useful information on the Internet in seconds

AltaVista, from a user's point of view, is a query system for finding useful information on the Internet. Accessible through the World Wide Web from any standard Web browser, AltaVista provides a simple interface for entering a few words pertaining to your topic of interest. AltaVista then produces a prioritized list of all the Web pages that contain at least one mention of the word or words in your inquiry-the more mentions, the higher the priority in the listing. What's more, each reference in the list is hyperlinked to the actual Web page, so you simply click and you're there. Bear in mind, the entire process takes only seconds.

Want to know the annual rainfall in Nepal? How about the latest earnings from Fifth Third Bancorp? Care to hear your favorite musician talk about her latest release? Or watch a clip from her video? Looking for the latest breakthrough on osteomalacia? Need data on the effects of carbon monoxide on evergreens? Want to find an old friend? Or catch up on the news from down under?

If it's on the Internet, you can find it in seconds using AltaVista. Moreover, if something new appears on the Internet tomorrow-or the next day, or any day-AltaVista will know about it. That's because AltaVista maintains a comprehensive database of every word on every Web page on the Internet-and this database is growing constantly.

At last count, the AltaVista database contained 11 billion words indexed from over 22 million Web pages. This is a database of practically everything on the Internet (some pages are excluded, as discussed below). And it's accessible in an instant.

 

How do they do that?

The technology behind AltaVista is truly revolutionary. It's a combination of super-sophisticated software and super-fast computers.

Collecting data and making it useful

The software consists of the query tool described above, a data collector (a.k.a. web crawler, spider, robot), and an indexer. The AltaVista data collector, dubbed Scooter, is the fastest known web crawler in existence. Scooter looks at 2.5 million Web pages per day, every day, and brings back the contents of those pages to its host computer for indexing.

Scooter is known as a "polite" web crawler; that is, it obeys the rules of the Standard for Robot Exclusion. This means that Scooter checks a special file at each Web site before visiting any of its pages. This file may contain a listing of certain pages that the site's Webmaster does not want traversed by a web crawler. Scooter will not fetch any pages on the list.
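The check described above can be sketched with Python's standard urllib.robotparser module. This is only an illustration of the Standard for Robot Exclusion, not AltaVista's own code: the robots.txt content, the example.com URLs, and the use of "Scooter" as a user-agent string are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an imaginary site (an assumption).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text directly, without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, url: str, agent: str = "Scooter") -> bool:
    """Return True if the exclusion rules allow `agent` to fetch `url`."""
    return parser.can_fetch(agent, url)


parser = build_parser(ROBOTS_TXT)
allowed = may_fetch(parser, "http://example.com/index.html")
blocked = may_fetch(parser, "http://example.com/private/report.html")
```

A polite crawler would run a check like may_fetch before every request and skip any page for which it returns False.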

Scooter is polite in another respect, too. It accesses and fetches thousands of Web pages simultaneously. Yet it imposes minimal load on a Web server, to avoid inconveniencing the site in any way. To accomplish this, Scooter waits after performing a fetch before it retrieves another page from the same site. By invoking a delay that is 100 times the duration of the fetch, Scooter accesses slower systems much less frequently than fast ones. In fact, Scooter never uses more than 1% of the resources of a given system while it is retrieving pages.
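The arithmetic behind the 1% figure can be checked with a small sketch. It assumes the simplest reading of the text: after a fetch that took t seconds, the crawler waits 100 × t seconds before contacting the same site again. The function names are hypothetical and chosen for illustration only.

```python
def politeness_delay(fetch_seconds: float, factor: float = 100.0) -> float:
    """Seconds to wait before the next request to the same site."""
    return fetch_seconds * factor


def load_fraction(fetch_seconds: float, factor: float = 100.0) -> float:
    """Fraction of wall-clock time the crawler keeps one site busy."""
    delay = politeness_delay(fetch_seconds, factor)
    return fetch_seconds / (fetch_seconds + delay)


fast_site_wait = politeness_delay(0.25)  # quick server: wait 25 seconds
slow_site_wait = politeness_delay(3.0)   # slow server: wait 300 seconds
share = load_fraction(3.0)               # 1/101 of the time, just under 1%
```

Because the delay scales with the fetch time, the busy fraction is 1/101 regardless of how fast or slow the server is, which is consistent with the claim of never exceeding 1%.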

Now, what do we do with all this content? Enter the Ni2 indexing software. Ni2 indexes an astounding 1 gigabyte of text per hour, producing links to every word on every Web page brought back by Scooter. The Ni2 index is the key software that allows you to enter a few words in the query interface and instantly retrieve a listing of relevant Web pages.
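The white paper does not describe Ni2's internal data structures. A common way to get "links to every word on every Web page" is an inverted index, and the sketch below shows that general idea under that assumption. The page texts, URLs, and function names are made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages: dict[str, str]) -> dict[str, dict[str, list[int]]]:
    """Map each word to the pages containing it and its positions there."""
    index: dict[str, dict[str, list[int]]] = defaultdict(lambda: defaultdict(list))
    for url, text in pages.items():
        for position, word in enumerate(tokenize(text)):
            index[word][url].append(position)
    return index


# Made-up pages standing in for content a crawler brought back.
pages = {
    "http://example.com/a": "Annual rainfall in Nepal",
    "http://example.com/b": "Rainfall data and rainfall maps",
}
index = build_index(pages)
rainfall_hits = index["rainfall"]  # both pages, with word positions
```

Looking up a query word then becomes a single dictionary access instead of a scan over every stored page.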

One of the most important features of Ni2 is its ranking system-the means of prioritizing the results of a query. Ni2 looks through the documents in the index and considers each document that includes at least one of the words in your query. Then, using a method known as "collection frequency weighting," it calculates a score for each matching document, placing the most relevant and useful documents at the top of the list.
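The exact formula used by Ni2 is not given in the text. As a rough sketch, the example below assumes the common form of collection frequency weighting, in which a word found in n of N documents gets the weight log(N / n), and multiplies that weight by the number of mentions in each document. The documents and the rank function are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query: str, docs: dict[str, str]) -> list[tuple[str, float]]:
    """Score every document that contains at least one query word."""
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    total = len(docs)
    scores: dict[str, float] = {}
    for term in set(tokenize(query)):
        containing = [name for name, c in counts.items() if term in c]
        if not containing:
            continue
        # Rarer words weigh more; a word in every document weighs zero.
        weight = math.log(total / len(containing))
        for name in containing:
            scores[name] = scores.get(name, 0.0) + counts[name][term] * weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


docs = {
    "a": "rainfall in nepal",
    "b": "rainfall rainfall rainfall in spain",
    "c": "hotels in spain",
}
results = rank("nepal rainfall", docs)
```

In this toy collection, document "a" outranks "b" even though "b" mentions "rainfall" more often, because "nepal" is the rarer and therefore more heavily weighted word; "c" matches neither word and is not listed.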

Providing results in an instant

The AltaVista software is optimized for Digital's 64-bit Alpha technology, which enables the query interface, Scooter, and Ni2 indexer to perform at unbelievable speeds.

To process inquiries from the AltaVista site, we use a pair of AlphaStation 250 4/266 systems, each with 256 MB of RAM and 4 GB of hard disk. Running on the AlphaStation systems is a custom multi-threaded Web server, which sends queries to the Web indexer and News indexer. With just these two relatively small systems we easily handle up to six million hits per day to the AltaVista site. About 90% of the queries are to search the Web and approximately 10% for newsgroups.

Scooter runs on a DEC3000/900 AlphaStation with 1 GB of RAM and 30 GB of hard disk with RAID 5 to ensure data integrity. The sole job of this computer is crawling the Web, fetching content and sending it to the Web indexer.

The Ni2 Web indexer runs on an AlphaServer 8400 5/300 (a.k.a. TurboLaser), which includes 10 processors, 6 GB of RAM, and 210 GB of hard disk with RAID 5. The TurboLaser is the most powerful computer built by Digital, holding a Web index that is larger than 30 GB, while providing responses to most requests in less than a second.

Our News Server runs on an AlphaStation 400 4/233 system, with 160 MB RAM and 24 GB of hard disk with RAID 5. This server maintains a current news spool for the News Indexer and serves the articles via http to those users who simply want to read news using the ease of their standard Web browser.

The News Indexer runs on an AlphaStation 250 4/266 system, with 196 MB of RAM and 13 GB of hard disk. This machine keeps an up-to-date index of the news spool, handling the constant turnover of thousands of news articles to ensure the most current information is presented when you make a query.

 

AltaVista-A Brief History

Where did AltaVista come from? AltaVista emerged from Digital's Palo Alto research laboratories in the spring of 1995. Begun as a way to demonstrate the sheer power of the AlphaServer 8400 TurboLaser computer running an Oracle database, the demo quickly bloomed into a research challenge to achieve the impossible.

Defying common wisdom, a small band of researchers in Palo Alto set out to index the entire Web-something long considered unobtainable. Yet, with benchmark results showing the TurboLaser and Oracle performing 100 times faster than any of the nearest competitors, nothing seemed impossible.

 

The right place at the right time

It was a research environment that fostered the growth of an innovative idea. And it was the personality of one research facility in particular that transformed that idea into a practical reality. Digital's Palo Alto lab is filled with forward-looking scientists and engineers-people who feel an urgency to move their projects into the real world as products or technology demonstrations.

The original idea for AltaVista came from Paul Flaherty, who saw it as a way to showcase the Very Large Memory Database capability of the TurboLaser/Oracle duo by indexing the Internet. With Digital's experience in 64-bit computing and collective expertise with Internet technologies, the foundation was already laid. The state of the Internet was in high growth, but still without much structure. So the timing was ideal. What was needed was a way to fetch Web pages and index them.

Louis Monier developed the super-spider, Scooter, from scratch with the sole intention of making AltaVista a reality. This software is simply the result of Louis' skill and the brain trust of Digital's research labs. All the resources needed to create the world's fastest and most complete web crawler were there in one place. In fact, there is probably nowhere else in the world that could have supported this effort.

The Ni2 Web indexing software started out as a way for Mike Burrows to organize his personal e-mail. Then, in response to a dare back in 1991, he refined the code to index all of the articles in all of the newsgroups on the Internet. Mike's original approach scaled very well and proved capable of handling an enormous amount of text-a million documents could be indexed. With a bit more "tinkering," he revised the code further, making the indexer useful and fast in dealing with an even larger set of documents. It was just good timing that Mike finished the second incarnation of the indexer-Ni2-when the AltaVista project was taking off. It proved to be the perfect test of his new software with the huge set of data that Scooter was returning from the Web.

Through a combination of ideal research conditions and excellent timing, AltaVista had arrived.

 

AltaVista-A Long Future

AltaVista grew out of pure research, but it was no accident. It was born of an environment that values creativity and rewards practicality.

AltaVista, today, is without doubt a showcase for the technological leadership of Digital computers and software. And it will remain so. But it also holds fundamental value as a tool for using the Internet in our everyday business and personal lives. Its sudden and remarkable popularity-attracting over six million hits per day without any promotion-is a testament to that fact.

AltaVista marks a turning point in the way we view and use the Internet. It makes it possible for anyone to gain value from resources on the Internet, without wasting hour after hour in the process. It also has implications for how Web sites are structured. For instance, the home page, which has been the traditional first point of entry to a site-defining it, setting the tone, and providing links to further detail-may never be seen by the typical visitor. Through AltaVista, Web travelers could land on a page anywhere in a particular site, based on their specific interest. With AltaVista, in fact, the entire Web is treated as one huge site-the only home page left is AltaVista itself. It has proved to be the most natural starting point of all.

 

More of a good thing-mirror sites

Where do we go from here? With the tremendous growth in popularity of the single AltaVista site in Palo Alto-the number of daily hits grew by more than 2,000 percent in just three months-the natural course of action is to deliver more of a good thing. We will start by establishing mirror sites around the world.

An AltaVista mirror site exactly replicates the index and search capabilities of the Palo Alto site. Mirror sites will be licensed using a franchise model, which guarantees exclusive rights for the owner to provide AltaVista services within a geographical territory. While mirror sites will harness AltaVista technology, the format and layout of their Web pages will be managed by the mirror site owner. Anyone using a mirror site will still recognize the AltaVista "look and feel"-with the same quality, performance, and availability that make the original site so useful.

With the development of mirror sites, AltaVista will be available to more Internet users than ever before, all around the world-truly establishing it as the leading Internet search engine. Particularly for users outside of the U.S., response times will improve significantly. And with geographic distribution of mirror sites, we can support local languages and reflect the culture of the region.

 

Getting even more useful information

The research environment that produced AltaVista is alive and brimming with new ideas, new technological breakthroughs. This is sure to produce expanded capabilities for the existing site and future mirror sites.

Inevitably, the Web index will continue to grow, as more sites appear and as Scooter penetrates the Web even deeper than it is currently. In fact, rather than performing periodic crawls, Scooter will soon crawl the Web continuously, updating the index even more frequently than is possible today. In addition, Digital researchers are tackling the difficult challenge of indexing Asian Web pages, where multiple character encodings present unique difficulties. As we have seen, though, nothing is impossible. And the result of this extra effort simply means a greater wealth of information and resources available to more and more Internet users.

Another enhancement to the site will be new ease-of-use features to assist the user in refining queries for more precise results. For example, a query on "bonds" might prompt the user to clarify:

financial bonds

adhesives

James Bond

etc.

Through such prompting and refinement, AltaVista will make it simpler for a new user and more effective for advanced users to produce valuable results. Another refinement technique may take advantage of "query builds"-using the query mechanism for multiple searches built on each other. By successively refining each query, the user can narrow down the number of matches and, again, focus in on the most pertinent results.
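The paper gives no details of how "query builds" would work. One plausible reading is that each new query is applied only to the matches of the previous one, which amounts to a set intersection. The sketch below illustrates that reading; the documents and function names are assumptions.

```python
def matches(term: str, docs: dict[str, str]) -> set[str]:
    """Names of documents that contain the term as a whole word."""
    wanted = term.lower()
    return {name for name, text in docs.items() if wanted in text.lower().split()}


def refine(previous: set[str], term: str, docs: dict[str, str]) -> set[str]:
    """Keep only the earlier matches that also contain the new term."""
    return previous & matches(term, docs)


docs = {
    "finance": "government bonds and interest rates",
    "glue": "adhesive bonds for wood",
    "film": "james bond films",
}
first = matches("bonds", docs)            # matches "finance" and "glue"
second = refine(first, "interest", docs)  # narrowed down to "finance"
```

Each refinement can only shrink the result set, which is how successive queries narrow down the number of matches.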

 

Adding value to commercial content

There is also life for AltaVista beyond the search site. The powerful search and indexing technology in AltaVista is written in portable code, which has huge implications for where and how this technology can be utilized.

One of the first likely applications is as a value-added capability to selected content sites on the Internet. In this scenario, a content site would incorporate AltaVista with some level of added value beyond simply returning a list of Web links. The "raw" data produced by AltaVista would undergo post processing and reformatting by the host site. For example, a publishing company could provide search capabilities across various categories of book titles. The results of a search might produce a listing of matching titles, additional titles by the same author, other selections with similar subject matter, etc. The possibilities here are nearly limitless.

 

AltaVista at work

If we can index the entire World Wide Web, and help the average Internet surfer find practically anything in a matter of seconds, just think what we could do for today's corporations, universities, and government agencies.

With all that Internet technologies have brought to the public arena, many enterprises are recognizing the value of adopting the same capabilities inside their operations-on private intranets. Many have set up TCP/IP networks (the same as the Internet) and are populating desktops with standard Web browsers, such as Netscape.

The number of intranet Web pages is growing ever more rapidly. For anyone on the enterprise network-as on the Internet-spending time just in pursuit of information draws directly away from productive endeavors. And unless information can be located and retrieved, its value diminishes significantly. AltaVista technology will be adapted for enterprises, to help them unlock the full value of all the information on their intranets.

AltaVista Team Edition

A "Team" edition of AltaVista, for instance, will provide an AltaVista query interface allowing browser users to instantly find useful information anywhere on their private intranet. A Scooter-like crawler will crawl the intranet, fetching internal Web pages. In turn, an Ni2-like indexer will compile a complete, full-text index of the entire body of material. The Team edition will only crawl behind the corporate firewall and, initially, will only fetch HTML (Web) information. As the technology evolves, however, all corporate data, regardless of format, will be crawled and indexed.

AltaVista Enterprise Edition

An "Enterprise" edition will extend the reach of the intranet crawl out onto the Internet. This will be a tailored solution, enabling customers to specify a particular subset of the Internet to be crawled. In addition, the Enterprise edition will enable distributed enterprises, with numerous remote, branch, or home workers connected across an Internet backbone, to be included in the crawl. The result will produce a thorough index of all information in an enterprise-unlocking hidden assets and maximizing the value contribution of every individual.

AltaVista Personal Edition

Also planned is a "Personal" edition of AltaVista, designed for the individual desktop. The Personal edition will offer the same robust data gathering, indexing, and query capabilities as the original AltaVista, specially tuned and packaged for personal use. The key here, as with all AltaVista search tools, is speed and depth. No other desktop tool on the market could match the performance of AltaVista. You will have a single interface to query all the information on your desktop-regardless of format-and locate relevant files in seconds. Imagine being able to instantly find a choice piece of data, buried in an e-mail message from three years ago. That's the power of AltaVista brought to your desktop.

 

Conclusion

AltaVista technology has spawned a whole new vision of the Internet, providing a higher view from which to locate and access valuable resources. And it has inspired broader use of Internet technologies across the enterprise and around the world.

Digital is building on the success of AltaVista, with an entire family of software products. Using Internet technology as a common, ubiquitous environment, AltaVista software provides users with dynamic, global capabilities for exchanging information and ideas. It combines the vast resources of the Internet, the navigational ease of standard Web browsers, and the value of existing intranet assets.

AltaVista is truly revolutionary in scope, breaking through traditional barriers that limit communication and information access. It is both a vision and a reality, offering immediate rewards as it inspires new, innovative technology. It is a key that unlocks all the world of the Internet has to offer. So wherever your business takes you, start your journey with AltaVista.