The beginning of the search engine
Every search engine started small. For the connoisseurs, Voelspriet looks
back at the beginning. Read all about the original technology behind the
search engine and the reasons it was founded.
How it began
The grandfather of all search engines, Archie, was created in 1990. Read
all about the first search engine, Archie, and its successors.
A History Of Search Engines: on Archie, Aliweb, Excite and Yahoo
Search Engine Players, A Brief History: outdated information on
AltaVista, Excite, Hotbot and Webcrawler
Sergey Brin of the Computer Science Department, Stanford University,
wrote a document in 2000 on "The Anatomy of a Large-Scale Hypertextual
Web Search Engine". This is the extended version of his study. It
describes the goals of Google and goes into detail on measuring the
popularity of websites, known in the trade as "Page Ranking".
The Anatomy of a Large-Scale Hypertextual Web Search Engine
1996:
"I have a web robot which is a Java app."
1997:
"BackRub is a 'web crawler' which is designed to traverse the web."
1997:
"If your question is not answered here [...] call [] and ask for Larry."
1997:
"This is a demo of the Google Search Engine. [...] Number of Web Pages
Fetched: 24 million"
1998:
"Current Repository Size: ~25 million pages"
1998:
"Index contains ~25 million pages (soon to be much bigger)"
1998:
Google.com
1998:
"Google! BETA"
1999:
"Google can make you feel lucky!"
1999:
"Here is our main logo full size, created using GIMP. If you want to hack
on it, here is the XCF file."
1999:
"Google Receives $25 Million in Equity Funding"

The ailing AltaVista, long the darling of professional researchers,
writes more in the AltaVista Site White Paper about Scooter, the
automated robot that sniffs through websites. (Scooter still exists.)
This document was unavailable on the internet for a long time!
AltaVista Site White Paper
Unlocking the true value and promise of the
Internet
Introduction
Traversing the Internet has always been a bit
like exploring outer space. One could wander indiscriminately and make
many useful discoveries. Just as easily, though, hours (thankfully, not
light-years) could pass with nothing of value to show for the effort.
Like its figurative cousin, the cyberspace of
the Internet is vast; but unlike the world of Voyager and Magellan, the
Internet is finite. And while the Internet remains essentially
unstructured, it is possible-with enough sophistication and power-to
catalogue the entire realm. To index every word on every page of every
Web site. To bring order and meaning to an otherwise unwieldy behemoth.
Digital has proved it can be done-and has done
it!
The AltaVista site-launched on December 15,
1995-radically altered the way we view and use the Internet. On the
surface, one sees a simple searching interface, not unlike other tools
available through a standard Web browser. Behind the scenes, however,
the world's most sophisticated "web crawler" software and most powerful
computers have compiled (and continue to update) a complete index of the
entire Internet. For the first time, it is possible to find and retrieve
useful information from across the vast expanse of the Internet in
seconds.
AltaVista has changed everything. It is no
longer necessary to know the address of a particular home page, only to
begin following the trail of hyperlinks to your eventual goal. AltaVista
takes you to precisely where you want to be from the start-pointing you
to relevant Web pages regardless of where they reside on a particular
site. You can then follow the links from there as desired. The
painstaking task of classifying every Web page into logical groups is a
thing of the past, before it even became a full reality. Today,
AltaVista puts the entire contents of the Internet at your fingertips,
transforming it into a bona fide business, education, and
entertainment resource.
A higher view of the Internet
What exactly is AltaVista? It is a whole new
class of Internet technology, developed in the research laboratories of
Digital Equipment Corporation. To better understand AltaVista, let's
first look at it from the user's perspective.
Finding useful information on the Internet in
seconds
AltaVista, from a user's point of view, is a
query system for finding useful information on the Internet. Accessible
through the World Wide Web from any standard Web browser, AltaVista
provides a simple interface for entering a few words pertaining to your
topic of interest. AltaVista then produces a prioritized list of all the
Web pages that contain at least one mention of the word or words in your
inquiry-the more mentions, the higher the priority in the listing.
What's more, each reference in the list is hyperlinked to the actual Web
page, so you simply click and you're there. Bear in mind, the entire
process takes only seconds.
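As a rough illustration of the ordering described above, and not AltaVista's actual code, the following Python sketch ranks a few pages by how often the query words occur in them. The page texts, the example.com URLs, the `rank_by_mentions` name, and the simple tokenization are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def rank_by_mentions(pages, query):
    """Return (url, mentions) pairs for pages containing at least one
    query word, ordered by total number of mentions (highest first)."""
    query_words = set(tokenize(query))
    results = []
    for url, text in pages.items():
        counts = Counter(tokenize(text))
        mentions = sum(counts[word] for word in query_words)
        if mentions > 0:
            results.append((url, mentions))
    # Sort by mention count, then by URL so ties come out in a fixed order.
    return sorted(results, key=lambda item: (-item[1], item[0]))

# Hypothetical pages, invented for illustration.
pages = {
    "http://example.com/nepal": "Rainfall in Nepal: annual rainfall varies by region.",
    "http://example.com/weather": "Weather report with rainfall totals.",
    "http://example.com/music": "Latest release from your favorite musician.",
}

ranking = rank_by_mentions(pages, "rainfall Nepal")
for url, mentions in ranking:
    print(mentions, url)
```

The page with no query words is left out entirely, and the page with the most mentions comes first.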
Want to know the annual rainfall in Nepal? How
about the latest earnings from Fifth Third Bancorp? Care to hear your
favorite musician talk about her latest release? Or watch a clip from
her video? Looking for the latest breakthrough on osteomalacia? Need
data on the effects of carbon monoxide on evergreens? Want to find an
old friend? Or catch up on the news from down under?
If it's on the Internet, you can find it in
seconds using AltaVista. Moreover, if something new appears on the
Internet tomorrow-or the next day, or any day-AltaVista will know about
it. That's because AltaVista maintains a comprehensive database of every
word on every Web page on the Internet-and this database is growing
constantly.
At last count, the AltaVista database contained
11 billion words indexed from over 22 million Web pages. This is a
database of practically everything on the Internet (some pages are
excluded, as discussed below). And it's accessible in an instant.
How do they do that?
The technology behind AltaVista is truly
revolutionary. It's a combination of super-sophisticated software and
super-fast computers.
Collecting data and making it useful
The software consists of the query tool
described above, a data collector (a.k.a. web crawler, spider, robot),
and an indexer. The AltaVista data collector, dubbed Scooter, is the
fastest known web crawler in existence. Scooter looks at 2.5 million Web
pages per day, every day, and brings back the contents of those pages to
its host computer for indexing.
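To put the quoted figures in perspective, here is a small back-of-the-envelope calculation in Python. It uses only numbers stated in the text (11 billion words, 22 million pages, 2.5 million pages per day). The assumptions that the crawl rate is constant and that no page is fetched twice are ours, so the results are estimates rather than reported measurements.

```python
# Figures quoted in the white paper.
WORDS_INDEXED = 11_000_000_000   # 11 billion words
PAGES_INDEXED = 22_000_000       # 22 million Web pages
PAGES_PER_DAY = 2_500_000        # Scooter's stated crawl rate

# Average page length implied by the index size.
words_per_page = WORDS_INDEXED / PAGES_INDEXED

# Assumption: constant rate and no repeated fetches during one pass.
days_for_full_pass = PAGES_INDEXED / PAGES_PER_DAY
pages_per_second = PAGES_PER_DAY / (24 * 60 * 60)

print(f"average words per page: {words_per_page:.0f}")
print(f"days for one pass over the index: {days_for_full_pass:.1f}")
print(f"average fetch rate: {pages_per_second:.1f} pages per second")
```

Under these assumptions the index averages 500 words per page, and one pass over 22 million pages takes a little under nine days.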
Scooter is known as a "polite" web crawler; that
is, it obeys the rules of the Standard for Robot Exclusion. This means
that Scooter checks a special file at each Web site before visiting any
of its pages. This file may contain a listing of certain pages that the
site's Webmaster does not want traversed by a web crawler. Scooter will
not fetch any pages on the list.
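The "special file" is conventionally named robots.txt. As a minimal sketch of the check a polite crawler performs (not Scooter's own code), the example below uses Python's standard `urllib.robotparser` module. The robots.txt content, the example.com URLs, and the `ExampleCrawler` user-agent name are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would download this
# file from the site before fetching any other page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def may_fetch(url, user_agent="ExampleCrawler"):
    """Return True if the exclusion rules allow this crawler to fetch url."""
    return parser.can_fetch(user_agent, url)

allowed = may_fetch("http://example.com/index.html")
blocked = may_fetch("http://example.com/private/report.html")
print(allowed, blocked)
```

Parsing the rules from a string keeps the example self-contained; in practice `set_url()` followed by `read()` would fetch the file from the site.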
Scooter is polite in another respect, too. It is
simultaneously accessing and fetching thousands of Web pages at a time.
Yet, it imposes minimal load on a Web server, to avoid inconveniencing
the site in any way. To accomplish this, Scooter waits after performing a
fetch before it retrieves another page from the same site. By invoking a
delay that is a factor of 100 times the duration of the fetch, Scooter
accesses slower systems much less frequently than fast ones. In fact,
Scooter never uses more than 1% of the resources of a given system while
it is retrieving pages.
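The 1% figure follows from the delay rule itself: if a fetch takes t seconds and the crawler then waits 100 × t seconds, it occupies the server for t out of every 101 × t seconds. The sketch below works through that arithmetic. The function names are ours, and it assumes the server is busy on the crawler's behalf only while a fetch is in progress.

```python
DELAY_FACTOR = 100  # wait 100 times the fetch duration, as stated in the text

def polite_delay(fetch_seconds, factor=DELAY_FACTOR):
    """Seconds to wait before fetching again from the same site."""
    return fetch_seconds * factor

def load_fraction(fetch_seconds, factor=DELAY_FACTOR):
    """Fraction of wall-clock time the site spends serving the crawler."""
    busy = fetch_seconds
    idle = polite_delay(fetch_seconds, factor)
    return busy / (busy + idle)

fast_site_delay = polite_delay(0.5)   # a fast server answers in 0.5 s
slow_site_delay = polite_delay(2.0)   # a slow server needs 2 s
share = load_fraction(2.0)            # 1/101, whatever the fetch time is

print(fast_site_delay, slow_site_delay, round(share * 100, 2))
```

Because the delay scales with the fetch time, a slow site is visited less often, while the share of its time spent on the crawler stays just under 1% either way.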
Now, what do we do with all this content? Enter
the Ni2 indexing software. Ni2 indexes an astounding 1 gigabyte of text
per hour, producing links to every word on every Web page brought back
by Scooter. The Ni2 index is the key software that allows you to enter a
few words in the query interface and instantly retrieve a listing of
relevant Web pages.
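The white paper does not describe Ni2's internal data structures. A common way to produce "links to every word on every Web page" is an inverted index, which maps each word to the pages (and positions) where it occurs. The following is a minimal sketch of that general technique, with invented page texts and function names.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each word to {url: [word positions]} across all pages."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for position, word in enumerate(tokenize(text)):
            index[word].setdefault(url, []).append(position)
    return index

def lookup(index, word):
    """Return the sorted URLs of pages containing the word."""
    return sorted(index.get(word.lower(), {}))

# Hypothetical pages, invented for illustration.
pages = {
    "http://example.com/a": "Scooter fetches Web pages",
    "http://example.com/b": "The indexer indexes pages fetched by Scooter",
}

index = build_index(pages)
print(lookup(index, "Scooter"))
print(lookup(index, "indexer"))
```

Once the index is built, answering a query is a dictionary lookup rather than a scan of every page, which is what makes near-instant responses plausible.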
One of the most important features of Ni2 is its
ranking system-the means of prioritizing the results of a query. Ni2
looks through the documents in the index and considers each document
that includes at least one of the words in your query. Then, using a
method known as "collection frequency weighting," it calculates a score
for each matching document, placing the most relevant and useful
documents at the top of the list.
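The white paper gives no formula for Ni2's scoring. In the information-retrieval literature, collection frequency weighting usually means giving a query word more weight the rarer it is across the collection, commonly log(N / n) for a word found in n of N documents. The sketch below applies that textbook weighting to a few invented documents. It should be read as an assumption about the general method, not as Ni2's implementation.

```python
import math
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score_documents(docs, query):
    """Score each document containing at least one query word by summing
    log(N / n) over the query words it contains."""
    total_docs = len(docs)
    doc_words = {name: set(tokenize(text)) for name, text in docs.items()}
    scores = {}
    for word in set(tokenize(query)):
        containing = [name for name, words in doc_words.items() if word in words]
        if not containing:
            continue
        weight = math.log(total_docs / len(containing))  # rarer word, higher weight
        for name in containing:
            scores[name] = scores.get(name, 0.0) + weight
    # Highest score first; ties broken by document name.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical documents, invented for illustration.
docs = {
    "doc1": "carbon monoxide effects on evergreens",
    "doc2": "carbon emissions and climate",
    "doc3": "evergreens in winter gardens",
    "doc4": "latest music release",
}

ranking = score_documents(docs, "carbon monoxide evergreens")
for name, score in ranking:
    print(name, round(score, 3))
```

Here "monoxide" appears in only one document, so it counts for more than "carbon" or "evergreens", and the document matching all three words ranks first.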
Providing results in an instant
The AltaVista software is optimized for
Digital's 64-bit Alpha technology, which enables the query interface,
Scooter, and Ni2 indexer to perform at unbelievable speeds.
To process inquiries from the AltaVista site, we
use a pair of AlphaStation 250 4/266 systems, each with 256 MB of RAM
and 4 GB of hard disk. Running on the AlphaStation systems is a custom
multi-threaded Web server, which sends queries to the Web indexer and
News indexer. With just these two relatively small systems we easily
handle up to six million hits per day to the AltaVista site. About 90%
of the queries are to search the Web and approximately 10% for
newsgroups.
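For a sense of scale, the daily figures can be converted into average rates. The calculation below uses the six million hits per day and the 90/10 split quoted above. Treating hits as queries, spreading the load evenly over the day, and dividing it equally between the two machines are simplifying assumptions of ours.

```python
SECONDS_PER_DAY = 24 * 60 * 60

hits_per_day = 6_000_000   # peak daily hits quoted in the text
front_end_machines = 2     # the pair of AlphaStation 250 4/266 systems
web_share = 0.90           # approximate share of Web queries

# Assumption: load spread evenly over the day and over both machines.
avg_hits_per_second = hits_per_day / SECONDS_PER_DAY
avg_per_machine = avg_hits_per_second / front_end_machines

# Assumption: every hit is a query.
web_queries = hits_per_day * web_share
news_queries = hits_per_day - web_queries

print(f"average hits per second: {avg_hits_per_second:.1f}")
print(f"average per machine: {avg_per_machine:.1f}")
print(f"web queries per day: {web_queries:,.0f}")
print(f"news queries per day: {news_queries:,.0f}")
```

On these assumptions the site averages roughly 69 hits per second, or about 35 per front-end machine; real traffic would peak well above the average.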
Scooter runs on a DEC3000/900 AlphaStation with
1GB RAM and 30 GB of hard disk with RAID 5 to ensure data integrity. The
sole job of this computer is crawling the Web, fetching content and
sending it to the Web indexer.
The Ni2 Web indexer runs on an AlphaServer 8400
5/300 (a.k.a. TurboLaser), which includes 10 processors, 6 GB of RAM,
and 210 GB of hard disk with RAID 5. The TurboLaser is the most powerful
computer built by Digital, holding a Web index that is larger than 30
GB, while providing responses to most requests in less than a second.
Our News Server runs on an AlphaStation 400
4/233 system, with 160 MB RAM and 24 GB of hard disk with RAID 5. This
server maintains a current news spool for the News Indexer and serves
the articles via http to those users who simply want to read news using
the ease of their standard Web browser.
The News Indexer runs on an AlphaStation 250
4/266 system, with 196 MB of RAM and 13 GB of hard disk. This machine
keeps an up-to-date index of the news spool, handling the constant
turnover of thousands of news articles to ensure the most current
information is presented when you make a query.
AltaVista-A Brief History
Where did AltaVista come from? AltaVista emerged
from Digital's Palo Alto research laboratories in the spring of 1995.
Begun as a way to demonstrate the sheer power of the AlphaServer 8400
TurboLaser computer running an Oracle database, the demo quickly bloomed
into a research challenge to achieve the impossible.
Defying common wisdom, a small band of
researchers in Palo Alto set out to index the entire Web-something long
considered unattainable. Yet, with benchmark results showing the
TurboLaser and Oracle performing 100 times faster than any of the
nearest competitors, nothing seemed impossible.
The right place at the right time
It was a research environment that fostered the
growth of an innovative idea. And it was the personality of one research
facility in particular that transformed that idea into a practical
reality. Digital's Palo Alto lab is filled with forward-looking
scientists and engineers-people who feel an urgency to move their
projects into the real world as products or technology demonstrations.
The original idea for AltaVista came from Paul
Flaherty, who saw it as a way to showcase the Very Large Memory Database
capability of the TurboLaser/Oracle duo by indexing the Internet. With
Digital's experience in 64-bit computing and collective expertise with
Internet technologies, the foundation was already laid. The state of the
Internet was in high growth, but still without much structure. So the
timing was ideal. What was needed was a way to fetch Web pages and index
them.
Louis Monier developed the super-spider,
Scooter, from scratch with the sole intention of making AltaVista a
reality. This software is simply the result of Louis' skill and the
brain trust of Digital's research labs. All the resources needed to
create the world's fastest and most complete web crawler were there in
one place. In fact, there is probably nowhere else in the world that
could have supported this effort.
The Ni2 Web indexing software started out as a
way for Mike Burrows to organize his personal e-mail. Then, in response
to a dare back in 1991, he refined the code to index all of the articles
in all of the newsgroups on the Internet. Mike's original approach
scaled very well and proved capable of handling an enormous amount of
text-a million documents could be indexed. With a bit more "tinkering,"
he revised the code further, making the indexer useful and fast in
dealing with an even larger set of documents. It was just good timing
that Mike finished the second incarnation of the indexer-Ni2-when the
AltaVista project was taking off. It proved to be the perfect test of
his new software with the huge set of data that Scooter was returning
from the Web.
Through a combination of ideal research
conditions and excellent timing, AltaVista had arrived.
AltaVista-A Long Future
AltaVista grew out of pure research, but it was
no accident. It was born of an environment that values creativity and
rewards practicality.
AltaVista, today, is without doubt a showcase
for the technological leadership of Digital computers and software. And
it will remain so. But it also holds fundamental value as a tool for
using the Internet in our everyday business and personal lives. Its
sudden and remarkable popularity-attracting over six million hits per
day without any promotion-is a testament to that fact.
AltaVista marks a turning point in the way we
view and use the Internet. It makes it possible for anyone to gain value
from resources on the Internet, without wasting hour after hour in the
process. It also has implications for how Web sites are structured. For
instance, the home page, which has been the traditional first point of
entry to a site-defining it, setting the tone, and providing links to
further detail-may never be seen by the typical visitor. Through
AltaVista, Web travelers could land on a page anywhere in a particular
site, based on their specific interest. With AltaVista, in fact, the
entire Web is treated as one huge site-the only home page left is
AltaVista itself. It has proved to be the most natural starting point of
all.
More of a good thing-mirror sites
Where do we go from here? With the tremendous
growth in popularity of the single AltaVista site in Palo Alto-the
number of daily hits grew by more than 2,000 percent in just three
months-the natural course of action is to deliver more of a good thing.
We will start by establishing mirror sites around the world.
An AltaVista mirror site exactly replicates the
index and search capabilities of the Palo Alto site. Mirror sites will
be licensed using a franchise model, which guarantees exclusive rights
for the owner to provide AltaVista services within a geographical
territory. While mirror sites will harness AltaVista technology, the
format and layout of their Web pages will be managed by the mirror site
owner. But for anyone using a mirror site, you will still recognize the
AltaVista "look and feel"-with the same quality, performance, and
availability that makes the original site so useful.
With the development of mirror sites, AltaVista
will be available to more Internet users than ever before, all around
the world-truly establishing it as the leading Internet search engine.
Particularly for users outside of the U.S., response times will improve
significantly. And with geographic distribution of mirror sites, we can
support local languages and reflect the culture of the region.
Getting even more useful information
The research environment that produced AltaVista
is alive and brimming with new ideas, new technological breakthroughs.
This is sure to produce expanded capabilities for the existing site and
future mirror sites.
Inevitably, the Web index will continue to grow,
as more sites appear and as Scooter penetrates the Web even deeper than
it is currently. In fact, rather than performing periodic crawls,
Scooter will soon crawl the Web continuously, updating the index even
more frequently than is possible today. In addition, Digital researchers
are tackling the difficult challenge of indexing Asian Web pages, where
multiple character encodings present unique difficulties. As we have
seen, though, nothing is impossible. And the result of this extra effort
simply means a greater wealth of information and resources available to
more and more Internet users.
Another enhancement to the site will be new
ease-of-use features to assist the user in refining queries for more
precise results. For example, a query on "bonds" might prompt the user
to clarify:
financial bonds
adhesives
James Bond
etc.
Through such prompting and refinement, AltaVista
will make it simpler for a new user and more effective for advanced
users to produce valuable results. Another refinement technique may take
advantage of "query builds"-using the query mechanism for multiple
searches built on each other. By successively refining each query, the
user can narrow down the number of matches and, again, focus in on the
most pertinent results.
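The text describes "query builds" only as a planned feature, so the following is a speculative sketch of one way successive refinement could work: each new word is searched only within the previous result set, which amounts to a set intersection. The tiny index, the page numbers, and the `refine` function are invented for illustration.

```python
def refine(previous_matches, index, word):
    """Keep only the previous matches that also contain the new word."""
    return previous_matches & index.get(word, set())

# Hypothetical index mapping each word to the set of page numbers containing it.
index = {
    "bonds": {1, 2, 3, 4},
    "financial": {1, 2},
    "municipal": {2},
    "adhesives": {3},
    "james": {4},
}

first_pass = index["bonds"]                          # broad query
second_pass = refine(first_pass, index, "financial")  # narrow to finance
third_pass = refine(second_pass, index, "municipal")  # narrow further

print(len(first_pass), len(second_pass), len(third_pass))
```

Each step can only shrink the result set, which matches the idea of narrowing down the number of matches with every refinement.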
Adding value to commercial content
There is also life for AltaVista beyond the
search site. The powerful search and indexing technology in AltaVista is
written in portable code, which has huge implications for where and how
this technology can be utilized.
One of the first likely applications is as a
value-added capability to selected content sites on the Internet. In
this scenario, a content site would incorporate AltaVista with some
level of added value beyond simply returning a list of Web links. The "raw"
data produced by AltaVista would undergo post processing and
reformatting by the host site. For example, a publishing company could
provide search capabilities across various categories of book titles.
The results of a search might produce a listing of matching titles,
additional titles by the same author, other selections with similar
subject matter, etc. The possibilities here are nearly limitless.
AltaVista at work
If we can index the entire World Wide Web, and
help the average Internet surfer find practically anything in a matter
of seconds, just think what we could do for today's corporations,
universities, and government agencies.
With all that Internet technologies have brought
to the public arena, many enterprises are recognizing the value of
adopting the same capabilities inside their operations-on private
intranets. Many have set up TCP/IP networks (the same as the Internet)
and are populating desktops with standard Web browsers, such as
Netscape.
The number of intranet Web pages is growing ever
more rapidly. For anyone on the enterprise network-as on the
Internet-spending time just in pursuit of information draws directly
away from productive endeavors. And unless information can be located
and retrieved, its value diminishes significantly. AltaVista technology
will be adapted for enterprises, to help them unlock the full value of
all the information on their intranets.
AltaVista Team Edition
A "Team" edition of AltaVista, for instance,
will provide an AltaVista query interface allowing browser users to
instantly find useful information anywhere on their private intranet. A
Scooter-like crawler will crawl the intranet, fetching internal Web
pages. In turn, an Ni2-like indexer will compile a complete, full-text
index of the entire body of material. The Team edition will only crawl
behind the corporate firewall and, initially, will only fetch HTML (Web)
information. As the technology evolves, however, all corporate data,
regardless of format, will be crawled and indexed.
AltaVista Enterprise Edition
An "Enterprise" edition will extend the reach of
the intranet crawl out onto the Internet. This will be a tailored
solution, enabling customers to specify a particular subset of the
Internet to be crawled. In addition, the Enterprise edition will enable
distributed enterprises, with numerous remote, branch, or home workers
connected across an Internet backbone, to be included in the crawl. The
result will produce a thorough index of all information in an
enterprise-unlocking hidden assets and maximizing the value contribution
of every individual.
AltaVista Personal Edition
Also planned is a "Personal" edition of
AltaVista, designed for the individual desktop. The Personal edition
will offer the same robust data gathering, indexing, and query
capabilities as the original AltaVista, specially tuned and packaged for
personal use. The key here, as with all AltaVista search tools, is speed
and depth. No other desktop tool on the market could match the
performance of AltaVista. You will have a single interface to query all
the information on your desktop-regardless of format-and locate relevant
files in seconds. Imagine being able to instantly find a choice piece of
data, buried in an e-mail message from three years ago. That's the power
of AltaVista brought to your desktop.
Conclusion
AltaVista technology has spawned a whole new
vision of the Internet, providing a higher view from which to locate and
access valuable resources. And it has inspired broader use of Internet
technologies across the enterprise and around the world.
Digital is building on the success of AltaVista,
with an entire family of software products. Using Internet technology as
a common, ubiquitous environment, AltaVista software provides users with
dynamic, global capabilities for exchanging information and ideas. It
combines the vast resources of the Internet, the navigational ease of
standard Web browsers, and the value of existing intranet assets.
AltaVista is truly revolutionary in scope,
breaking through traditional barriers that limit communication and
information access. It is both a vision and a reality, offering
immediate rewards as it inspires new, innovative technology. It is a key
that unlocks all the world of the Internet has to offer. So wherever
your business takes you, start your journey with AltaVista.