by: Albert Benschop
There are millions of Web pages out there. The latest estimate (March 1998) is that there are around 275 million pages of information on the Web (against 110 million in mid-1997). Nobody really knows. The Web is growing and mutating very fast, which is to be expected when there are so many people able to communicate online. Up to 20 million new pages appear per month. If this phenomenal growth rate continues, everyone on Earth will have their own personal page on the Web in four years' time.
Until recently, surfing was a typical approach for finding information on the Web. Surfing is unstructured and serendipitous browsing: you start with a particular Web page and follow links from page to page, make educated guesses along the way, hoping sooner or later to arrive at the desired piece of information. Surfing is browsing without tools. Surfing is fun when you have the time to explore. But if you need to find a specific piece of information quickly, or need to find that same information again, surfing and serendipity soon lose their charm.
As the WWW has grown it has become necessary to provide a quick and easy method of rapidly searching webspace. Search tools - often known as search engines - have been developed which can perform this activity. Search engines provide a front end to a database of indexed WWW resources, into which search keywords can be typed.
The number of search tools available on the WWW has grown quickly over a very recent period. This has posed new problems for WWW users. There is now a bewildering variety of search tools available - each offering different features and interfaces. Many are linked to sizeable catalogs of WWW resources, and some claim to offer a comprehensive index of the entire WWW. Some search just on machine names or directory and file names (the URLs), while others search on titles and headers of HTML pages as well. Some allow searching several different indexes.
The upshot here is (a) that there's no one best way of searching, and (b) that there's no perfect searching tool. All search tools and engines have their strengths and weaknesses. So your best bet is to learn how to use an entire arsenal of them. Most of us end up preferring different sets of tools, depending on how effective they are within our subject domain and on our personal searching style. For your subject interests, you actually need to explore and determine for yourself which works best for your purposes.
A subject catalog contains substantial numbers of links to Internet resources organized via subject categories created by someone familiar both with the topic and how people would seek information within it. It is an intelligently designed 'links library' or 'links index' that has been organized and compiled by subject experts. The intent is to guide searchers within a high-quality, large domain of selected resources.
Subject catalogs or virtual libraries are often organized hierarchically to make it easier to navigate from the general to the specific topic of interest. Well-written catalogs also contain cross-references between related topics under different headings.
The searchable domain of subject catalogs is smaller than that of most navigators and quality is dependent on the subject expertise and Internet experience of those doing the selecting. Subject catalogs are like subject guides but usually much more comprehensive and less narrative.
General Virtual Libraries Some subject catalogs or virtual libraries present their links with brief annotations, and are typically large with minimal restrictions as to what will be accepted for inclusion. Well-known virtual libraries that present links with brief annotations are: Galaxy, Infomine, Internet Public Library, Internet Sleuth (it's not only a search engine), Planet Earth, WWW Virtual Library Series Subject Catalog, WebSurfer, and the most popular Yahoo.
Reviewing Virtual Libraries Some virtual libraries provide significant added value to each link with commentaries and ratings provided by skilled reviewers. Here are some examples: NetReviews (from Excite), Magellan, Point Communications, and WIC (formerly GNN's Whole Internet Catalog).
Subject-specific Guides Subject-specific virtual libraries that function as subject bibliographies to Internet resources are being authored by subject specialists. These subject guides are specialized subject trees for specific disciplines. Examples are: ArchNet WWW Virtual Library (for archeology), and the Clearinghouse for Subject-Oriented Internet Resource Guides (for sociologists it provides access to: Social Sciences and Social Issues).
Geographical Guides A special kind of subject catalog is the geographical guide. In these guides you can search by continent, country, region, city, etc. Most of them use a series of increasingly specific clickable maps that let you zoom in on the locale you wish to visit. This is browsing in geographical space. Examples are: CityNet, Virtual Tourist2, and the GeoSurfer.
The words robot, spider, crawler, wanderer and worm are all used to describe computer programs which are designed to explore and compile information about the Web. These programs usually have a database to organize the data about the sites they encounter. Often, the database is put on the Web so that you the user can search it. Because each robot is programmed to search the Web in a different way, the information stored in each database can be very different.
A Web spider or robot examines a document and indexes it, or enters it into the database, based on words extracted from the title or text. In addition, the software also searches the document for pointers or URLs to other documents that haven't been indexed yet. Search engines work on the principle that the information content of a document can be summarized by extracting those words already present in the title or text. By ranking the extracted text by its position in title or text, the number of times it appears in the document, and other criteria, the database separates incidental words or phrases (known as 'false drops') from those relevant to the topic.
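The indexing principle described above can be sketched in a few lines. This is a minimal, hypothetical illustration - the function name `index_document` and the weighting scheme are invented for the example, and real engines use far more elaborate ranking criteria.

```python
import re
from collections import Counter

def index_document(title, text, title_weight=3):
    """Score each word by how often it occurs, counting words in the
    title extra heavily -- a toy version of ranking by position."""
    scores = Counter(re.findall(r"[a-z]+", text.lower()))
    for word in re.findall(r"[a-z]+", title.lower()):
        scores[word] += title_weight
    return scores

scores = index_document("Rural Sociology",
                        "Sociology studies society. Rural life differs.")
# 'sociology' and 'rural' now outrank incidental words like 'studies'
```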
Searchers can connect to a search engine site and enter keywords to query the index. Web pages and other Internet resources that satisfy the query are identified and listed.
Not all search engines are created equally. Search engines vary according to the size of the index, the frequency of updating the index, search options, the speed of returning a result set, the result set presentation, the relevancy of the items included in a result set, and the overall ease of use.
There are several types of navigators or search engines:
Uniform Meta-Search Engines There is a special kind of navigator that allows you to search other search engines. They are usually called meta-search engines. They allow you to search enormous domains of servers and documents from one point. The multi-threaded search engines are the most intelligent navigators. I will call them 'uniform' meta-search engines, because they typically use only one form to query the other search engines. They allow you to enter one search that will be automatically, and often simultaneously, conducted in several large search engines. Examples of good meta-search engines with a unified interface are: All4One, Dogpile, iFind, Javabot, MetaCrawler, SavvySearch.
Multiform Meta-Search Engines The second sort of mega-search engines are compilations or multi-form front-ends for other search engines. These meta-search engines offer, from one site, forms for entering searches in several navigators in serial fashion. This may be convenient, but sometimes you need to intervene and tune your search as the display features of the individual navigators allow. Examples are: All-in-One (compilation of forms for more than 120 search engines), 2ask (several hundred search engines), Internet Sleuth (several hundred), Infomine (more than 90), Search.com (more than 250), W3, and Cusi.
Geographically focused navigators These navigators are arranged by continent, country, city, etc. They often use a series of increasingly specific maps that allow you to get to the locale you wish to visit (and therefore they could also be seen as geographical guides). They can be the quickest way to get to a resource in a specific locale, and they are fast in finding Internet sites in remote areas. Examples are CityNet (from Excite) and the GeoSurfer.
Specialized or focused Search Engines Immense amounts of very useful materials can be found in other parts of the internet. There is a whole bunch of specialized navigators that focus on: Gophers, FTP sites, NewsGroups and Mailing Lists, Libraries, Ejournals, Software, Shareware, Products and Services. Some of them will be explained in the section for specialized search engines, others will be dealt with on special pages of the SocioSite such as the "Libraries", "Electronic Journals", "Who's where?", "What's New?", and the "NewsPaper and NewsServices" pages.
Most people start with Yahoo, because this is the best known search engine. This is not a bad choice because it also happens to be one of the best. Yahoo is a good place to start with, but it is a man-made directory and therefore carries only a limited amount of information.
For sociologists the Clearinghouse for Subject-Oriented Internet Resource Guides is a very productive source of information. This metacatalogue has specialized guides for many subjects in the social sciences.
Many people are impressed with Excite and InfoSeek. In many tests they get the highest honours. But there are some very good and fast alternatives: Hotbot, Alta Vista and Google also get high marks. Together they will give you almost all the results you need. And if you include the WWW Worm, you have little need for anything else.
If you're searching for keywords in documents, Open Text is extremely fast.
The meta-search engines provide one-stop access to several engines. You will see more mega or meta search machines on the market, especially multi-threaded search pages, which can rummage through multiple engines simultaneously. MetaCrawler is probably the best and fastest uniform meta-search engine, but newer systems like iFind and JavaBot are closing in. Although the metapages are very useful, they present their own limitations: they don't usually offer the full search customizability of the original engine, so the results are generally much less precise. And they can be painfully slow. A multi-threaded search page must work its way through several sites on the Net, any one of which might be tied up doing other work (the servers where these free services reside occasionally have to earn their keep by doing some real work like accounting or data processing). Such delays can grind your search to a standstill. The best results are achieved by using one query word only. The reason for this is that there is no standard for search engines on the Internet and they all have their own way of treating the words that you enter. When you enter two or more words, some search engines will default to treating those words as an implicit OR, while others will handle them as an implicit AND or as a phrase. Meta searches are only really useful if you are doing a very broad search or are familiar with the databases that you are searching. It's just like all the other things in life: you will enjoy it most when you are fully aware of the limitations.
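The multi-threaded principle can be sketched with a thread pool that fans one query out to several engines at once. The engine functions below are hypothetical stand-ins invented for the example; a real meta-search engine would issue an HTTP request to each engine and parse its result page.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for real engines; a real meta-search engine
# would fetch and parse each engine's result page here.
def engine_a(query):
    return [f"a:{query}/1", f"a:{query}/2"]

def engine_b(query):
    return [f"b:{query}/1"]

def meta_search(query, engines):
    """Send one query to every engine simultaneously and merge the hits."""
    with ThreadPoolExecutor() as pool:
        result_lists = pool.map(lambda engine: engine(query), engines)
    return [hit for results in result_lists for hit in results]

hits = meta_search("inequality", [engine_a, engine_b])
```

A slow engine here delays only its own thread, not the others - which is exactly why one engine "tied up doing other work" can still stall the final merged listing.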
On the search pages of the SocioSite you will find short reviews of the major engines. They give a description of the features and benefits of the engine, the content and size of the databases, the ways to search and the character of the results, and their pros and cons.
What these acquisitions mean for the cost of searching on the Internet remains uncertain. But one thing is clear: they want to make huge profits, and you are a potential victim of their exploitative strategies. So watch your wallets, and support all the initiatives to decommercialize the Internet.
Some people think that there might be a silver lining. They expect the tools to become more refined when we start paying for searches. We all want good search instruments to get precise results easily and quickly. If you think that the commodification of the Internet will create the instruments to reach this goal, you must be prepared to pay a price. And that will not only be the money you'll have to pay for every search action. It will also be the loss of free access to information on the Internet. Commercialization of the search engines implies that electronic access to information will be sold as a commodity to those who are able to pay for it. This would mean a serious infringement of the rights of the netizens - the loss of a liberated area that has been conquered with great efforts.
The popular search tools have moved from the laboratories of computer scientists and are now affiliated with for-profit organizations. That's the way things normally go in the post-modern age of cybercapitalism. But this is not a natural law. Social structures are made by men, so they can always be changed by men. That's also true for the way we organize the electronic access to the enormous wealth of information on the Internet. Erecting financial barriers would substantially affect the open character of the Internet. We may and can resist this. Nobody has to be ashamed of democratic initiatives - on the contrary.
Some search engines - like Alta Vista - support full Boolean searching. You can use 'and', 'or', 'not' and 'near' to expand or constrict a search.
Many search engines have two interfaces - one for simple keyword searching and another for more advanced queries using Boolean operators. Simple keyword search interfaces are located on the home page of each search engine, so these are the first interfaces that the user sees, and many users tend to use these by default rather than exploring the other options available. These interfaces generally provide a fast and easy to use tool for very simple WWW searching, but their use may be problematic given the size of the WWW and the generally diverse nature of material available.
Each search engine has its own distinct features and capabilities. In most cases instructions for using the search engine are included somewhere on the site. These instructions may contain unfamiliar terms that relate to specific functions. Here are the definitions for seven common functions (not all search engines provide all the functions defined).
Natural Language Queries: For novice Internet users, this is probably the easiest way to search the Web. Users enter questions in natural English, and the server software extracts relevant keywords to create a database query. For example, the phrase "Find pages about INEQUALITY, labor market, or segmentation" would resolve into the individual keywords INEQUALITY, labor, market, and segmentation.
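Keyword extraction of this kind can be approximated by stripping common 'stop words' from the question. This is a minimal sketch: the stop-word list below is a tiny illustrative subset, and real servers use much larger lists and smarter parsing.

```python
import re

# Tiny illustrative stop-word list; real servers use far larger ones.
STOPWORDS = {"find", "pages", "about", "or", "and", "the"}

def extract_keywords(question):
    """Drop common words from a natural-language question, leaving
    the content words behind as a keyword query."""
    words = re.findall(r"[a-z]+", question.lower())
    return [w for w in words if w not in STOPWORDS]

keywords = extract_keywords(
    "Find pages about INEQUALITY, labor market, or segmentation")
# -> ['inequality', 'labor', 'market', 'segmentation']
```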
Boolean Searching: Allows terms to be put into logical groups by the use of connective terms. One of the most popular ways servers handle multiple keywords is by linking each with a Boolean AND or a Boolean OR. For example, cats AND dogs narrows a search; cats OR dogs broadens a search; cats NOT dogs narrows a search. Each service explains its connective terms for Boolean searching in its help or FAQ file. Some systems default to a certain connective even when no connective term is typed, so in some cases cats dogs is treated as cats OR dogs.
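The effect of each connective can be shown by evaluating it against the set of words in a document. A minimal sketch - the function name and the set-based document model are invented for the example:

```python
def matches(doc_words, left, op, right):
    """Evaluate one Boolean connective against a document's word set."""
    if op == "AND":
        return left in doc_words and right in doc_words
    if op == "OR":
        return left in doc_words or right in doc_words
    if op == "NOT":
        return left in doc_words and right not in doc_words
    raise ValueError(f"unknown connective: {op}")

doc = {"cats", "kittens"}
matches(doc, "cats", "AND", "dogs")  # False: AND narrows the search
matches(doc, "cats", "OR", "dogs")   # True: OR broadens it
matches(doc, "cats", "NOT", "dogs")  # True: cats present, dogs absent
```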
Keyword Controls: Rather than requiring some relation between keywords, some search engines allow each keyword to be qualified individually. Each keyword in the query can be prefixed with special characters like + or - to indicate that it is required (much like a Boolean AND) or that it must not appear in the document. Often, unqualified keywords are linked with a Boolean OR by default. For example, the keyword-control query "inequality, labor, market -income +rights" is equivalent to the Boolean "(inequality OR labor OR market) AND (NOT income) AND rights".
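Parsing + and - prefixes can be sketched as follows; the query string is the example from above, and the helper names are invented for illustration:

```python
def parse_query(query):
    """Split a query into required (+), excluded (-) and optional terms."""
    required, excluded, optional = [], [], []
    for token in query.replace(",", " ").split():
        if token.startswith("+"):
            required.append(token[1:])
        elif token.startswith("-"):
            excluded.append(token[1:])
        else:
            optional.append(token)
    return required, excluded, optional

def document_matches(doc_words, query):
    """Require all +terms, forbid all -terms, and ask for at least
    one optional term (the implicit Boolean OR)."""
    required, excluded, optional = parse_query(query)
    return (all(t in doc_words for t in required)
            and not any(t in doc_words for t in excluded)
            and (not optional or any(t in doc_words for t in optional)))

document_matches({"labor", "rights"},
                 "inequality, labor, market -income +rights")  # True
```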
Keyword in context (KWIC): These searches will return the key word and N words near the key word to give the user the context in which the key word was found.
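A KWIC display can be sketched by returning each hit together with its N neighbouring words (the function name is invented for the example):

```python
def kwic(text, keyword, n=2):
    """Return each occurrence of keyword with up to n words of
    context on either side."""
    words = text.split()
    cleaned = [w.lower().strip(".,;:") for w in words]
    return [" ".join(words[max(0, i - n):i + n + 1])
            for i, w in enumerate(cleaned) if w == keyword.lower()]

snippets = kwic("The sociology of labor markets studies inequality in wages.",
                "labor")
# -> ['sociology of labor markets studies']
```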
Phrase Searching: Allows searching of phrases when available. Some systems can be confusing: you may think that "Rural Sociology" searches the two words together as a phrase, when in fact the engine is searching Rural OR Sociology.
Proximity Searching: Allows searching of one term within N words of another term, narrowing the search.
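Proximity can be checked by comparing word positions; here `near` (an invented name for the sketch) returns True when the two terms occur within N words of each other:

```python
def near(text, term1, term2, n):
    """True if term1 occurs within n words of term2 in the text."""
    words = [w.lower().strip(".,") for w in text.split()]
    positions1 = [i for i, w in enumerate(words) if w == term1.lower()]
    positions2 = [i for i, w in enumerate(words) if w == term2.lower()]
    return any(abs(i - j) <= n for i in positions1 for j in positions2)

near("rural sociology of the labor market", "rural", "labor", 5)  # True
near("rural sociology of the labor market", "rural", "labor", 2)  # False
```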
Relevance Feedback: Attempts to measure how closely the retrieval matches the query, usually in quantitative terms between 0 and 100 or 0 and 1,000.
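A crude relevance score on a 0-100 scale can be computed as the share of query terms a document contains. This is only a sketch of the idea; real engines combine many more signals (term frequency, position, and so on).

```python
def relevance(doc_words, query_terms):
    """Score 0-100: the fraction of query terms found in the document."""
    if not query_terms:
        return 0
    found = sum(1 for term in query_terms if term in doc_words)
    return round(100 * found / len(query_terms))

relevance({"labor", "market", "wages"},
          ["labor", "market", "inequality"])  # 67: two of three terms match
```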
Truncation Searching: Allows searching on different word endings or plurals with the use of a truncation wild card symbol (a sort of suffix management on keywords). This helps users to get the most for their queries by generalizing each keyword to its roots, and expanding the search to include all forms of that root word. For example, if the truncation symbol is *, then the search term econ* will return items that contain economics, economy, economic, and econometric. Car* will return items that contain cars and cartoon, so it is advisable to use truncation symbols judiciously. Most servers perform the truncation automatically according to their own rules. Some servers allow the users to choose which words are truncated, typically by appending a * character to the end of the root word. See individual help files for the specific truncation symbol used with each engine, when available.
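Truncation amounts to prefix matching. The sketch below reproduces the econ*/car* example (the function name is invented for illustration):

```python
def truncation_match(term, word):
    """Match a term ending in '*' as a prefix; otherwise match exactly."""
    if term.endswith("*"):
        return word.startswith(term[:-1])
    return word == term

words = ["economics", "economy", "econometric", "ecology", "cars", "cartoon"]
econ_hits = [w for w in words if truncation_match("econ*", w)]
# -> ['economics', 'economy', 'econometric']
car_hits = [w for w in words if truncation_match("car*", w)]
# -> ['cars', 'cartoon']  (over-broad: the wildcard also catches 'cartoon')
```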