relevance algorithms used in most large web search engines today are based on fairly simple word-occurrence measurement: if the word "daffodil" occurs on a given page, then that page is considered relevant to a query on the word "daffodil", and its relevance is quantised as a function of the number of times the word occurs in the page and of whether "daffodil" occurs in the title of the page, in its META keywords, in the first N words of the page, in a heading, and so on; the same applies to words that a stemmer says are based on "daffodil". More elaborate (and resource-expensive) relevance algorithms may involve thesaurus (or synonym ring) lookup; e.g. such an algorithm might rank a document about narcissuses (which may not mention the word "daffodil" anywhere) as relevant to a query on "daffodil", since narcissuses and daffodils are basically the same thing. The same goes for queries on "jail" and "gaol", etc. More elaborate forms of thesaurus lookup may involve multilingual thesauri (e.g. knowing that documents in Japanese which mention the Japanese word for "narcissus" are relevant to your search on "narcissus"), or may involve thesauri (often auto-generated) based not on equivalence of meaning but on word proximity, such that "bulb" or "bloom" may be in the thesaurus entry for "daffodil". Word spamming essentially attempts to falsely increase a web page's relevance to certain common searches. See also subject index.
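
The word-occurrence scheme described above can be sketched roughly as follows. This is a minimal illustration, not any real search engine's algorithm: the weights, the value of N, the toy stemmer, the synonym ring contents and the names `relevance`, `WEIGHTS` and `SYNONYM_RING` are all assumptions made up for the example.

```python
import re

# Hypothetical weights, chosen only for illustration.
WEIGHTS = {"body": 1.0, "first_n": 0.5, "title": 5.0, "keywords": 3.0, "headings": 2.0}
FIRST_N = 50  # size of the "first N words" window (an assumed value)

# Toy synonym ring: each word maps to the words treated as equivalent to it.
SYNONYM_RING = {
    "daffodil": {"daffodil", "narcissus"},
    "narcissus": {"daffodil", "narcissus"},
    "jail": {"jail", "gaol"},
    "gaol": {"jail", "gaol"},
}


def stem(word):
    """Very crude stand-in for a stemmer: lowercase and strip a plural ending."""
    word = word.lower()
    if word.endswith("ses"):
        return word[:-2]
    if word.endswith("s") and not word.endswith(("ss", "us")):
        return word[:-1]
    return word


def tokens(text):
    """Split text into stemmed alphabetic words."""
    return [stem(w) for w in re.findall(r"[a-z]+", text.lower())]


def relevance(query_word, page, use_synonyms=False):
    """Score a page (dict of optional 'title', 'keywords', 'headings', 'body'
    strings) against a single query word by weighted occurrence counts."""
    target = stem(query_word)
    wanted = {target}
    if use_synonyms:
        # Thesaurus lookup: also count words in the same synonym ring.
        wanted |= {stem(w) for w in SYNONYM_RING.get(target, ())}

    def hits(words):
        return sum(1 for w in words if w in wanted)

    body = tokens(page.get("body", ""))
    score = WEIGHTS["body"] * hits(body)
    score += WEIGHTS["first_n"] * hits(body[:FIRST_N])
    for field in ("title", "keywords", "headings"):
        score += WEIGHTS[field] * hits(tokens(page.get(field, "")))
    return score


daffodil_page = {
    "title": "Growing daffodils",
    "body": "Daffodils bloom in spring. Plant each daffodil bulb in autumn.",
}
narcissus_page = {
    "title": "Narcissus care",
    "body": "Narcissuses need well drained soil.",
}

print(relevance("daffodil", daffodil_page))
print(relevance("daffodil", narcissus_page))
print(relevance("daffodil", narcissus_page, use_synonyms=True))
```

Under these assumptions the narcissus page scores nothing for "daffodil" until the synonym ring is consulted. The sketch also shows why word spamming works against such schemes: every extra repetition of a word in the body or title raises the score.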
Last updated: 1997-04-09