Search Engines: The Relevance of Underpants to Searching the Web
In the courses that I run on Internet searching, people often express concern, surprise or plain puzzlement at the way in which search engines relevance-rank the results they return. Consequently, I thought it might be interesting to cover this subject in the column.
Why rank pages? The search engines that I call 'free text' search engines will, in general terms, be much more comprehensive than those that provide access to websites via a directory or index; good examples are AltaVista [1], HotBot [2], Northern Light [3] and so on. These will commonly index many millions of pages (this isn't the time to go into detail about just how many pages they do or don't index, but suffice it to say most people would agree on a figure of between 200 and 300 million). As a result, even relatively obscure terms are going to be found on a fair number of pages, and a search on a more common term will retrieve thousands of hits. Let's take the word 'underpants' for example, which is the term that Lycos are currently using in their television advertisements. A search for that term retrieves 4,537 hits on Lycos [4], 22,442 on Northern Light and 23,192 on AltaVista. (As an aside, I hate to think what context the term has been used in, since I didn't dare look!)
Now, there is very little point in returning the web pages in no order at all, so all the search engines will try to decide which pages are the most appropriate, or relevant, in order to give the user the 'best' of them first. To do this, they use a series of algorithms. Unfortunately, search engines tend to view these as one of their secret weapons; many of them trade on the claim that they give you the 'best' results, so it's not in their interests to make much information on how they rank publicly available. It's up to individuals to look at the results that get returned, examine the pages and try to work out for themselves why page 'a' is ranked higher than page 'b'.
The advantages and disadvantages of ranking
It's clearly obvious that pages have to be ranked; if they weren't, it would be impossible to find the most useful page out of many thousands. However, there are a number of disadvantages. Perhaps the biggest of these is that the engines can only be as good as the search they've been asked to run. Imagine that a client comes into your library or information centre, looks at you and just says 'Underpants' - apart from thinking they need medical help, it's almost impossible to help them. Do they want a history of underwear, or a list of suppliers, or references to underpants in the news, or something else entirely? Search engines have the same problem; unless users put their requests into some sort of context, they have very little to work with. Consequently, the knack of doing a good search is to add in more terms to provide an appropriate context, and I'm sure I don't need to tell you how to do that.

A second, almost as big, disadvantage is that searchers are at the mercy of the author's ability to write a web page that will get a good ranking from the engines. Most people can write a reasonable page or site; one that loads quickly, is content rich and can be navigated easily. The problem is that if they don't understand how search engines work, it's a matter of good luck whether they get a good ranking. Of course, the opposite is also true: if you know how the engines work you can get a good ranking for your site almost regardless of its content. How many of us have run a perfectly straightforward search, only to end up with one or more pornographic sites in the top ten? This isn't the fault of the search engine, or even the person doing the search; it's down to the fact that pornographers are very good at getting high rankings, and they spend a lot of time tweaking their sites to ensure this happens. I don't blame them for that, although I accept it is an annoyance for those of us who don't have a particular interest in such sites, but I do think it's a shame that more authors of sites and pages don't go the extra mile to get good rankings.
What do search engines take into account?
By necessity, this section is going to be vague, since I don't know exactly what any one search engine regards as an important factor when deciding relevance. However, I can tell you some of the things that they take into account, and I hope you will find this useful for two reasons. Firstly, if you are a searcher, it should help you to construct a better search strategy, and to understand why you retrieve what you retrieve. Secondly, if you're writing a website yourself, an understanding of relevance ranking may come in handy when you're designing your pages.
1. Words in the title
By title, I don't mean what you see in the main body of the screen; I mean what you see at the very top of the browser window, or when you bookmark a site or add it to your favourites. Some search engines pay very great attention to the words that they find in the title element of a web page. To once again use my slightly bizarre search term, the first page retrieved by AltaVista is for 'DryNites disposable absorbent underpants', while the first entry on Northern Light is entitled 'It's all about the underpants' (I'm honestly not making any of this up, incidentally!). In fact, at Northern Light at least the first 80 entries contained the word in the title element (and probably more besides, but I couldn't cope with looking at more results!), while back at AltaVista the same can only be said of two of the top ten entries. Interestingly, although Northern Light also indexes the DryNites site, it doesn't appear in its own top ten at all. Consequently, it's clear that Northern Light pays more attention to words in the title than AltaVista does.
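To make this a little more concrete, here is a small sketch in Python (entirely my own illustration; no engine publishes its real formula) of how an indexer might give a flat scoring bonus to a page whose title element contains the search term. The weight of 10 is an arbitrary figure I have picked for the example.

    from html.parser import HTMLParser

    class TitleExtractor(HTMLParser):
        """Pull the text of the <title> element out of an HTML page."""
        def __init__(self):
            super().__init__()
            self.in_title = False
            self.title = ""

        def handle_starttag(self, tag, attrs):
            if tag == "title":
                self.in_title = True

        def handle_endtag(self, tag):
            if tag == "title":
                self.in_title = False

        def handle_data(self, data):
            if self.in_title:
                self.title += data

    def title_bonus(html, term, weight=10):
        """Add a flat bonus to a page's score if the term appears in its title."""
        parser = TitleExtractor()
        parser.feed(html)
        return weight if term.lower() in parser.title.lower() else 0

    page = "<html><head><title>It's all about the underpants</title></head><body>...</body></html>"
    print(title_bonus(page, "underpants"))   # 10 - the term is in the title, so the page gets the bonus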
2. The number and position of search terms input
Most of us are aware that if we put several terms into a search, the engines will usually do an OR search, which results in a large number of references, and that they will rank web pages containing all of those terms higher than pages containing only some of them. It therefore makes sense to give search engines more to work with, rather than less. However, what many people don't realise is that search engines will often (but not always) pay attention to the order in which you have input the words.
For example, the search:
cars car automobile jaguar
and
jaguar cars car automobile
look as though they are the same search. However, if you run these two searches on AltaVista and Northern Light you will get totally different sites in the top ten, and paradoxically it seems that the first search returns rather better references than the second. I'm not going to pretend that I follow the logic behind this, because I don't - all I can assume is that both engines think most people will put their preferred term last in the list, rather than at the beginning. I would, however, strongly suggest that you run a couple of test searches on your own preferred search engine(s) to see whether the same holds true for those as well.
3. Words in the <H1> ... </H1> header
If you don't author pages yourself, the above will seem like gibberish. However, <H1> and </H1> are the opening and closing tags used by authors to give text greater prominence, rather like chapter headings. It is therefore logical that if a search engine finds two pages that include the same words, but in one of them the words appear within the <H1> ... </H1> tags, it will rank that page above the one where they don't. Consequently, if you're an author yourself, it's worthwhile working out what your keywords are and giving them extra prominence on your pages, as the sketch below illustrates.
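By way of illustration only, here is a rough Python sketch that counts occurrences of a term but counts hits inside <H1> tags more heavily. The weights of 5 and 1 are invented for the example; real engines keep their figures to themselves.

    import re

    # Invented weights for the sake of the example - real engines don't publish theirs.
    WEIGHTS = {"h1": 5, "body": 1}

    def weighted_count(html, term):
        """Count occurrences of a term, weighting hits inside <H1>...</H1> more heavily."""
        term = term.lower()
        h1_text = " ".join(re.findall(r"<h1[^>]*>(.*?)</h1>", html, re.I | re.S))
        # Crude body count: strip all tags (heading text is counted again here; fine for a sketch).
        body_text = re.sub(r"<[^>]+>", " ", html).lower()
        return WEIGHTS["h1"] * h1_text.lower().count(term) + WEIGHTS["body"] * body_text.count(term)

    with_h1 = "<h1>Widget catalogue</h1><p>We sell widgets.</p>"
    without = "<p>Widget catalogue. We sell widgets.</p>"
    print(weighted_count(with_h1, "widget"), weighted_count(without, "widget"))  # 7 2 - the <H1> page wins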
4. Repetition
This used to be a key way in which search engines ranked pages; after all, if a page mentions 'widgets' seven times, it has to be more relevant than a page that only mentions 'widgets' twice, surely? Logically that is the case, but unfortunately this idea was picked up very early on by some website authors, and a favourite trick was to choose a background colour for a page and then, at the end of the page, in a small font size and with the text in the same colour as the background, repeat keywords over and over again. The search engines would see the words, but the casual viewer would simply see half a screen or so of apparently blank space. The search engines soon caught onto this trick, however, and so they tend either to ignore repetitions such as widget widget widget widget widget widget (etc.) or to downgrade or even remove such pages from their indexes. However, some engines may well still pay attention to repetition throughout a page if it's done correctly.
5. Proximity
Pages that contain the words you have asked for in your search, where those words are close together, are generally ranked higher than pages where the words are spread throughout the text. Consequently, even if you don't do a phrase search (which is always a good idea), pages containing the words 'white' and 'underpants' will rank higher when the terms are found next to each other.
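Here is a small Python sketch of one way of measuring proximity - the number of words separating two terms - which I have made up for illustration; I don't know which measure any particular engine actually uses.

    def min_distance(text, term_a, term_b):
        """Smallest number of words separating two terms; None if either is missing."""
        words = [w.strip(".,!?").lower() for w in text.split()]
        pos_a = [i for i, w in enumerate(words) if w == term_a]
        pos_b = [i for i, w in enumerate(words) if w == term_b]
        if not pos_a or not pos_b:
            return None
        return min(abs(a - b) for a in pos_a for b in pos_b)

    close = "a pair of white underpants on the washing line"
    far = "white sheets hung next to a line full of underpants"
    print(min_distance(close, "white", "underpants"))  # 1 - adjacent, so this page would rank higher
    print(min_distance(far, "white", "underpants"))    # 9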
6. Unusual or rare terms
If you search for the two words 'bloomers' and 'underpants', then because 'bloomers' is the more unusual term (though I have to admit this is an assumption on my part; I've not actually checked), web pages that contain 'bloomers' will rank higher than those which only contain 'underpants', and pages that contain both will rank higher still. Therefore, when running your searches, simply be aware that using unusual terms will affect the ranking you get.
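The usual way of expressing this idea is to weight a term by how few pages it appears in. The sketch below, using a made-up four-page collection, shows the principle, though I'm not claiming any particular engine calculates it exactly this way.

    import math

    # A toy collection of pages, invented purely for the example.
    pages = [
        "a history of underpants and other underwear",
        "buy underpants online",
        "victorian bloomers and underpants",
        "underpants in the news this week",
    ]

    def rarity_weight(term, collection):
        """Weight a term by how few pages contain it (a rough inverse document frequency)."""
        containing = sum(1 for page in collection if term in page.split())
        return math.log(len(collection) / containing) if containing else 0.0

    print(rarity_weight("underpants", pages))  # 0.0 - it appears on every page, so it carries no weight
    print(rarity_weight("bloomers", pages))    # ~1.39 - rarer, so it counts for more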
7. Meta tags
These are tags that can be added to a web page by the author to give extra emphasis to certain terms. Some (but not all) engines look for the meta tag element, which is not visible on the page itself, though you should be able to see it if you use View | Source. Once again, however, you are at the mercy of the author here: if she remembers to put the tags in, they can greatly affect the ranking, but if they are left out, the ranking may be lower.
8. Links
Some engines pay particular attention to the number of other pages that link to the pages they retrieve, on the assumption that if lots of people link to a page it will in some way be a 'better' page than one that very few people link to. (The flaw here, of course, is that new pages will in many cases have fewer links than older pages.) Google [5] is a good example of a search engine that utilises this technique, and many people find that it gives particularly good results.
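As an illustration of the principle only - Google's own method is rather more sophisticated than a raw tally - here is a Python sketch that counts how many pages in a small, invented link graph point at each page.

    # An invented link graph: each page maps to the pages it links to.
    links = {
        "a.html": ["b.html", "c.html"],
        "b.html": ["c.html"],
        "d.html": ["b.html", "c.html"],
        "c.html": [],
    }

    def inbound_counts(graph):
        """Count how many pages link to each page - the simplest form of link-based ranking."""
        counts = {page: 0 for page in graph}
        for targets in graph.values():
            for target in targets:
                counts[target] = counts.get(target, 0) + 1
        return counts

    print(inbound_counts(links))  # c.html has three inbound links, so it would be ranked highest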
9. Density
If a web page is, say, 100 words long and mentions 'underpants' 5 times, that's a density of 5%. If another web page is 1,000 words long and contains the same word 10 times, its density is only 1%. Although the second page has more occurrences of the word, the first page may well rank higher, since the word is, relatively speaking, more common on that page.
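The calculation itself is simple enough; here is a short Python sketch of it, using made-up pages padded out with filler words to the lengths given above.

    def keyword_density(text, term):
        """Occurrences of a term as a percentage of the total word count."""
        words = text.lower().split()
        if not words:
            return 0.0
        return 100 * words.count(term.lower()) / len(words)

    short_page = " ".join(["underpants"] * 5 + ["filler"] * 95)    # 100 words, 5 occurrences
    long_page = " ".join(["underpants"] * 10 + ["filler"] * 990)   # 1,000 words, 10 occurrences
    print(keyword_density(short_page, "underpants"))  # 5.0 - this page would rank higher
    print(keyword_density(long_page, "underpants"))   # 1.0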
10. Paying for placement
Some search engines have tried this in the past, but because people don't particularly like it (since it distorts the results they retrieve), the majority of engines have dropped the idea, preferring instead to make their money by linking appropriate advertisements to the searches the user is running.
Conclusion
As you can see, it's a very confusing area, so if you've ever been puzzled as to why one page ranks higher than another, you're not alone! Ranking is a complicated process, not helped by the fact that the engines all do it differently. That's the polite way of looking at it; the more blunt view is that it's a total mess, compounded by the secrecy surrounding ranking. However, for those engines that use this process (rather than the index/directory based approach favoured by Yahoo! among others) it's the best that we can hope for. Ideally a controlled thesaurus of some sort (along the lines of the Dublin Core [6], for example) would help bring some order to the chaos, at least in the area of meta tags. However, I'm not going to hold my breath on this one, since agreeing on any sort of standard on the Internet is akin to trying to herd cats.
All I can suggest is that if you try a search and it doesn't work on one engine, it's not necessarily because you've done a bad search; it might simply be that the ranking process doesn't reflect your own ideas of relevance. Try another engine and run the same search again, and you may be lucky! Oh yes, and if you ever have to do a search for underpants you have my sympathy, because there are a lot of weird sites out there - thanks, Lycos!
References
1. AltaVista
<http://www.altavista.com>
2. HotBot
<http://www.hotbot.com>
3. Northern Light
<http://www.northernlight.com>
4. Lycos
<http://www.lycos.com>
5. Google
<http://www.google.com>
6. The Dublin Core Metadata Initiative
<http://mirrored.ukoln.ac.uk/dc/>
Author Details
Phil Bradley
5 Walton Gardens, Feltham, Middlesex.
Email: philb@philb.com
Phil Bradley is an Independent Internet Consultant