Search and Ye Might Find: How the Very Success of the Internet Makes It Harder to Find the Good Stuff
bob@cringely.com
I am luckier than most Internet users because I run my own mail server. It's an old PC from around 1995 that I bought, as I do nearly everything, on sale. Once a week, I reboot it, and about every 18 months, I upgrade the software, but otherwise, it just soldiers on the way we like to pretend all computers do. It is easily the most reliable thing in my life. But its function isn't really to send and receive e-mail. What it does most is reject e-mail.
My server has a spam filter that I update frequently. As a result, it bounces more messages than it accepts. Unwanted e-mail has become for most of us a problem that comes with an easily quantifiable cost. How much time does it take you to read and/or delete those messages? How much bandwidth and how many CPU cycles do they consume? How many spam messages actually prove to be of value to the recipient? I can't specifically recall a spam message that I was glad I read — ever. But the worst part is that they just keep coming, inexorably.
There is a way to stop this, of course. That's what the law says. There is always supposed to be a way to tell the spammers to take our names off their lists and stop sending. The method may be a mailto: link in the message, a link to an unsubscribe page, or just a reply with the word "unsubscribe" in the subject line. To comply with the law, every spammer must give us a way to stop the flow. But apparently nothing requires that method, whatever it is, to actually work.
I find that more and more of these unwanted messages have what look like ways to end the flow, but those ways don't work. The unsubscribe link may look like a mailto: link, but it doesn't function like one. Every other link in the message works perfectly, just not the one I want to use. Replying with "unsubscribe" doesn't work. Nothing works except telling your mail server not to accept delivery.
It's not just spam that is drowning the world. We're also awash in bad Web content. Now Web content is quite different in that we go to it rather than having it come to us, but the flood is still real because it inhibits our ability to get to the content we really want. This is because of what I view as a failing in the current search engines. In any other industry, pointing out an area in which a product is deficient would look like criticism. But in the Internet industry, I hope this is more properly perceived as an invitation to start a new company to fix this problem and make a few bucks along the way.
Without search engines, the World Wide Web would be completely unusable for many people. We're that dependent on them. The engines work in different ways and have different strengths, but the ones I use primarily are AltaVista, Excite, Google, HotBot, and Northern Light. AltaVista is comprehensive and fast, and offers the unique ability to translate many foreign language pages into what purports to be English. HotBot is FAST. But to me the really interesting engines are Excite, Google, and Northern Light.
Excite is interesting to me mainly for nostalgic reasons, because I first saw it in 1993 when the program was running only on a cluster of PCs in a Palo Alto garage. Excite search technology is interesting and underutilized because it could be useful for many more things than just searching the Web. Most traditional search engines prior to Excite used boolean keyword searches or boolean search augmented with a thesaurus or so-called topic-tree searches. This is all old stuff. And so too was Excite's "vector-based searching," which was invented almost 40 years ago, but never quite worked right. Vector-based searching uses no boolean operators, no thesauruses or topic trees. It doesn't even matter what the words mean. All that matters are the words themselves.
Vector-based searching begins with making an index of words in a document. Using this column as an example, the software would examine all the words I have written here, throw away words that carry no real information (words like "the," "and" and most verbs), then count the instances of each of the remaining words. Each word in the column becomes a vector in a multidimensional space. If I have used the word "Internet" 15 times in this column, then "Internet" defines the direction of the vector and 15 is its length. Adding all the vectors in this column yields a single vector that represents the entire column in a multidimensional space defined by all the words in all the articles in the entire database.
Doing a search using this system is simply a matter of entering a natural language query, which is parsed and indexed in exactly the same manner, yielding another vector. This search vector is plotted in the multidimensional space and the search results are those vectors (those articles) that are nearest in space to the query vector. The closer to the query vector an article vector lies, the more likely that article is to answer the question posed in the query.
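For the programmers in the audience, here is a minimal sketch of the idea, not Excite's actual code. The stop-word list, the function names, and the sample documents are all my own assumptions. I also assume that "nearest" means the smallest angle between two vectors (cosine similarity); the description above doesn't say which distance measure is used.

```python
import math
import re
from collections import Counter

# Assumed stop-word list; a real index would throw away far more words.
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "in", "is", "it", "for", "on"}


def to_vector(text):
    """Index a text: each remaining word is a direction, its count is the length."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(word for word in words if word not in STOP_WORDS)


def cosine_similarity(a, b):
    """Return 1.0 when two vectors point the same way, 0.0 when they share no words."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, documents):
    """Return the names of matching documents, nearest to the query first."""
    query_vector = to_vector(query)
    scored = []
    for name, text in documents.items():
        score = cosine_similarity(query_vector, to_vector(text))
        if score > 0:
            scored.append((score, name))
    # Highest similarity first; ties are broken alphabetically by name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]


# Hypothetical three-document database.
documents = {
    "spam": "spam filter rejects spam e-mail on the mail server",
    "search": "search engines index the Web and rank search results",
    "video": "video compression software for streaming video",
}
print(search("how do search engines rank results", documents))
```

Notice that the query is indexed by the same function as the documents, which is the whole trick: a question and an article end up as the same kind of object and can be compared directly.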
To give a simpler example, imagine using this technology on a database of resumes. In the database are the indexed resumes of every member of the university class of 2000 in America. Now take another resume and use it as a natural language query. This is the resume of the very best salesperson in your company, the sort of person your company would like to hire many more of. Indexing this query resume yields a vector in the multidimensional personnel space. The vectors of other resumes that are closest to the query vector are the people most like your ideal employee. Without even reading the resumes, you can tell that these are the right people to interview. It amazes me that nobody has yet used Excite's technology for this purpose.
Google is a much later technology and takes advantage of the prior work done at Excite. Both products, after all, emerged from Stanford University. But Google's great strength is the simple idea that the Internet, itself, can show us which sites are more interesting than others. If you have 4,000 pages that fit your query, which page do you list first? Google lists first the pages that are linked to most by other pages. This assumes that such links aren't frivolous and that following them can save us some time.
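The link-counting idea as I have described it can be sketched in a few lines. This is only the simplified version from the paragraph above, a plain count of inbound links, and makes no claim to be Google's real ranking method. The function name and the (source, target) data shape are assumptions for illustration.

```python
from collections import Counter


def rank_by_inbound_links(matching_pages, links):
    """Order the pages that fit a query by how many other pages link to them.

    links is an iterable of (source, target) pairs, a hypothetical data shape.
    """
    inbound = Counter()
    seen = set()
    for source, target in links:
        # Ignore self-links and count each linking page only once per target.
        if source != target and (source, target) not in seen:
            seen.add((source, target))
            inbound[target] += 1
    # Most-linked first; ties are broken alphabetically.
    return sorted(matching_pages, key=lambda page: (-inbound[page], page))


# Hypothetical link graph: pages "a" and "c" both point at "b".
links = [("a", "b"), ("c", "b"), ("a", "c"), ("b", "b"), ("a", "b")]
print(rank_by_inbound_links(["a", "b", "c"], links))
```

Skipping self-links and duplicates is one cheap guard against the frivolous links mentioned above, though a determined publisher could obviously still game a raw count.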
Northern Light subdivides the search results into folders so if you look up "Chevrolet," one of the folders will be Chevys for sale, which might be what you were really seeking. A downside of Northern Light is that it wants you to pay to see some content. Fortunately, there is usually enough information about the content available that you can use another search engine to reach it without paying.
This is all great, but something is still missing. We didn't have a sense of it being missing in the early days because it took a few years of Web building before the problem could even be seen. I'm talking about outdated information.
My sense of how all this information gets on the Web in the first place is that publishers start putting content on Web sites and continue to shovel it in from that point forward. Sometimes, though rarely, they'll go back and load some older, pre-Web records. Our assumption in the early days was that pretty much anything we came across was either new or recent. But lately, I have been looking for information on video compression software for a streaming video project and finding a lot of obsolete information from 1997-98. As time passes, we'll have more and more useless junk that we don't really want until, in a few years, it will be hard to find much that's new.
So I want a new search engine that sorts the search results to put the latest information at the top. Or maybe there is a way of representing the URLs in different colors to show what is new since the last spider visit, what is no more than a month old, six months old, one year old, and maybe a color to represent new old material (archival stuff that has been recently loaded). The best world would have the colors AND allow me to sort by those colors.
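To make the wish concrete, here is a rough sketch of the sorting and coloring. It assumes the hard part is already solved: that every result comes with a trustworthy publication date. The colors, the day thresholds, and the function names are all made up for illustration, and the sketch leaves out the "new since the last spider visit" and "new old material" cases, which would need more data than a single date.

```python
from datetime import date

# Hypothetical age buckets, checked in order: (maximum age in days, color).
AGE_BUCKETS = [
    (30, "green"),    # no more than a month old
    (182, "yellow"),  # no more than six months old
    (365, "orange"),  # no more than one year old
]
OLDER_COLOR = "gray"  # anything older than a year


def color_for(published, today):
    """Pick a display color from the age of a page in days."""
    age_days = (today - published).days
    for max_days, color in AGE_BUCKETS:
        if age_days <= max_days:
            return color
    return OLDER_COLOR


def sort_newest_first(results, today):
    """results is a list of (url, published_date); return (url, color), newest first."""
    ordered = sorted(results, key=lambda result: result[1], reverse=True)
    return [(url, color_for(published, today)) for url, published in ordered]


# Hypothetical search results with assumed publication dates.
results = [
    ("http://example.com/old", date(1997, 5, 1)),
    ("http://example.com/new", date(2000, 5, 20)),
    ("http://example.com/mid", date(2000, 1, 15)),
]
for url, color in sort_newest_first(results, today=date(2000, 6, 1)):
    print(color, url)
```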
Maybe this capability is already out there, but I haven't seen it. The challenge, of course, is to decide conclusively how old things are, given that some pages are new postings of old material. But that's what young, smart programmers are for, right? Now one of you get out there and start building this product.
And don't forget who gave you the idea.