I, Cringely - The Survival of the Nerdiest with Robert X. Cringely

The Pulpit

Weekly Column

Just the Facts Ma'am: iReader finds meaning whether we want it to or not.

Status: [CLOSED] comments (48)
By Robert X. Cringely
bob@cringely.com

Computing interfaces last a long time. Though a thousand readers will correct me with their superiorly nuanced views of the past, let's say we generally began with punched cards, went to command lines, then to text-based graphical interfaces, followed by a true graphical interface, and for the last decade a lot of people have viewed that graphical interface at least in part through a browser. But what if you have a computer interface that doesn't follow these broad lines? What do you do? Syntactica's iReader may show us.

IReader (which was called Speed Reader until a moment ago) started life a year ago as a search product called ePrecis, an application that could look at an article, a book, or a web site and give you its meaning in a prioritized list of short sentences. If you allowed ePrecis to return enough sentences, it would eventually return the entire object being searched. But if you limited the number of sentences, it returned the best approximation of the TOTAL content possible within the space constraints.
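
To make the "prioritized list of short sentences" idea concrete, here is a minimal sketch in Python. It is not Syntactica's method: how ePrecis actually ranks sentences isn't described here, so this stand-in simply scores each sentence by the average frequency of its words across the whole text. The function name prioritized_sentences and the limit parameter are hypothetical, made up for illustration.

```python
import re
from collections import Counter

def prioritized_sentences(text, limit):
    """Return up to `limit` sentences, highest-scoring first.

    ASSUMPTION: average word frequency stands in for whatever
    linguistic scoring the real product uses.
    """
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Count every word in the whole text once.
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    # Highest score first; a large enough limit returns every sentence.
    return sorted(sentences, key=score, reverse=True)[:limit]
```

With a small limit you get only the top-ranked sentences; raise the limit far enough and you get the whole text back, just reordered by priority, which matches the behavior described above.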

I'd love to write here that ePrecis told me the meaning of Moby Dick was "sea mammal obsessions are bad," but it doesn't work exactly that way, or at least not with that much fun. It doesn't matter anyway because ePrecis lived precisely three days before being effectively shut down by Google. You see, to most effectively search the Internet, ePrecis took the shortcut of searching the best Internet proxy -- Google -- which was NOT a good idea.

While the product got some good press last year, it didn't last of course, and a lot of that good press came AFTER ePrecis was effectively dead.

But did it really have to die? I would have just switched the search target to the Internet Archive and expected the same or better results, but of course ePrecis, by scraping Google, was effectively riding on Google's PageRank algorithm, which gave greater relevance to the results. If I were Google I wouldn't like it, either.

So now a year has passed and the same folks are back with iReader, another approach based on the same underlying technology.

Instead of being a web search engine, spiders and all, iReader is a tool to create synopses of content based on browsing, not searching, and on mousing, not clicking. These distinctions are important, in large part for legal reasons intended to keep the Googlers at bay. Searching pretty much requires scraping the Internet for content that is then indexed, while iReader's new browsing metaphor doesn't kick into action until the user mouses over a URL (no clicking required, hence no stepping on the toes of Google or any Google competitors). Only then does a Syntactica server take a quick look at the URL, process it through the same linguistic engine used in ePrecis, then spit out a short synopsis of the content. The fact that this can take place in real time with a lot of people online at any one time is pretty darned impressive.

IReader, which functions as both a Word macro (great for compiling abstracts and indices, I suppose) and a plug-in for Firefox or Internet Explorer, is amazing and fascinating. It can also be annoying.

The underlying process is what I find fascinating, perhaps because of my personal involvement with one of the earliest search engines -- Architext, later called Excite -- that also took a statistical linguistics approach.

Most traditional search engines prior to Excite used Boolean keyword searches or Boolean search augmented with a thesaurus or so-called topic-tree searches. This is all old stuff. And so too was Excite's "vector-based searching," which was invented almost 40 years before, but never quite worked right. Vector-based searching uses no Boolean operators, thesauruses or topic trees. It doesn't even matter what the words mean. All that matters are the words themselves.

Vector-based searching begins with making an index of words in a document. Using this column as an example, the software would examine all the words I have written here, throw away words that carry no real information -- words like "the," "and" and most verbs -- then count the instances of each of the remaining words. Each word in the column becomes a vector in a multidimensional space. If I have used the word "Internet" 15 times in this column, then "internet" defines the direction of the vector and 15 is its length. Adding all the vectors in this column yields a single vector that represents the entire column in a multidimensional space defined by all the words in all the articles in the entire database.
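
The indexing step just described can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Excite's code: the tiny STOP_WORDS list and the term_vector function are hypothetical, and a real engine would use a much larger stop list.

```python
import re
from collections import Counter

# ASSUMPTION: a toy stop list for illustration only.
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "in", "is", "it", "that"}

def term_vector(text):
    """Map each remaining word to its count: word = direction, count = length."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

sample = "The Internet is big. Google indexes the Internet, and the Internet grows."
vector = term_vector(sample)
print(vector["internet"])  # 3
```

Each key in the resulting Counter is one dimension of the space, and its count is the length of the document's vector along that dimension.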

Doing a search using this system is simply a matter of entering a natural-language query, which is parsed and indexed in exactly the same manner, yielding another vector. This search vector is plotted in the multidimensional space and the search results are those vectors (those articles) that are nearest in space to the query vector. The closer to the query vector an article vector lies, the more likely that article is to answer the question posed in the query.
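
Here is a hedged sketch of that search step in Python. "Nearest" is measured here by cosine similarity (the angle between two vectors), which is a common choice but an assumption on my part; the names term_vector, cosine_similarity, and search, along with the toy stop list, are hypothetical.

```python
import math
import re
from collections import Counter

# ASSUMPTION: a toy stop list for illustration only.
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "in", "is", "it", "that"}

def term_vector(text):
    """Index text the same way for articles and for queries."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

def cosine_similarity(a, b):
    """1.0 means same direction; 0.0 means no words in common."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def search(query, articles):
    """Return titles of matching articles, nearest to the query first."""
    q = term_vector(query)
    scored = [(cosine_similarity(q, term_vector(text)), title)
              for title, text in articles.items()]
    return [title for score, title in sorted(scored, reverse=True) if score > 0]

articles = {
    "whales": "Moby Dick is a novel about a whale and the sea",
    "search": "Google ranks web pages with the PageRank algorithm",
}
print(search("Which algorithm does Google use?", articles))  # ['search']
```

Note that nothing in this code knows what any word means; two texts are "close" only because they share words.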

EPrecis and now iReader use a similar approach, but where the actual words didn't matter to Excite, they matter a LOT to these new products. The magic here is a so-called "intelligent dictionary" or lexicon compiled over more than 20 years beginning at Control Data Corp., which was headquartered, like Syntactica, in Minneapolis.

The great problem with obtaining meaning from text is understanding the context in which that text appears, and this is where Syntactica's lexicon shines. This lexicon is a compilation of a meticulous word-by-word analysis of Webster's 3rd New International Dictionary, unabridged. This compilation considers the many different meanings and contexts of each individual word in the lexicon, and assigns a set of values to each word, which is a heck of a lot of work and explains why most competing products (there turn out to be a bunch) don't have it.

IReader turns out to be an adjunct to browsing. Run your mouse over any live URL and iReader pretty quickly returns three to four sentences describing the contents of that URL without your ever having to visit that web site. It's really useful. But as I wrote earlier it can be annoying, too, which is why I assigned a function key to turn the darned thing on and off, keeping it out of my face most of the time.

I wish the Syntactica people well (I do not own stock). They've done a good and difficult job, published parts of iReader as open source, created some useful APIs and built the whole darned thing as a web service that can be built into all sorts of gizmos. They also seem to think that iReader will become a popular alternative to browsing, especially since it comes minus the ads. I just don't see that.

My guess is iReader and Syntactica will be snapped up soon by a Google or Yahoo or one of the other usual suspects. And if not, then I predict the product will be used to create a whole new class of annoying web denizens -- web pages made up only of such abstracts as a kind of meta-search in the same way that so much of the web is now clogged with meta-ads meant to sucker us into clicking on them.

Personally, I prefer my web content -- like my women -- real, not inflated.

Just kidding, Mrs. Cringely.

Comments from the Tribe

Output generated for this article - judge for yourself... :)


# Following by true graphical interface and last decade lot of people viewing that graphical interface least in part through browser
# Large part for legal reasons intended to keep Googlers at bay
# Throw wording that carry no real information words like
# Adding vectors in column yielding single vector representing column in multidimensional space defined words articles in database

pete | Mar 08, 2007 | 5:10AM

Here's an example using http://www.google.com/firefox - Firefox default start page, "About Mozilla" link on that page:

* Global community dedicated to building free, open source software products and technologies

* Programmers, marketers, testers and advocates around world working to ensure that Web remaining open sharing public resource

* Open source software products and technologies offering free-of-charge to people everywhere over 40 languages

Pretty nice!

Charlie | Mar 08, 2007 | 11:02AM

This sounds very similar to a technology Google acquired: the Orion algorithm from Ori Allon. The release at that time said something like:

The results to the query are displayed immediately in the form of expanded text extracts, giving you the relevant information without having to go to the website.

So far we haven't seen an Orion release. I wonder what Google is doing with that algorithm!!

Ravish | Mar 08, 2007 | 12:51PM