Wednesday, February 13, 2008

And this is how it all started...

(History & Future of the search)
(Edited from the Stanford Magazine article from 2004 – “Net Assets: How Stanford’s computer science department changed the way we get information” by Richard Brandt)

The Internet is today’s equivalent of the Alexandria library, with more than 500 billion web pages and growing. Making sense of this morass is more crucial than ever in a world that runs on information. The right—or wrong—intelligence affects decisions from running economies to going to war. It has become obvious that search technology is the single most important application on the Internet.

As it happened, 1994 was the year Netscape Communications released its web browser, transforming the esoteric Internet into the point-and-click World Wide Web. (People now use the two terms interchangeably.)

The first Stanford students to make a commercial success out of helping people find things on the Internet were David Filo and Jerry Yang, who started Yahoo! The venture was never a true search engine—a software program that pulls up web pages relevant to keywords the user types. Rather, it started simply as a hand-selected list of interesting websites called “Jerry’s Guide to the World Wide Web.” It evolved into “Yet Another Hierarchical Officious Oracle,” or Yahoo!, a portal offering hand-selected sites and free software deemed useful by Yahoo’s “domain” experts—the equivalent of Callimachus’s bibliography. To find other web pages, Yahoo! offered search engines licensed from other companies.

The Yahoo! story also began in 1994. As part of their Stanford doctoral coursework, Filo (MS ’90) and Yang (MS ’90) wrote a business plan based on their web guide. Students had to evaluate each other’s plans, and Brian Lent, a PhD student in the database group, gave Yahoo! a D-minus. Lent (MS ’95) thought the selection process should be automated rather than left to scores of hired experts finding the right sites as the web grew.

Let that be a lesson to anyone with ambitious plans for their research: you have to ignore a lot of naysayers. When Filo asked Lent if he would like to join Yahoo! as employee No. 1, in order to keep the founders on their toes with his skepticism, he laughed. “You couldn’t pay me enough money to work for a company called Yahoo!” he recalls saying at the time.

Still, Lent was at least partially right. By the late 1990s, almost all search engines had given up trying to make search a profitable enterprise and were busily transforming themselves into portals modeled after Yahoo! But after Google showed up in 1998, most of those portals went out of business, while Yahoo! spent about $2 billion buying search technology to add to its site. Microsoft eventually started creating its own search technology, hoping to release it sometime next year.

Throughout the 1990s, search engines primarily retrieved pages according to how many times given keywords were found on a site. It’s as simple an idea as alphabetizing scrolls, and no more innovative than Yahoo!’s approach. But these engines were easy to fool. For example, by simply typing “sex” over and over again in black type on a black background to make the words invisible, site programmers could attract a lot of hits from search engines, whether or not the site had anything to do with the topic people were looking for.
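
The weakness described above can be shown with a minimal sketch of keyword-count ranking. The function names and the two sample pages are made up for illustration and are not taken from any real engine; the point is only that a page repeating a term many times outranks a page that is genuinely about the topic.

```python
from collections import Counter


def keyword_score(page_text: str, query: str) -> int:
    """Count how many times the query terms occur in the page text."""
    words = Counter(page_text.lower().split())
    return sum(words[term] for term in query.lower().split())


def rank_pages(pages: dict[str, str], query: str) -> list[str]:
    """Return page names sorted by descending keyword count."""
    return sorted(
        pages,
        key=lambda name: keyword_score(pages[name], query),
        reverse=True,
    )


# Hypothetical pages: one genuine, one stuffed with a repeated keyword.
pages = {
    "honest": "a guide to jazz records and jazz history",
    "stuffed": "jazz " * 50 + "buy cheap watches here",
}

ranking = rank_pages(pages, "jazz")
print(ranking)
```

Because the score depends only on text the page author controls, the stuffed page wins regardless of what it is really about.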

When Google’s search engine was officially launched in December 1998, it was distinguished by one big attribute. It worked.

At its core is the PageRank system, invented by Larry Page (and named after him) while he was working on his PhD at Stanford. PageRank, which judges a site’s importance by analyzing outside links to it, was the first true innovation in search technology since the bibliography. It takes advantage of the unique properties of the web—the network of links that makes its name so apt.

Garcia-Molina, Page’s adviser, recalls how it all started. Page came into his office one day in 1995 to show him a neat trick he had discovered. The AltaVista search engine not only collected keywords from sites, but also could show what other sites linked to them. AltaVista did not exploit this link information in the way Google would, but Page suggested it would be a good way to rank sites. He reasoned that those with the most links probably were the most popular and would prove most useful to searchers: they should be listed first in the search results. He began creating his own software for analyzing links between sites.
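
Page's initial reasoning, that the sites with the most links pointing at them should be listed first, can be sketched as a simple inbound-link count. This is a simplification for illustration, not PageRank itself, which goes further and weights each link by the importance of the page it comes from. The tiny link graph and all names below are hypothetical.

```python
from collections import Counter


def inbound_link_counts(links: dict[str, list[str]]) -> Counter:
    """Count, for each page, how many distinct other pages link to it."""
    counts = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in set(targets):      # ignore duplicate links on one page
            if target != source:         # ignore self-links
                counts[target] += 1
    return counts


# Hypothetical web: each key links to the pages in its list.
links = {
    "hub": ["news", "wiki"],
    "blog": ["wiki"],
    "news": ["wiki"],
    "wiki": ["news"],
}

counts = inbound_link_counts(links)
# Most-linked pages first; ties broken alphabetically.
ranked = sorted(counts, key=lambda page: (-counts[page], page))
print(ranked)
```

Unlike keyword counts, these scores depend on what other sites do, which makes them harder for a single page author to inflate.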

Meanwhile Lent, the student who had all but failed Yahoo!’s business plan, had been working with Sergey Brin on a research project within the database group. In 1995, they decided to try a little associative data mining. This is the process of finding pieces of information that commonly occur together. Retailers use it to search through their sales records and determine whether different items are frequently bought at the same time by customers. (They can then place those products as far apart as possible in the store, hoping to lure customers into additional purchases.)
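
The retail example can be sketched as counting how often pairs of items appear in the same basket. This is only the simplest form of the idea, assuming a hypothetical `min_support` threshold and made-up baskets; real association mining adds further measures and scales to far larger item sets.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets: list[list[str]], min_support: int) -> dict:
    """Return item pairs that occur together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort so each pair has one canonical order; set() drops repeats.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical sales records, one list per customer purchase.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "cereal"],
    ["bread", "butter", "cereal"],
]

pairs = frequent_pairs(baskets, min_support=3)
print(pairs)
```

The same counting applies to words on web pages: replace baskets with pages and items with terms, such as an author's name and a book title.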

Brin and Lent worked on ways to find specific word combinations that often occurred together on the Internet, such as authors and their book titles. This required searching through masses of web data, so Brin wrote a “crawler” program—software that visits websites, summarizes their content and stores the data in a central location accessible to graduate students and search companies.
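
What a crawler does can be sketched in a few lines: fetch a page, record a summary, extract its links, and queue the ones not yet seen. The version below is an illustration under stated assumptions, not Brin's program. It crawls a small in-memory dictionary (`FAKE_WEB`, a made-up stand-in for the network), and the "summary" is just the page length and its outgoing links.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(fetch, start_url, max_pages=100):
    """Breadth-first crawl; fetch(url) returns HTML or None if unavailable."""
    store = {}                      # the central store: url -> summary
    queue = deque([start_url])
    while queue and len(store) < max_pages:
        url = queue.popleft()
        if url in store:
            continue                # already visited
        html = fetch(url)
        if html is None:
            continue                # dead link
        parser = LinkExtractor()
        parser.feed(html)
        store[url] = {"length": len(html), "links": parser.links}
        queue.extend(link for link in parser.links if link not in store)
    return store


# Hypothetical three-page web with one dead link.
FAKE_WEB = {
    "/home": '<a href="/about">About</a> <a href="/papers">Papers</a>',
    "/about": '<a href="/home">Home</a>',
    "/papers": '<a href="/missing">Lost</a>',
}

index = crawl(FAKE_WEB.get, "/home")
print(sorted(index))
```

Passing the fetch function in as a parameter keeps the sketch runnable offline; a real crawler would substitute an HTTP client and add politeness rules such as rate limits.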

He intended to call the crawler “Googol,” after the word coined by the 9-year-old nephew of mathematician Edward Kasner for the number 10^100, to reflect the enormous amount of data they were collecting. For two years, Lent recalls, they did not realize they were spelling the word incorrectly.

Later, Page combined his method of analyzing “back” links pointing to a given website with Brin’s web crawler, and their combined research moved under the Digital Library umbrella.

Lent, who had a tendency to wander back and forth between university research and corporate life, did not stick around to work with Page and Brin, a decision he confesses he regrets. But in early 1996, Lent explains, “We all said, ‘There will never be another Yahoo!’” Their research seemed purely an academic exercise. Lent was itching to get back into business, so he joined a start-up company.

But the Google search engine, first set up to troll through Stanford’s own web pages, was an immediate hit with students and faculty. Page and Brin became convinced of its commercial potential. With help from Stanford’s Office of Technology Licensing and a number of professors (see sidebar) they managed to get their company funded. To bring in revenue, they borrowed an idea from GoTo.com (later renamed Overture and acquired by Yahoo!), a sort of Yellow Pages search engine that went through ads, not websites. Google now simultaneously searches through websites and its own advertisers, listing the relevant ads next to the search results. This has become the most successful advertising approach on the Internet.

Page (MS ’98) and Brin (MS ’95) may have become yet another two PhD students to disappoint their mothers by dropping out of grad school to start a company. But the research they started continues at Stanford, officially encapsulated in a project known as WebBase. Using the techniques first developed by the Google founders, the core of WebBase is a huge archive of websites now stored at the San Diego Supercomputer Center. Researchers from Stanford and other universities around the world can download and work with information about millions of websites as they develop search and retrieval technology.

Stanford has continued to supply Google with brainpower and new ideas in search.

As for Lent, he has not given up. He got a call from Microsoft in 2003, telling him the company wanted “to kill Google,” he recalls. He considered joining the team, but decided that if Microsoft could do it, so could he. Lent is now an “entrepreneur in residence” at Silicon Valley venture capital firm Mohr, Davidow Ventures, putting together a start-up team that will tailor search to individuals’ interests.

Lent describes his quest as “a bit psychotic—I mean, who goes after Google?” But he thinks Google left him an opening. “I felt Google was stagnating,” he says. “Their core premise is still link analysis. But the other half of the equation is user behavior.” Lent has an algorithm he calls “Dynamic PageRank,” which adds the dimension of time to web searches in order to better determine people’s interests. How long do people stay on web pages; what hour, day or week are they most active; what ads do they most often click on; and what products do they most often buy? By tracking their interests and behavior, Lent thinks he will be able to give web searchers better results.

Because he “passed on two companies” that spun out of Stanford and became huge successes, Lent notes, “I need to give it a try. Google and Yahoo!, be warned.” Unless, of course, one of the companies becomes impressed enough to buy his start-up.

Google has already bought a company that was developing technology to personalize web searching. That company was founded, you guessed it, by a few Stanford computer science graduate students.

Glen Jeh was in the PhD program in 2003, working within the database group, when he co-wrote (with Professor Jennifer Widom) a prizewinning conference paper called “Scaling Personalized Web Search.” His approach to personalizing searches lets people specify their interests in advance. The catch is that adding individual preferences to web searches is computationally difficult. Since there are millions of users, each with separate criteria, there are simply too many permutations to quickly find all the websites that simultaneously match search terms, have the highest PageRanks and correlate with their lists of interests.

Jeh (MS ’03) came up with the idea of “partial vectors,” common preferences shared by many people. Sites that match many of these preferences are given higher priority even before anyone does a search, narrowing the field. Then when an individual does a search, his or her other preferences are calculated in. That can still require a lot of expensive computing power, though, so two other PhD candidates, Taher H. Haveliwala (MS ’01) and Sepandar Kamvar (PhD ’04), improved the efficiency of calculating Jeh’s partial vectors, and the trio set up a company called Kaltix last year. Google snapped it up within months.
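
The reason precomputing shared pieces can work is a linearity property of link-based ranking: a ranking personalized for a blend of interests equals the same blend of the rankings for each interest alone. The sketch below illustrates only that property. It is not Jeh's partial-vector algorithm, and the four-page graph, the "sports" and "news" preferences, and the function name are all hypothetical. It also assumes every page has at least one outgoing link.

```python
def personalized_pagerank(links, preference, damping=0.85, iterations=50):
    """Power iteration in which random jumps land according to `preference`.

    `links` maps each page to its outgoing links (none may be empty);
    `preference` maps each page to a jump probability summing to 1.
    """
    rank = dict(preference)
    for _ in range(iterations):
        nxt = {page: (1 - damping) * preference[page] for page in links}
        for page, outlinks in links.items():
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                nxt[target] += share
        rank = nxt
    return rank


# Hypothetical web and two single-interest preferences.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
sports = {"a": 1.0, "b": 0.0, "c": 0.0, "d": 0.0}
news = {"a": 0.0, "b": 0.0, "c": 0.0, "d": 1.0}

# Computed once, ahead of time, and shared by all users.
sports_rank = personalized_pagerank(links, sports)
news_rank = personalized_pagerank(links, news)

# A user who cares equally about both: combine the stored results...
blended = {p: 0.5 * sports_rank[p] + 0.5 * news_rank[p] for p in links}

# ...instead of running the full computation for that user.
mixed_pref = {p: 0.5 * sports[p] + 0.5 * news[p] for p in links}
mixed_rank = personalized_pagerank(links, mixed_pref)

print(max(abs(blended[p] - mixed_rank[p]) for p in links))
```

Combining stored vectors is a handful of multiplications per page, whereas a full run per user repeats the whole iteration, which is why shared precomputation narrows the work so much.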

Some of Stanford’s computer science grads have stayed in academia, and continue to conduct research into the intricacies of web search. Junghoo Cho (MS ’97, PhD ’02) is an assistant professor at UCLA. He’s concerned about Google’s ability to alter the makeup of websites. Since a relatively small number of sites have the most links, and Google retrieves them first, those sites get visited more often and even more people link to them. Cho’s studies indicate that Google in effect drives more and more traffic to fewer and fewer sites.

Search technology research also continues at Stanford. Professor Andreas Paepcke, director of the Digital Library program, and several grad students are working on programs to search through digital photographs. Their technique combines data from the camera’s date/time stamps with information such as birthdays, holidays, vacations and major events—even data from Global Positioning System satellites—to help identify what photographs depict. This is the first step in searching through them.

Chris Manning, a professor in Stanford’s artificial intelligence group, is trying to get computers to understand “natural language,” with all its semantic subtleties, as it is used (and misused) by humans. One of Silicon Valley’s Great Tech Hopes is a “semantic web” that will allow computers employed by search engines and other sites to respond to questions written in plain English, or other languages. This is something the search site Ask Jeeves claims to do, but even Ask Jeeves executives admit their first versions were mainly a gimmick, simply picking out keywords in the questions people typed. The company is trying to improve that technology.

Stanford’s significant role as originator of search technology may be winding down, though. For one thing, this academic year will be the last for Digital Library funding. And leading research is moving into corporations, now that Google has demonstrated how profitable it can be. “We’ve been discussing the question of whether there’s anything new to do in search,” says Garcia-Molina. “With all these big companies out there, what can we do?”

Professor David Cheriton, an early investor in Google, puts it more bluntly. “When you have something like Google occur, where you can hire a bunch of great researchers all motivated by stock options, it’s hard for pure research organizations like universities to compete.”

Did anyone say, “There will probably never be another Google?”
