Thoughts on Google’s patent… "Information retrieval based on historical data."
Google’s newest patent application is lengthy. It is interesting in some places and enigmatic in others. Less colourful than most end user license agreements, the patent covers an enormous range of ranking analysis techniques Google wants to keep under its control. Some of the ideas and concepts covered in the document are almost certainly worked into the current algorithm running Google. Some are being worked in as this article is being written. Some may never see the blue light of electrons but are good enough ideas that Google evidently considered it wise to patent them. Google’s not saying which is which. While not exactly War and Peace, it’s a pretty complex document that gives readers a glimpse inside the minds of Google engineers. What it doesn’t give is a 100% clear overview of how Google operates now or how the various ideas covered in the patent application will be integrated into Google’s algorithms. One interesting section seems to confirm what SEOs have been saying for almost a year: Google does have a "sandbox" where it stores new links or sites for about a month before evaluation.
Google is in the midst of sweeping changes to the way it operates as a search engine. As a matter of fact, it isn’t really a search engine in the strict sense of the word anymore. It isn’t really a portal either. It is more of an institution, the ultimate private-public partnership. Calling itself a media company, Google is now a multi-faceted information and multimedia delivery system that is accessed primarily through its well-known interface found at www.google.com.
Google is known for its from-the-hip style of innovation. While the face is familiar, the brains behind it are growing and changing rapidly. Four major factors (technology, revenue, user demand and competition) influence and drive these changes. Where Microsoft dithers and .dll’s over its software for years before introduction, Google encourages its staff to spend up to 20% of their time tripping their way up the stairs of invention. Sometimes they produce ideas that don’t work out as expected, as was the case with Orkut, and sometimes they produce spectacular results, as with Google News. The sum total of what works and what doesn’t has served to inform Google what its users want in a search engine. After all, where the users go, the advertising dollars must follow. Such is the way of the Internet.
In its recent SEC filing, the first it has produced since going public in August 2004, Google said it was going to spend a lot of money to continue outpacing its rivals. This year it expects to spend about $500 million to develop or enhance newer technologies. In 2004 and 2003, Google spent $319 million and $177 million respectively. The increase in innovation spending corresponds with a near doubling of Google’s staff headcount, which jumped from 1628 employees in 2003 to 3021 by the end of 2004.
Over the past five years Google has produced a number of features that have proven popular enough to be included among its public-search offerings. On its front page, these features include Image Search, Google Groups, Google News, Froogle, Google Local, and Google Desktop. There are dozens of other features which can be accessed by clicking on the "more" button near the upper right of the screen. We believe that Google is working to tie all these features together to present its users with search options that are, for want of a better phrase, more relevant than those offered by its competitors. As the Internet and the technologies available to users advance, different types of files become searchable and therefore relevant to users. Take Google Video as an example. Now Google (and some of its competitors) can find and read text from closed captioning scripts. As well, quotes from recent episodes of virtually any TV show are searchable and can be served back to users alongside the clip where the quote originated. Now, imagine a merging of video, textual, graphical and audio files in organic search results. This is, in our opinion, the true intent of the ideas contained in the patent document.
The patent document relates primarily to sorting and cataloging organic search results. As we know them today, organic search results at Google are influenced by a number of factors, many of which involve an evaluation of incoming links. Google needs to assure its users and advertisers that it is capable of taking action against the darker facets of the search engine optimization sector. Recent stories in the mainstream press have left many with the impression that dark-art SEO and link-spamming are the surest way to get top placements. Google engineers take pride in their work, and the popularity of their organic search results is the bedrock on which their profitable business models are built. They can’t afford to allow link-spam and deceptive SEO techniques to dominate their organic listings, especially as these listings are about to address and catalog a much more robust and complicated Internet.
Over the past ten months, SEOs have complained about and questioned the phenomenon known as the Google Sandbox. The sandbox theory explains the time-lag between link-acquisition for a site and link-recognition and reward by Google. A few key sections of the patent document fill in the blanks for SEOs on what Google is examining when a finely crafted link-building campaign falls into the sandbox. The biggest influence is links, and Google is finding new and improved ways to evaluate them.
Google’s core algorithm is based on measuring links coming into a page. Because of this, link-building is part of any good search engine optimization campaign. In the span of a month, incoming links to one or more pages of a website might jump by hundreds or thousands. Some of those links might be useful in Google’s eyes and some might be useless. The question is, how does it sort which is which?
Google collects a lot of data when it examines a page and the links pointing to or from that page. When Google says it is using "historic data" to determine the value of links directed to your page, it is referring to a number of factors. It knows how long the page has been online, or at least when it first became aware of said page. It also knows how long the linked-to pages have been online. It knows how often links get clicked and also knows which computer (and in many cases, exactly who) is clicking the link and where the click is coming from.
For an example, check out the following sections from the patent document:
- A method for scoring a document, comprising: identifying a document; obtaining one or more types of history data associated with the document; and generating a score for the document based on the one or more types of history data.
- The method of claim 1, wherein the one or more types of history data includes information relating to an inception date; and wherein the generating a score includes: determining an inception date corresponding to the document, and scoring the document based, at least in part, on the inception date corresponding to the document.
- The method of claim 2, wherein the document includes a plurality of documents; and wherein the scoring the document includes: determining an age of each of the documents based on the inception dates corresponding to the documents, determining an average age of the documents based on the ages of the documents, and scoring the documents based, at least in part, on a difference between the ages of the documents and the average age.
- The method of claim 2, wherein the generating a score for the document includes scoring the document based, at least in part, on an elapsed time measured from the inception date corresponding to the document.
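The age-based claims quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the claims do not say how an age or an age difference is turned into a score, so the sketch stops at computing those two quantities. The function names, URLs and dates are hypothetical and do not come from the patent.

```python
from datetime import date


def ages_in_days(inception_dates, today):
    """Elapsed time since each document's inception date, in days."""
    return {url: (today - first_seen).days
            for url, first_seen in inception_dates.items()}


def age_differences(inception_dates, today):
    """Difference between each document's age and the group's average age."""
    ages = ages_in_days(inception_dates, today)
    average_age = sum(ages.values()) / len(ages)
    return {url: age - average_age for url, age in ages.items()}


# Hypothetical inception dates (when the engine first became aware of each page).
inception = {
    "example.com/a": date(2003, 1, 1),
    "example.com/b": date(2004, 1, 1),
    "example.com/c": date(2005, 1, 1),
}

ages = ages_in_days(inception, today=date(2005, 4, 1))
diffs = age_differences(inception, today=date(2005, 4, 1))
```

A real ranking function would presumably feed these values into a larger score alongside many other signals; whether older or newer than average is rewarded is not something the quoted claims settle.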
By the time a reader gets to item 63, the document has covered dozens of page, site, link and URL related factors that may or may not be included in the current working algorithm.
Here is a quick breakdown of "history factors" we think are relevant to Google’s algorithm today. Please note, each item might refer to a specific page and at the same time, also refer to all other pages associated with it.
- How long a domain or URL has been registered.
- Has ownership of a domain changed after previous registrations expired?
- Has the physical location of the registrant changed?
- How lengthy is the URL itself? Was it registered to game the index?
- How many pages are included in the website? (A one document or page website is not considered a highly relevant source of information.)
- Freshness and age of document.
- Use of anchor text (both on site and in links directed to site).
- "Trust factors" for the sites or pages that outbound links point to and that inbound links are found on.
- The "discovery date" of a particular link and the history of changes involving that link.
- Rate of growth for new links. A sudden burst of growth likely indicates some form of link-spam.
- Variations in anchor text used to phrase links directed to a page being evaluated. If the same anchor text is used in every inbound link, are they phrased that way for branding purposes or spamming purposes?
- Number of searches for keyword phrase associated with the anchor text used in links.
- Number of times Google users click on Google results after entering keyword phrases used in the anchor text of incoming links. Does the page being evaluated receive visitors for that keyword phrase on Google’s search engine?
- How do users actually behave while on the page, site or document being evaluated?
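Two of the factors listed above, the rate of growth for new links and the variation in anchor text, lend themselves to a rough sketch. The code below is our own illustration, not anything disclosed in the patent: the function names, the burst threshold of five times the earlier average, and the sample data are all assumptions.

```python
from collections import Counter


def has_link_burst(new_links_per_month, factor=5.0):
    """True if the latest month's new links exceed `factor` times the earlier average.

    `factor` is a hypothetical threshold chosen for illustration only.
    """
    if len(new_links_per_month) < 2:
        return False  # no history to compare against
    *earlier, latest = new_links_per_month
    baseline = sum(earlier) / len(earlier)
    # Floor the baseline at 1 so a site with no earlier links is not flagged
    # by a handful of new ones.
    return latest > factor * max(baseline, 1.0)


def top_anchor_share(anchor_texts):
    """Share of inbound links that use the single most common anchor text."""
    if not anchor_texts:
        return 0.0
    counts = Counter(text.strip().lower() for text in anchor_texts)
    return counts.most_common(1)[0][1] / len(anchor_texts)


# Hypothetical data: new inbound links discovered per month, and their anchor text.
steady = has_link_burst([10, 12, 9, 14])
burst = has_link_burst([10, 12, 9, 400])
share = top_anchor_share(["Cheap Widgets", "cheap widgets", "Acme Inc", "cheap widgets "])
```

A high share of identical anchor text does not by itself separate branding from spamming, which is exactly the question the list raises; a sketch like this only surfaces the pattern for further evaluation.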
There is a lot more to find in this document. Thus far, the more we explain, the more questions we have. One thing we are very sure about: the intent of the ideas covered in the patent extends beyond the search tool we know now. We expect to publish a white paper on our analysis of the patent and its implications early next week. Until then, we advise our clients to stay the course. We have long preached a very conservative approach to Google based on relevant link building (which can be slow going but very effective), highly stratified content that is relevant only to the topic addressed by the site, and clear paths based on multiple keyword phrases for spiders to follow.