Recently, Tristan Louis wrote two very interesting posts comparing Google's, Yahoo's, and Technorati's results when searching blogs, counting references to the Technorati Top 100 bloggers as well as to the long tail of 11.5 million bloggers out there. If you haven't read the posts, please go over and have a look at both of them. There's lots of useful data and analysis in there, and I think Tristan draws some thought-provoking conclusions from it.
However, I believe that Tristan's analysis raises a question that hasn't been asked yet: How accurate are the numbers that search engines report about the size of their result sets?
We put a lot of faith in the numbers that search engines report when trying to guess how popular something is. Google reports today that there are "about 624,000" results for "long tail". Yahoo reports "about 779,000" results. People quote these numbers as accurate statistics, and Tristan is using these numbers to do some comparative analysis of the coverage of Google's, Yahoo's, and Technorati's indexes. However, I'm having difficulty ascertaining the accuracy of these numbers. I've listed some examples below, and a simple how-to so that you can check for yourself with your favorite searches.
My questions about Tristan's conclusions concern not his analysis, but the underlying data that he starts with.
For example, when you search for all the results for "Tristan Louis" on Google, it reports "about 575,000". However, you can only navigate through 703 results of the entire set. Perhaps this limit exists to keep their indexes small and in RAM (which means they can stuff more indexes onto a single machine). Perhaps, from a user (and business) perspective, their testing shows that almost no one except researchers will go past the first 5 pages of results.
But if you can only view 703 results of about 575,000, where are the other 574,297 results? The viewable results are only about 0.1% of what the estimate claims. Where's the missing 99.9% of the search results?
Yahoo search says that there are 890,000 results for Tristan Louis.
However, I can only see 1000 results. That's also only about 0.1% of the results that the estimate claims, roughly the same viewable-to-estimated ratio as Google. Where are the other 889,000 results?
I don't know whether Tristan's analyses are correct, or if they are simply reflecting the low viewable vs. estimated results ratios of Google's and Yahoo's search results. I would love to hear more from Yahoo and Google explaining the methodology behind their estimated results, and how users can access the full result sets for completeness and, frankly, for objective verification.
To be fair, these same questions must be asked of Technorati's results.
Searching Technorati for "Tristan Louis" currently shows 566 posts. Now, that's far fewer than Google's or Yahoo's estimated results, but not far from their viewable results. Technorati's results are sorted by time by default, so when you traverse the result set to the last page (results 560-566), you reach the 566th result, which is the earliest post in the timeline (250 days ago, as of the time of this post) that Technorati indexed and that matched the search term. So 100% of the reported result count (at least in this example) is viewable, giving a viewable to reported results ratio of 1.
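To make the arithmetic behind these ratios explicit, here is a minimal Python sketch. It only uses the counts quoted in this post; the names `observations` and `viewable_ratio` are my own illustrative choices, not anything the search engines provide.

```python
# Counts quoted in this post: (estimated or reported results, viewable results).
observations = {
    "Google": (575_000, 703),
    "Yahoo": (890_000, 1_000),
    "Technorati": (566, 566),
}


def viewable_ratio(reported: int, viewable: int) -> float:
    """Return the fraction of the reported result count that can be paged through."""
    if reported <= 0:
        raise ValueError("reported count must be positive")
    return viewable / reported


for engine, (reported, viewable) in observations.items():
    ratio = viewable_ratio(reported, viewable)
    unreachable = reported - viewable
    # ".2%" formats the fraction as a percentage with two decimal places.
    print(f"{engine}: {viewable:,} of {reported:,} viewable "
          f"({ratio:.2%}), {unreachable:,} not reachable")
```

With these inputs, Google and Yahoo both come out at roughly a tenth of one percent, while Technorati comes out at a ratio of 1. You can swap in the counts from your own searches to compare.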
Here are the steps in the experiment. You can try them for yourself to repeat and verify the results above, and to see what viewable to reported ratios you come up with for each search engine:
For Google:
- Go to Google's advanced search page.
- Use the pulldown on the right hand side to ask Google to return 100 results per page instead of the usual 10 results per page. Note that this doesn't affect the end results, but it will mean you'll have to do 1/10th the clicking to find the last result.
- Type in your search term and click on "Google Search"
- Look at the result page. Look for the top right hand side of the page, where Google reports "Results 1-100 of about XXX for YOURSEARCHTERM". Note the estimated set of results.
- Go to the bottom of the page. You'll see the Gooooooooogle graphic, with a set of result pages (usually 1-10)
- Click on the last result page (usually it will be the 10th page)
- Check the actual number of results that Google gives you. Note that you can't go any further.
- Rinse, lather, repeat for your favorite searches.
For Yahoo, here are the steps:
- Go to Yahoo's advanced search page
- Go to the bottom of the page and use the pulldown on the right hand side to ask Yahoo to return 100 results per page instead of the usual 10 results per page.
- Go back up to the top of the page, and put in your favorite search terms. Click on the "Yahoo! Search" button on the top right.
- Look at the result page. Look for the top right hand side of the page, where Yahoo reports "Results 1-100 of about XXX for YOURSEARCHTERM". Note the estimated set of results.
- Go to the bottom of the page. You'll see the Results Page: area, with a set of result pages (usually 1-10)
- Click on the last result page (usually it will be the 10th page)
- Note the actual number of results that Yahoo gives you. I usually find that Yahoo gives 1000 results, usually more than Google. Note that you can't go any further.
- Rinse, lather, repeat for your favorite searches.
For Technorati, here are the steps:
- Go to Technorati's home page
- Put in your favorite search terms. Click on the "Search" button.
- Look at the result page. Look right under the search box on the top of the page, where Technorati reports "XXX posts about YOURSEARCHTERM". Note the result set size. Subtract 10 from this number - you're going to need this in order to get to the last page of results. For the sake of this tutorial let's call this number YYY (YYY = XXX - 10)
- Go to the URL bar in your browser. It should say something like: "http://www.technorati.com/search/YOURSEARCHTERM"
- Add the following to the end of the URL: ?start=YYY where YYY is the number you calculated two steps back (the number of posts that Technorati reported, minus 10). Our example URL from the last step should now look like: "http://www.technorati.com/search/YOURSEARCHTERM?start=YYY"
- Note the actual number of results that Technorati gives you. Feel free to click through the results pages prior as well to verify that all the results are there.
- Rinse, lather, repeat for your favorite searches.
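The URL arithmetic in the Technorati steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `last_page_url` is hypothetical, the page size of 10 is inferred from the "subtract 10" step, and I'm assuming that spaces in the search term can be percent-encoded in the URL path.

```python
from urllib.parse import quote, urlencode

# URL pattern quoted in the steps above.
BASE_URL = "http://www.technorati.com/search/"
# Assumption: 10 results per page, inferred from the "subtract 10" step.
RESULTS_PER_PAGE = 10


def last_page_url(search_term: str, reported_count: int) -> str:
    """Build the URL for the last page of results (YYY = XXX - 10)."""
    # Clamp at 0 so very small result sets don't produce a negative offset.
    start = max(reported_count - RESULTS_PER_PAGE, 0)
    # Assumption: the search term is percent-encoded when placed in the path.
    return f"{BASE_URL}{quote(search_term, safe='')}?{urlencode({'start': start})}"


# The example from this post: 566 reported posts for "Tristan Louis".
print(last_page_url("Tristan Louis", 566))
```

For the example in this post, the offset works out to 566 - 10 = 556. You still need to load the page yourself and count the results by eye to confirm that the last one is really there.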
I hope that this initiates some discussion about these issues. I'm frankly interested in making sure that researchers like Tristan are accurately comparing apples to apples, and I'm all for additional transparency and verifiability in the results that all search engines provide. Am I missing something here? Can someone from Google or Yahoo help me to understand why their reported results are sometimes nearly 1000 times larger than their viewable results? I look forward to being educated.