Technorati Technical Update

It's been a while since I've been able to blog, but I've just had a minute or two to come up for air, and I wanted to mention the goings-on at Technorati.  Things have been really great.  It seems that lots of people find the service useful, and people are signing up for watchlists.  Unfortunately, what also happened was that the database schema backing the Technorati site was poorly designed (bad Dave!), which caused the website to slow to a crawl over the last few weeks.  The culprit was a really big table in the database called "links" with the following structure:

CREATE TABLE links (
       lid bigint NOT NULL AUTO_INCREMENT,
       bid bigint NOT NULL,
       linkedblog bigint NOT NULL,
       href VARCHAR(255) NOT NULL DEFAULT '',
       linktext VARCHAR(255) NOT NULL DEFAULT '',
       title VARCHAR(255) NOT NULL DEFAULT '',
       priortext VARCHAR(255) NOT NULL DEFAULT '',
       aftertext VARCHAR(255) NOT NULL DEFAULT '',
       created DATETIME,
       updated DATETIME,
       current CHAR(1) NOT NULL DEFAULT 'Y',
       PRIMARY KEY (lid),
       INDEX (href),
       INDEX (bid),
       INDEX (linkedblog),
       INDEX (created)
);


There were two big problems with this table - first, each record was way too big.  I thought I was getting the best of both worlds when I made all of my char() columns VARCHARs - only use the space you need, right?  More on that later.  The second problem was that Technorati was tracking over 6 million active links, so traversing the database was s-l-o-w.  And since I was using Linux on x86 and MySQL as the backend for the database, I was going to start bumping into the 2GB file size limit in short order.
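
If you want to watch a MyISAM table creep up on that limit, by the way, SHOW TABLE STATUS is your friend - here's the idea:

SHOW TABLE STATUS LIKE 'links';
-- Interesting columns in the output: Row_format ('Dynamic' when there
-- are VARCHARs, 'Fixed' when every column has a fixed width),
-- Avg_row_length, and Data_length - the size of the data file on disk,
-- which is the number to watch against the 2GB limit.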

Here's where some good old-fashioned database optimization techniques came into play.  One of the best things I learned about database design was to make it easy to calculate record offsets.  That means using fixed character field widths for tables that need fast lookups.  So, what I did was split the links table into two tables, and made some changes to the new links table:

CREATE TABLE links (
       lid bigint NOT NULL AUTO_INCREMENT,
       bid bigint NOT NULL,
       linkedblog bigint NOT NULL,
       href CHAR(127) NOT NULL DEFAULT '',
       created DATETIME,
       updated DATETIME,
       current CHAR(1) NOT NULL DEFAULT 'Y',
       PRIMARY KEY (lid),
       INDEX (href),
       INDEX (bid),
       INDEX (linkedblog),
       INDEX (created)
);

CREATE TABLE linkcontext (
       lid bigint NOT NULL,
       title VARCHAR(255) NOT NULL DEFAULT '',
       linktext VARCHAR(255) NOT NULL DEFAULT '',
       priortext VARCHAR(255) NOT NULL DEFAULT '',
       aftertext VARCHAR(255) NOT NULL DEFAULT '',
       PRIMARY KEY (lid)
);
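
By the way, the shuffle from the old table into the two new ones is just a pair of INSERT ... SELECTs.  Here's a rough sketch - I'm making up the name links_old for the renamed original table, and LEFT() handles the truncation of long hrefs:

RENAME TABLE links TO links_old;

-- (create the two new tables as above, then:)

INSERT INTO links (lid, bid, linkedblog, href, created, updated, current)
SELECT lid, bid, linkedblog, LEFT(href, 127), created, updated, current
FROM links_old;

INSERT INTO linkcontext (lid, title, linktext, priortext, aftertext)
SELECT lid, title, linktext, priortext, aftertext
FROM links_old;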


You'll notice what I did - I pulled out all of the information that wasn't in an index and put that data into the linkcontext table.  I made sure that there was a unique key (lid) identifying the data for a particular link, and I kept the space-saving VARCHAR format for the context columns.  Essentially, the index file for the linkcontext table is simply a set of pairs - the lid of the link, and the offset of the corresponding linkcontext record.  I did lose a bit of flexibility with this system - notice that the link href now has a maximum size of 127 characters, down from a theoretical maximum of 255 earlier.  I decided to take this tack because the number of URLs longer than 127 characters was less than 0.01 percent of the URLs in the database, so I figured it was an acceptable loss.

Actually, I originally tried cutting the href column down to 63 characters, but that unfortunately cut out a bunch of interesting sites, like Reuters, which use long URLs to designate topics and types of headlines.  So, 127 was the right number.
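
Checking a candidate width is a one-liner, too - something like this hypothetical query, run against the old table (links_old from the sketch above), counts how many hrefs a given CHAR size would truncate:

SELECT SUM(LENGTH(href) > 127) AS would_truncate, COUNT(*) AS total
FROM links_old;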

The biggest changes came to the links table - note that every column now has a fixed width.  What that means is that MySQL no longer has to chase variable-length records around the data file when inserting or deleting - it can find any record with a quick multiplication of the record number by the total number of bytes per record.
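
To put rough numbers on that (these are back-of-the-envelope MyISAM sizes - 8 bytes per BIGINT and DATETIME, one byte per CHAR character in a single-byte charset - not figures I've measured):

-- 3 BIGINTs + CHAR(127) + 2 DATETIMEs + CHAR(1), plus a byte or so
-- of per-row bookkeeping:
SELECT (3 * 8) + 127 + (2 * 8) + 1 AS approx_row_bytes;  -- 168
-- With a fixed row length, record number N lives at byte N * row_length
-- in the data file: one multiplication, no hunting around.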

But wait, what does all this mean when doing SELECT calls?  Doesn't it mean that I have to SELECT information from two tables instead of one?

Yes, that's true, but again, we can take advantage of the power of a JOIN.  Here's how I extracted data from the links table before:

SELECT lid, bid, href, linktext, priortext, aftertext,
       UNIX_TIMESTAMP(created) AS created,
       UNIX_TIMESTAMP(updated) AS updated
FROM links
WHERE href LIKE 'http://www.sifry.com/%'
  AND current='Y'
ORDER BY created DESC;

Slow stuff, because even with the href index, every match meant chasing a fat, variable-length record around the data file.

Here's what it became:

SELECT links.lid AS lid, bid, href, linktext, priortext, aftertext,
       UNIX_TIMESTAMP(created) AS created,
       UNIX_TIMESTAMP(updated) AS updated
FROM links, linkcontext
WHERE href LIKE 'http://www.sifry.com/%'
  AND current='Y'
  AND links.lid = linkcontext.lid
ORDER BY created DESC;

No need for two SELECT calls - or any change to the business logic following the SQL.  By putting the join condition as the very last item in the WHERE clause, I've reduced the number of rows that match the WHERE to an absolute minimum before matching up the links with the related linkcontext data.  The speedup?  From minutes per query to seconds (or less).
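
And if you want to double-check that MySQL is really using the indexes the way you hope, EXPLAIN on a trimmed-down version of the query will show you the plan:

EXPLAIN SELECT links.lid, bid, href, linktext, priortext, aftertext
FROM links, linkcontext
WHERE href LIKE 'http://www.sifry.com/%'
  AND current='Y'
  AND links.lid = linkcontext.lid;
-- You want the href index showing up under 'key' for links, and
-- linkcontext resolved through its PRIMARY key on lid.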

I made some other changes to the codebase, including a number of bug fixes and content changes, but I'll discuss those in another blog post...