Monday, June 8, 2009

How to restore your hacked WordPress database from Google Cache through Ruby

The first thing was downloading the information from Google. I began by doing a site search (site:reganmian.net/blog) on Google. I got several pages worth of hits, and used a wonderful Firefox plugin called Antipagination. It lets me right-click on the “Next” button at the bottom of the screen and choose “load all”. It then loads every single page of results one under the other in the same tab - it has to be seen to be believed. After all 7 pages of Google hits (50 hits per page) had loaded, I saved the page as an html file.

This html file of course contains a lot of superfluous material - all we really need are the URLs of Google’s cached versions of my pages. A quick way of picking out only the lines we want is grep, a command line tool. If we saved the page as filelist.html, we can for example run grep reganmian.net filelist.html > filelist2.html. This keeps only the lines that contain reganmian.net (the name of my website) and puts them in the new file filelist2.html (note that you should never have the same file on both sides of the > sign). If you want to exclude certain lines (for example, Google might also have listed a bunch of URLs ending in /trackback, /comments or /feed, which might not be useful to you), you can use grep -v, which outputs only the lines that do not match the search string (for example grep -v trackback filelist2.html > filelist3.html).
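The same filtering can also be done in a few lines of Ruby, if you would rather stay in one tool for the whole job. This is just a rough sketch, using the same file names as above:

    # Keep only the lines that mention the site, then drop trackback/comments/feed links.
    lines = File.readlines("filelist.html")
    lines = lines.select { |line| line.include?("reganmian.net") }
    lines = lines.reject { |line| line =~ /trackback|comments|feed/ }
    File.open("filelist3.html", "w") { |f| f.puts(lines) }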

Now I open filelist3.html in TextMate. TM is a great editor with a very powerful regexp system. Regular expressions are a very powerful way of manipulating text, but they can seem quite baffling at first. Luckily, you only need to understand a few simple ways of using them - and I find it a lot easier to do a few different regexps in TM to get what I need done than to try to craft “the ultimate” regexp that does everything at once. (This is also why I often use grep when I just want to select certain lines.)

Basically, we want to isolate the cache URLs, so we have to look through the file and see what unique piece of text comes immediately before the URL. We can then delete everything up to the URL. Go to find and replace and check “use regexps”. Let’s say that the line looks like this (very simplified):

Entry: Reganmian (cached) http://google.cache.com/reganmian (url) http://reganmian

We want the line to contain only http://google.cache.com/reganmian. First we remove the first part: search for ^.*cached\) and replace with nothing (ie. remove it). Then we remove the end: search for \(url\).*$ and again replace with nothing. ^ means the beginning of a line and $ the end, and the backslashes escape the literal parentheses, so the first pattern selects everything from the beginning of the line up to and including “cached)”, and the second selects everything from “(url)” to the end. Holding alt down while selecting with the mouse also enables block-select, which is a great way of removing text if the lines are aligned.
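The same two substitutions can also be scripted instead of being done interactively in TextMate. Here is a rough Ruby sketch that applies the two patterns above to every line and overwrites the file in place (the extra strip just removes leftover whitespace):

    # Strip everything up to "cached)" and everything from "(url)" onwards,
    # leaving only the cache URL on each line.
    lines = File.readlines("filelist3.html")
    cleaned = lines.map { |line| line.sub(/^.*cached\)/, "").sub(/\(url\).*$/, "").strip }
    File.open("filelist3.html", "w") { |f| f.puts(cleaned) }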

After playing with the different tools in TextMate to “clean up” the text, we should have a text file containing only a list of URLs, all of which we wish to download. Then it’s time for the superb tool wget. It’s extremely powerful, but here we only need a small subset of its features: wget -i filelist3.html will download every single URL listed in that file.

However, Google does not like “robots” downloading files, and based on the “user agent string”, which says “I am wget, and I would like this file”, it will refuse all the requests. Luckily, wget knows how to lie. With the slightly longer line wget --user-agent="Opera/9.25 (Windows NT 6.0; U; en)" -i filelist3.html, wget will pretend to be Opera, Google will believe it, and it will let us download the files. However, it’s not quite over yet, because Google also becomes suspicious if you download too many files at once. In my case, it stopped me after 150 files and called me a robot to my face. Luckily, a quick reboot of the cable modem gave me a new IP address, and I could get the last files. Another option might be to increase the delay between each download with wget’s --wait option.
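An alternative to wget is to do the downloading from Ruby with open-uri, which makes it easy both to set the user agent and to pause between requests. A sketch, where the one-second pause is just a guess at what keeps Google happy - and since it saves each page under a numbered name, it also sidesteps the renaming step described below:

    require 'open-uri'

    urls = File.readlines("filelist3.html").map { |l| l.strip }.reject { |l| l.empty? }
    urls.each_with_index do |url, i|
      # Pretend to be a normal browser, and wait a bit between requests
      # so Google doesn't decide we are a robot.
      html = open(url, "User-Agent" => "Opera/9.25 (Windows NT 6.0; U; en)").read
      File.open(format("%03d.html", i + 1), "w") { |f| f.write(html) }
      sleep 1
    end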

Now we have all the files. Unfortunately, they have horrible names based on the URLs, full of strange characters that don’t play well with other programs. I used Name Mangler to rename the files to 001.html, 002.html, etc. (Using Automator.app doesn’t seem to work, because of all the strange characters.)
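If you don’t have a renaming tool handy, a few lines of Ruby do the same job. A sketch, assuming the downloaded pages are the only files in the current directory:

    require 'fileutils'

    # Rename the awkward URL-based file names to 001.html, 002.html, ...
    files = Dir.glob("*").reject { |f| File.directory?(f) }.sort
    files.each_with_index do |old_name, i|
      FileUtils.mv(old_name, format("%03d.html", i + 1))
    end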

If we just wanted a small archive of our writings, we could stop at this point. However, we don’t simply want to preserve the old pages, we want to get the information back into the database so that WordPress can display it dynamically, by category, by date, etc. This requires us to extract the relevant information from each page and insert it into the WordPress database. For this, we will use Ruby.
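What the extraction looks like depends entirely on your theme’s HTML, but the general shape is a loop over the saved files that pulls out the title, date and body with a few regular expressions. A rough sketch - the class names in the patterns are made up for illustration and will almost certainly need adjusting to your own markup:

    # Very rough sketch: the patterns depend on your theme's markup.
    def extract_post(html)
      title = html[/<h2 class="entry-title">(.*?)<\/h2>/m, 1]
      body  = html[/<div class="entry-content">(.*?)<\/div>/m, 1]
      date  = html[/<span class="entry-date">(.*?)<\/span>/m, 1]
      { :title => title, :body => body, :date => date }
    end

    posts = Dir.glob("*.html").sort.map { |file| extract_post(File.read(file)) }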

The task is actually not that difficult, although it took me a few hours, both because I program in Ruby so seldom that I often forget the syntax for different commands, and because this kind of automatic text extraction is always a bit tedious. I could have tried to insert the data directly into the WordPress database with SQL, but it is much easier to use the metaWeblog XML-RPC interface to let WordPress update its own database (the same API that offline blog clients use).

Ruby has built-in support for XML-RPC, and I found some examples of using it with WordPress, but they didn’t show all the options, and I spent some more time figuring out how to include the categories (which you have to create first on the blog) and how to change the posting date (so it doesn’t look like everything was written today). One thing that metaWeblog doesn’t seem to support is changing the “slug” of a post (the short web-friendly name that is part of the URL). This mattered to me, because I really wanted to keep the same permalinks as before, since there are already places out there linking to certain posts. Luckily, almost all of the permalinks were generated automatically from the title when I wrote the posts, so with the right date the resulting permalink should be OK. But there are a few that I changed manually, and those will now have a different permalink.
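For reference, this is roughly what the posting side looks like with Ruby’s built-in XML-RPC client. It is a sketch, not my exact script: the URL, username, password and category are placeholders, posts is the array of extracted posts from the previous step, and the categories must already exist on the blog:

    require 'xmlrpc/client'
    require 'time'

    # Placeholders - use your own blog's xmlrpc.php URL and credentials.
    server = XMLRPC::Client.new2("http://reganmian.net/blog/xmlrpc.php")

    posts.each do |post|
      content = {
        'title'       => post[:title],
        'description' => post[:body],              # the post body
        'categories'  => ['Uncategorized'],        # must already exist in WordPress
        'dateCreated' => Time.parse(post[:date])   # keeps the original posting date
      }
      # metaWeblog.newPost(blog_id, username, password, struct, publish)
      server.call("metaWeblog.newPost", 1, "admin", "password", content, true)
    end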
