Automatically Backup Your Data from Online Services (Part II)

In my previous post I advised that if you must use an online service, make sure the service offers a means to export your data so you can back it up. I wrote a script (read source, download) to automatically extract my data from the web sites I used frequently. The first two candidates were probably Bloglines and Furl, although I don’t use either of those any more.

I designed the script to expect any number of “jobs,” as I called them. A job might be to fetch your bookmarks from a bookmarking service, to get a dump of a local MySQL database, or to send the contents of the script itself in case I updated it during the day (mindblowing… wrap your head around that). The jobs are defined at the top of the file.

In most cases, I use wget to fetch remote files. It’s tailor-made for this kind of application. For instance, online services typically require that you be logged in to export your data (a reasonable request). They determine you are logged in by checking the cookies you pass them in the request. So once you figure out which cookies a site uses to decide whether you are logged in, you can copy those cookies and pass them to wget with the “--header” parameter. (In the couple of years I’ve been running the script I’ve never had to update the cookie values, which probably says more about the login policies of large internet sites than anything else.)
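As a rough sketch of that technique (the script itself is Perl; this is a Python rendering, and the URL and cookie values are made-up placeholders, not real ones):

```python
def wget_command(url, cookies, outfile):
    """Build a wget argv that passes copied login cookies along.

    The cookie string would be lifted from your browser after logging in;
    the values used below are placeholders for illustration only.
    """
    return [
        "wget", "--quiet",
        "--header", "Cookie: " + cookies,  # pretend to be the logged-in browser
        "-O", outfile,                     # save the export locally
        url,
    ]

# Hypothetical example; the real script would hand this command to the
# shell (in Python terms, subprocess.run(cmd, check=True)).
cmd = wget_command("https://example.com/export",
                   "session_id=abc123; user=me", "bookmarks.xml")
```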

Once the script has compiled all the data from the disparate services, it emails me the updates. Since I only want emails when some of the data has changed, I instituted a quick check on the content of the data retrieved from each service. After I download the data I run a hash algorithm (sha1) on it. The hash is compared to the sha1 from the previous run, which is stored on the filesystem. If the hash values match, I know there haven’t been any changes to the data and it can be ignored (i.e. not emailed). If the values are different, I can assume there is a change and mail out the file, writing the new hash value to a file for comparison during the next run. (See the “get_old_digest”, “get_new_digest”, and “write_digest” routines.)

I chose to do it this way so I wouldn’t need to store a copy of the data itself on my web server, which could potentially be compromised. Since sha1 reduces a large file to a small hash, it’s efficient in terms of data storage and easy to use in string comparisons. And even if there’s a false positive every once in a while, it’s not a huge deal. The worst that can happen is that I get a copy of a file when I didn’t really need it.
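In outline, the digest check looks something like this (a Python sketch of those Perl routines; the per-job digest file naming is my own assumption):

```python
import hashlib
import os

def has_changed(job_name, data, digest_dir="."):
    """Compare this run's sha1 to the one stored from the previous run.

    Mirrors the get_old_digest / get_new_digest / write_digest idea from
    the post; only hashes are stored, never the data itself.
    """
    digest_file = os.path.join(digest_dir, job_name + "_digest.txt")
    new_digest = hashlib.sha1(data).hexdigest()

    old_digest = None
    if os.path.exists(digest_file):
        with open(digest_file) as f:
            old_digest = f.read().strip()   # get_old_digest

    if new_digest == old_digest:
        return False                        # no change; skip the email

    with open(digest_file, "w") as f:
        f.write(new_digest)                 # write_digest, for the next run
    return True
```

On the first run there is no stored digest, so every job looks “changed” and gets mailed out, which is exactly the catch-up behavior described at the end of the post.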

Each job must have a unique name. The name is used as a key in a nested hash table (e.g. “bloglines”). Each job can have a number of options associated with it.

  • command – the command that is used to retrieve the data (required). This can be anything that Perl can execute, including system commands (e.g. wget, cat, mysqldump).
  • outfile – what the name of the file should be when it’s attached to the email.
  • zipfile – used in addition to “outfile”, this option tells the script to zip up the output file before attaching it to the email, and specifies what the name of the zipped file should be.
  • filter – Something I had to account for is that the data frequently contains timestamps that record when the data was requested. Since these are different on every run, the hash would always indicate that the contents had changed. The script will ignore any lines in the data that match the value of the “filter” option before comparing the data from the current run to the data from the previous run.
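Put together, a job table along those lines might look like this (a Python sketch; the real script is Perl, and the commands, URLs, and filter pattern here are invented placeholders, not copied from it):

```python
import re

# Hypothetical job table: option names are from the post, values are made up.
jobs = {
    "bloglines": {
        "command": "wget -q -O - https://example.com/export",  # required
        "outfile": "bloglines.opml",
    },
    "database": {
        "command": "mysqldump mydb",
        "outfile": "mydb.sql",
        "zipfile": "mydb.sql.zip",            # compress before attaching
        "filter": r"^-- Dump completed on",   # ignore mysqldump's timestamp line
    },
}

def apply_filter(data, pattern):
    """Drop lines matching a job's 'filter' regex so a timestamp alone
    doesn't make the digest look like the data changed."""
    return "\n".join(
        line for line in data.splitlines() if not re.search(pattern, line)
    )
```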

The script relies on Digest::SHA1 and MIME::Lite, which should be installed on most hosting accounts. I have the script on my hosting account and use cron to run the script nightly. If your hosting provider doesn’t allow command line access or you’re not sure how to do this, look through the control panel for an equivalent interface.
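For reference, the nightly cron entry is a one-liner along these lines (the path and time of day are hypothetical; adjust them for your account):

```
# min hour dom mon dow  command -- run the backup script at 3:00 a.m. nightly
0 3 * * * /usr/bin/perl /home/me/bin/backup.pl
```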

The “GLOBAL VARIABLE DECLARATION” section has a number of options to customize. For instance, you can set “$test_only” to 1 if you want to see what the run would look like without actually sending the email. One last trick: if you delete all the “_digest.txt” files in the $output_path, the script will assume you’re running it for the first time and send you the results of all the jobs. This is useful if you’ve lost track of the most recent version of each job and want to catch up in one shot.

I hope you find the script useful.
