Archive for the 'backup' Category

The Essentials of Obsessive Backups

Rounding out a small diversion down the path of personal data backup, I thought I would document my backup philosophy and scheme. Now granted, most would think I’m absolutely over the top for the intricate plan I’ve devised over the years. Suffice it to say, I’ve thought about these details a lot and finally feel like I’m at the sweet spot between data availability and data security.

That last point is important. Your data could be replicated across every machine on the planet making it very available, but obviously very insecure. I take the challenge of finding the correct balance very seriously.

The first pillar of the philosophy is to isolate the data that should be backed up from the data that doesn’t need to be. Typically the first thing I do when I get a new machine is partition it into 3 or 4 drives. The C drive is left to anything that was pre-installed (operating system, shareware, etc.). I leave some extra space as a buffer here because some apps insist on being installed on C or create temporary files that live on the C drive. The D drive is for applications I’ve installed, with the exception of games. And all data, regardless of what application it’s from, goes to the E drive. Usually games and pictures (12 gigs and counting) go to the F drive.

Over the years this isolation has worked in my favor a couple of times. There were times that I had to re-install the OS and was thrilled to find my E drive with all data still intact. There were times a bad game install hosed the F drive but left the others untouched. In short, drive partitioning is a must. In ancient times, the process was a little harrowing and not to be done carelessly. It’s gotten a lot easier and safer now, so there’s no excuse.

The next pillar is that backups must be automated. A backup that is not automated is almost useless, as you’ll probably do it for the first couple of weeks and then quickly lose interest. There are a ton of applications that can help with this task. I rely on a mix of SyncBack and rsync, depending on the target of the backups (more on this below).

The third pillar is having a reliable, simple, accessible offsite backup. It must be reliable for obvious reasons. It must be simple because a complicated interface or API (I’m looking at you, S3) only makes it less likely that I’ll work through the frustrations when things go wrong. It must be accessible so I can get my data from any machine at any time. And it must be offsite because a fire or theft could easily compromise my home machine. I found all four of these with my current backup host. I could write endlessly about its majesty, but I’ll summarize with these short points:

  • I don’t have to install any proprietary client-side apps, such as the ones iBackup and others make you install. That’s one obstacle to data accessibility removed.
  • Since it supports SFTP, SCP, rsync, unison, and subversion, it will work on either a PC, Mac, or *nix machine. Another obstacle removed.
  • It’s cheap. Not as cheap as S3, but pretty cheap ($1.60/gig).
  • They have great customer support, with a privacy policy that puts the customer first
  • Since they support rsync (and the other protocols listed above), they are very developer-friendly. And because SFTP is supported, I can use a client like WinSCP when I want a GUI.

Obviously this isn’t for everyone. I wouldn’t suggest it for my Aunt Millie, but for me it’s about as good as it gets.

With those pillars in place, I’ve set up the following backup scheme:

  1. Core data, including Quicken files, Word docs, and source code, gets backed up to the offsite host every night. Additionally, the Quicken file is encrypted using TrueCrypt for additional security.
  2. Pictures get backed up to a Dreamhost account, which gives me plenty of space to spread out. Additionally, I’ve hacked Plogger to display the photos, making this account double as a photo gallery for friends and family. Since this data isn’t critical, it’s not important to me if it gets compromised for some reason.
  3. Core data is also backed up to a USB key I keep on my keychain. This provides additional data accessibility while incurring no additional security risk, since the entire set of data is encrypted with TrueCrypt.
  4. Most recently I purchased a $60 USB hard drive that is connected to my home machine. This backs up all data and photos every hour. The reason for this is that in the case of data loss it would be a lot easier to restore from the USB drive than to download everything from the offsite host or Dreamhost. Also, it provides a clear data transfer path when the time comes to move to a new machine.
  5. All the data on my E drive is also kept in a Subversion repository. Data versioning is a little different from backup. The goal here is to make sure that if some file becomes corrupted I can roll back to a previous state. This is not ensured by most backup schemes, where only one version of each file is kept. The Subversion repository also happens to be backed up to the offsite host, the USB key, and the USB hard drive. Again, just in case.
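Item 5 is the piece that surprises people, so here’s a rough sketch of the rollback it buys you. The repository location and file name are made up for the example:

```shell
# Create a throwaway repository and working copy, commit two versions
# of a file, then roll the working copy back to the first revision.
svnadmin create repo
svn checkout -q "file://$PWD/repo" work
echo "v1" > work/notes.txt
svn add -q work/notes.txt
svn commit -q -m "first version" work
echo "v2" > work/notes.txt
svn commit -q -m "second version" work
svn update -q -r 1 work    # roll back to the earlier state
cat work/notes.txt
```

A plain mirror-style backup would have silently overwritten “v1” the hour after the file changed; the repository keeps every committed state.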

I feel good about the logic here, but I’m constantly thinking about whether I’ve done too much or not enough. Admittedly, that’s obsessive.


Automatically Backup Your Data from Online Services (Part II)

In my previous post I advised that if you must use an online service, make sure the service offers a means to export your data so you can back it up. I wrote a Perl script (read source, download) as a means to automatically extract my data from the web sites I used frequently. The first two candidates were probably Bloglines and Furl, although I don’t use either of those any more.

I designed the script to expect any number of “jobs”, as I called them. A job might be to fetch your bookmarks, or to get a dump of a local MySQL database, or to send the contents of the script itself in case I updated it during the day (mindblowing… wrap your head around that). The jobs can be seen at the top of the file.

In most cases, I use wget to get remote files. It’s tailor-made for this kind of application. For instance, online services typically require that you be logged in to export your data (a reasonable request). They determine you are logged in by checking the cookies you pass them in the request. So once you figure out what cookies a site sets to determine whether you are logged in, you can copy those cookies and pass them to wget with the “--header” parameter. (In the couple of years running the script I’ve never had to update the cookie values, which probably says more about the login policies of large internet sites than anything else.)
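As a concrete sketch of the cookie trick: the cookie name, value, and URL below are all invented placeholders (not real values from any service), and the command is only assembled and printed here so nothing is actually fetched:

```shell
# A session cookie copied from the browser after logging in (placeholder)
COOKIE='session_id=abc123'
URL='https://example.com/export'
# Pass the cookie via --header so the export URL sees a logged-in session
CMD="wget --quiet --header \"Cookie: $COOKIE\" -O export.opml $URL"
echo "$CMD"
```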

Once the script has compiled all the data from the disparate services, it emails me the updates. Since I only want emails when some of the data has changed, I instituted a quick check on the content of the data retrieved from each service. After I download the data I run a hash algorithm (sha1) on it. The hash is compared to the sha1 from the previous run, which is stored on the filesystem. If the hash values match, I know there haven’t been any changes to the data and it can be ignored (i.e. not emailed). If the values are different, I can assume there is a change and mail out the file, writing the new hash value to a file for comparison during the next run. (See the “get_old_digest”, “get_new_digest”, and “write_digest” routines.) I chose to do it this way so I wouldn’t need to store a copy of the data itself on my web server, which could potentially be compromised. Since sha1 reduces a large file to a small hash, it’s efficient in terms of data storage and easy to use in string comparison. And even if there are false positives every once in a while, it’s not a huge deal. The worst that will happen is that I get a copy of a file when I really didn’t need to.
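A shell rendition of that check might look like the following. The job name and file names are illustrative; the actual script does this in Perl with Digest::SHA1:

```shell
# Stand-in for a freshly downloaded export
printf 'example exported data\n' > export.opml

digest_file="bloglines_digest.txt"
new_digest=$(sha1sum export.opml | cut -d' ' -f1)
old_digest=$(cat "$digest_file" 2>/dev/null || echo none)

if [ "$new_digest" = "$old_digest" ]; then
    echo "unchanged: skipping email"
else
    echo "changed: mailing file"       # and remember the new hash
    echo "$new_digest" > "$digest_file"
fi
```

On the first run the digest file doesn’t exist, so everything counts as changed and gets mailed, which matches the “catch up in one shot” behavior described later.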

Each job must have a unique name. The name is used as a key in a nested hash table (e.g. “bloglines”). Each job can have a number of options associated with it.

  • command – the command that is used to retrieve the data (required). This can be anything that Perl can execute, including system commands (e.g. wget, cat, mysqldump).
  • outfile – what the name of the file should be when it’s attached to the email.
  • zipfile – used in addition to “outfile”, this option tells the script to zip up the output file before attaching it to the email and specifies what the name of the zipped file should be.
  • filter – Something I had to account for is that the data frequently has timestamps in it that represent when the data was requested. Since these are different each time the data is requested, the hash would always determine that the contents had changed. The script will ignore any lines in the data that match the value of the “filter” option before comparing the data from the current run to the data from the previous run.
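The filter idea fits in a couple of lines of shell. The “generated-at” pattern and file names below are invented for the example:

```shell
# Two runs of the same export, identical except for a request timestamp
printf 'generated-at: 2007-01-01\nreal data\n' > run1.txt
printf 'generated-at: 2007-01-02\nreal data\n' > run2.txt
# Strip lines matching the filter pattern before hashing, so the
# timestamp alone can't make the data look changed
d1=$(grep -v 'generated-at' run1.txt | sha1sum | cut -d' ' -f1)
d2=$(grep -v 'generated-at' run2.txt | sha1sum | cut -d' ' -f1)
[ "$d1" = "$d2" ] && echo "unchanged" || echo "changed"
```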

The script relies on Digest::SHA1 and MIME::Lite, which should be installed on most hosting accounts. I have the script on my hosting account and use cron to run the script nightly. If your hosting provider doesn’t allow command line access or you’re not sure how to do this, look through the control panel for an equivalent interface.
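For reference, a nightly crontab entry for a setup like this might look as follows. The script path and run time are assumptions, not the actual values:

```shell
# m  h  dom mon dow  command — run the backup script every night at 2:30 AM
30 2 * * * /home/username/bin/backup_jobs.pl
```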

The “GLOBAL VARIABLE DECLARATION” section has a number of options to customize. For instance, you can set “$test_only” to 1 if you want to see what the run would look like but not send the email. One last trick is that if you delete all the “_digest.txt” files in the $output_path, the script will assume you’re running it for the first time and send you the results of all the jobs. This is useful if you’ve lost track of the most recent version of each job and want to catch up in one shot.

I hope you find the script useful.


Automatically Backup Your Data from Online Services (Part I)

I am fanatical about backups. It borders on obsession. It didn’t stem from any major data loss; it stems from the fear of a data loss, which I guess is about the same but with more paranoia. This posed a unique issue with the advent of Web 2.0 applications, where the data is frequently stored on somebody else’s server. It took some time to work out a system that worked, but I’ve gotten it down to something of a science now and thought it’d be worthwhile to share.

There are a ton of useful services out there, but keep in mind that it’s your responsibility, not theirs, to make sure you have your data backed up. Services go out of business, change owners, have downtimes, go premium, etc. A little thought up front saves you from a frantic weekend of cutting and pasting screenfuls of data from an old service into the new one.

Below are a few rules I live by. As a preface, if you don’t have a hosted account, get one. They’re dirt cheap in most cases, and quickly becoming nearly essential. My hosted account is bang-for-the-buck the most useful service I pay for monthly. I’ve used A Small Orange for a number of years now and can highly recommend them. (If you happen to decide to use them, please consider thanking me by putting my name in the referrer box on the order form 🙂)

  1. Always prefer a cloned or good-enough version of software that can be installed in a hosted web account. For instance, Basecamp is a great application. But did you know there’s a pretty good knock-off called ActiveCollab that was free until version 0.7.1? You can probably still scrounge up a copy of that version (wink wink, nudge nudge). Even if you have to pay the $99 for a perpetual license for version 1.0.4, in my mind it’s still better to have access to and control over all your data.
  2. If you can’t find a hosted version, make sure the online service you select provides a means to export your data. Most of the big players like Google and Yahoo allow you to get backups of your data from inside the web application. If you know what you’re doing, you might want to make sure their service is compatible with something like curl or wget so you can call it from a script, which leads me to…
  3. Create a script to automatically pull all your data from each service. I’m a big believer in the motto that backups should run automatically, otherwise they’re probably useless. I just so happen to have created a Perl script to back up my data from the various online applications. The script runs every night on my hosted account and emails me the results. From there the possibilities are endless. For instance, if I sent them to my Google account I could keep them indefinitely and have the implicit ability to search for a particular version. I choose to just copy them to my hard drive and use Subversion to keep them versioned. The important thing is collating the data from the various services in one place in an automated fashion.

This post turned out a little longer than I expected, so I’ll plan to cover the actual script in the next post.