Archive for the 'intermediate' Category


Over-engineering is like Snoring

A lot of developer cycles are spent discussing the benefits of YAGNI and KISS. On the surface it would seem that there is an army of righteous developers fighting against the demons of over-engineering and maximum complexity. And despite our valiant battles, despite all the books and blog posts and rallying calls from respected technology professionals, the demons are still churning out bloated, impossible to maintain code.

I’ll let you in on a little secret. We are the enemy. Not just the guy who sits next to you, or the guy who churned out a mess of code and then left the company. You are the problem. I am the problem. The enemy is us.

Yes, we can all agree in principle that complexity is bad and simplicity is good. The problem is that complexity is completely subjective. Maybe you misjudged or were misinformed about how likely it is that a certain feature will be needed. Maybe you thought of some brilliant solution and you want to leave it as a placeholder in case you need to come back to it later (or so others can see how clever you are). Maybe you don’t want to do it the cheap way because you’re afraid others will snicker at your solution. Maybe you’re afraid a simple solution will lead to longer development times later on. Maybe your definition of simplicity is skewed. Whatever the case, no one sets out to over-complicate a piece of code. And yet it happens time and time again.

There are rules of thumb that can be followed. But what it boils down to is always discipline. It’s not easy to simplify. It sometimes feels wrong. But I’ve never looked at working code and cursed because it was too simple. I’m not even sure it’s possible for working code to be too simple. But it’s sure as hell easy for it to be too complex.

So why is over-engineering like snoring? Because no one thinks they do it. And yet somehow there is a market for snoring relief aids.


Adventures in SSL – Part II: Integration Strategy

In my first post about SSL integration on my site, I discussed how I came to a decision about a certificate issuer. I chose DigiCert, and have been very happy with them. One great bonus was their extensive list of instructions for setting up the certs on almost any web server known to man. So even though Part II of this series was intended to be about installation, I think DigiCert has that covered. Their instructions for nginx were spot on, so I wouldn’t be able to add anything meaningful to them anyway.

But buying and installing the certificate is a little different than using it. This post will focus on how I integrated the certificate into the site and what additional nginx configuration I had to make to support that strategy.

After kicking it around for a while I realized I really have 2 options. I can either convert the entire site to use https or convert as few pages as possible (e.g. just the login and register pages). The argument for a limited use of https is that all else being equal, the web server will require a little more CPU to encrypt/decrypt the https traffic. This is apparently an issue particularly with nginx as even the creator has said it can drag down performance for high-traffic sites. Since I’m not expecting Amazon-level traffic, this wasn’t as big a deal to me.

Another argument for limiting the use of https is that some low-cost CDNs, such as Amazon CloudFront, don’t support https traffic. This was a concern for me. I will eventually want to move my images, screencasts, stylesheets, and JS files to a CDN, so the fewer https pages I have the less of an issue this would be.

Related to this, some posts I read claimed that browsers will refuse to cache images, CSS, and scripts if they came across https. In my testing with Charles in Firefox and IE on Windows I did not experience that. In other words, any files that could be cached by the browser were cached. Yes, it was a limited test, but it covers a lot of the target base of my app. I believe either this used to be the case and no longer is, or it’s one of those old wives’ tales that people just assume is true but have never really taken the time to test.

I saw a couple of benefits for using https for the whole site. The first was that it simplified my application architecture. For instance, say you have a login page that’s intended to be served over https but it includes a common header image that’s present on all pages. That image has to also be served over https on the login page or the user will get a popup warning message that the page contains both secure and insecure content. That message is at least annoying if not scary to some users, so it’s best to avoid it by ensuring that the image is served up via https. But that means you may have a situation where you have 2 copies of that image so that it can be served up by both https and http. Or your configuration might become more complex in order to support 2 virtual servers pointing at the same image file on disk. Either way it’s a complicating factor that I wasn’t thrilled about wasting time on. If the entire site is served over https this issue goes away.

Secondly, it would be easier to configure than having only some pages be served via https. For instance, let’s say the login page is https. If someone asks for that page via http, the server should be nice and redirect them to https. But for almost all other pages it should allow regular http requests to process normally. These exceptions are easy to handle for one or two pages, but for more than a couple that quickly becomes difficult to manage effectively.

Lastly, my application is targeted at kids in the 10 to 15 years old range. For me, the more security the better. As with any site that relies on cookies to identify logged in users, it’s theoretically possible to hijack someone’s session via the cookie value, and if that were to happen it would lead to some seriously bad press for me. Again, if the entire site is accessed over https this issue goes away.

So as you can probably guess, I decided to serve the entire site over https. The big question I haven’t answered here is what effect this had on performance. I’ll discuss that in the final installment in this series. But for those also using nginx, below is an excerpt of the config changes I made to support this. It should be self-explanatory, but leave me a comment if you need any help through it.


# non-secure site - send all requests to https
server {
        listen 80;
        server_name www.mysite.com mysite.com;

        # permanent (301) redirect that keeps the requested path and query string
        return 301 https://www.mysite.com$request_uri;
}

# secure site
server {
        # the ssl parameter enables TLS on this listener
        listen 443 ssl;
        server_name www.mysite.com mysite.com;

        ssl_certificate /path/to/pem/file;
        ssl_certificate_key /path/to/key/file;
        .....
}

Facebook Status Updates and Infinite Session Keys

Anyone have the first clue as to why Facebook’s developer documentation sucks so hard?

I was developing a simple Facebook application for one of my company’s clients that required me to update a user’s status via a scheduled background process. The developer documentation led me down all kinds of paths by referencing infinite session keys and the “keep me logged in” check box. So I scoured the internets for some examples, only to find that there aren’t many. All these claims that bajillions of people are creating Facebook apps, and not a single one of them who is updating a user’s status offline can document it? ARRRGGG!

So, here is what I hope will save someone else a ton of time – a real life, working code sample for updating a user’s Facebook status offline. Careful – make no sudden moves or you might scare this rare beast back into hiding.

Our app is requesting two extended permissions – “offline_access” and “status_update”. This is also using Elliot Haughin’s Facebook plugin for CodeIgniter. Elliot’s package includes an older version of the Facebook PHP Library, so I had to grab the latest version from Facebook and drop it in place. Other than that it was easy to integrate this into my app.

//http://wiki.developers.facebook.com/index.php/Users.hasAppPermission
//must be one of:
//   email, read_stream, publish_stream, offline_access, status_update, photo_upload,
//   create_event, rsvp_event, sms, video_upload, create_note, share_item
if( $this->facebook_connect->client->users_hasAppPermission("offline_access", $fbUID) &&
    $this->facebook_connect->client->users_hasAppPermission("status_update", $fbUID) ){
    $this->facebook_connect->client->users_setStatus("some status message", $fbUID);
}

Seriously, that’s it! All those posts, all that searching – for 3 lines of code! The key point that was conveniently left out of other articles is that there is no “session key” required now. Facebook is smart enough to know that the user granted the app permission for offline_access and status_update, so you only need to send the user’s Facebook ID. Moley.

Another annoyance. They make a big deal out of the fact that they provide a REST-ful interface, but none of the examples in their documentation show the format of the REST request (although they do at least provide the REST server URL and a handy hint to include the “Content-Type: application/x-www-form-urlencoded” header). Yes, I get it, you want me to use the PHP Library, which is nicely designed. But for quick and dirty testing I like to whip up some curl commands. If I don’t know how to format the XML I can’t easily do that. Bah!


CodeIgniter…Meet Minify

NOTE: This post has an update that explains an improved technique. The technique below will still work (with some tweaks for CodeIgniter 1.7.1 or above), but is probably not preferred at this point.

As a followup to one of my previous posts I wanted to go through how I managed to get CodeIgniter and Minify to play nice with each other. Hopefully this will make someone else’s life easier. For those not using CodeIgniter this post might be either confusing or boring. Or both I guess.

My approach might seem code-heavy compared to other solutions but it has the virtue of requiring only a small change to a single file that would be included by all pages on your site. That’s typically not a problem since the first thing I do when I’m working on a site is to break out the common elements such as the <html> and <head> tags to their own included header file.

In CodeIgniter I created a library called MY_Includes.php (/system/application/libraries/MY_Includes.php). This is the core class that contains the mappings between each controller and the JavaScript and CSS files required by the view that the controller loads. Obviously this implies an extra step. If I create a new JavaScript or CSS file I can’t go into the globally included header file and add a <script> or <link> tag there – I have to edit MY_Includes.php to map the JavaScript or CSS file to that particular view. Yeah, it seems weird to edit a PHP file to add a CSS or JavaScript file, but there are a couple of different factors at work here and this solution made the most sense to me. The big win was that it helped integrate Minify into my codebase with very little effort.

You can see an edited version of MY_Includes.php here (Note: this is an old version). I wanted to walk through this code a bit to highlight the important parts, but hopefully it’s readable on its own.

First, you’ll notice the constructor requires the name of the controller that was invoked. I’ll show you how I get that later on, but essentially the whole class relies on that piece of information. My application is fairly linear in the sense that once I know the controller’s name I know (barring exception cases) which view will be invoked.

This in turn allows me to map controllers directly to JS and CSS files, which is why you’ll see the init method set up 2 hashes containing the JS and CSS files that I have access to, jsFilesHave and cssFilesHave. The key in the hash is a logical name I will use when adding the file to a view. This will improve readability and reduce errors and maintenance. The value in the hash is a string that specifies where the corresponding source file can be found. This is relative to the web root and is of a form that Minify understands. Whenever I create a new JS or CSS file I have to first add it to one of these hashes so that I can refer to it later in the file.

One other note on the init – I’m not sure if I needed to, but I found it easiest to break with the CodeIgniter way of doing things and issue a PHP include statement to tell the class where to find the Minify source in the below snippet from that method.

//from minify examples:
//Add the location of Minify's "lib" directory to the include_path.
ini_set('include_path', '/home/vdibart/minify/lib/.:' . ini_get('include_path'));
require 'Minify/Build.php';
require 'Minify.php';

After init, the constructor will call compileTags. This is the heart of the logic. You can see it populate the cssFilesNeed and jsFilesNeed hashes, first with the files that are common to all views and then the ones depending on which controller was invoked.
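
To make the shape of that logic concrete without the CodeIgniter plumbing, here is a rough sketch in Python. It is not the contents of MY_Includes.php: the logical names, file paths, and the compile_tags function are all made up for illustration. The only idea carried over is the one described above, a "have" hash per file type plus a lookup keyed by controller that builds the "need" list, common files first.

```python
# All names and paths below are hypothetical stand-ins, not the real MY_Includes.php.
CSS_FILES_HAVE = {
    "base": "/css/base.css",
    "forms": "/css/forms.css",
}
JS_FILES_HAVE = {
    "jquery": "/js/jquery.js",
    "validate": "/js/validate.js",
}

# Logical names needed by every view, then extras keyed by "controller/method".
COMMON = {"css": ["base"], "js": ["jquery"]}
PER_CONTROLLER = {
    "member/register": {"css": ["forms"], "js": ["validate"]},
}


def compile_tags(page_name):
    """Return the source paths a page needs, common files first."""
    available = {"css": CSS_FILES_HAVE, "js": JS_FILES_HAVE}
    extras = PER_CONTROLLER.get(page_name, {})
    needed = {}
    for kind, have in available.items():
        names = COMMON[kind] + extras.get(kind, [])
        # A misspelled logical name raises KeyError here instead of producing a dead link.
        needed[kind] = [have[name] for name in names]
    return needed


register_files = compile_tags("member/register")
other_files = compile_tags("home/index")
```

Referring to files by logical name is what buys the readability: a typo fails loudly at lookup time, and moving a file means touching one hash entry.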

Determining which controller was invoked is fairly straightforward. The following code is at the top of my globally included header file:

//for globally included header file
//so we know which CSS or JS files to include
$pageName = $this->uri->segment(1, 0);
$pageName .= "/" . $this->uri->segment(2, "index");
$this->load->library("MY_Includes", $pageName);
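
If it helps to see the segment lookups spelled out, this is roughly what those two calls compute, sketched in Python. The page_name function is hypothetical, and it assumes the defaults used in the snippet (0 for a missing first segment, "index" for a missing second one) and a URL without index.php in the path.

```python
from urllib.parse import urlparse


def page_name(url, default_method="index"):
    """Build "controller/method" from the first two path segments of a URL."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    # Mirrors segment(1, 0): fall back to 0 when there is no first segment.
    controller = segments[0] if len(segments) > 0 else "0"
    # Mirrors segment(2, "index"): fall back to the default method name.
    method = segments[1] if len(segments) > 1 else default_method
    return f"{controller}/{method}"


full = page_name("http://www.mysite.com/member/register")
no_method = page_name("http://www.mysite.com/member")
bare = page_name("http://www.mysite.com/")
```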

So if the URL was “http://www.mysite.com/member/register”, this code will pass “member/register” to the constructor of my class. Later on in the same header file I have the following 2 lines, which will extract the appropriate CSS and JS links:

<!-- for globally included header file -->
<link rel="stylesheet" href="<?= $this->CI->my_includes->cssTag(); ?>" type="text/css" media="screen" />
<script src="<?= $this->CI->my_includes->jsTag(); ?>" type="text/javascript" charset="utf-8"></script>

Switching back to the source code of MY_Includes.php, you can see those 2 methods invoke Minify to build the included files and then return a URL that can be used to retrieve the files. There’s a little bit of work in each of those to make the URL look like something that CodeIgniter will work with. So once the PHP executes the above tags will look like this in the final source code for the page:

<link rel="stylesheet" href="http://www.mysite.com/includetag/css/member-register/1222014216" type="text/css" media="screen" />
<script src="http://www.mysite.com/includetag/js/member-register/1222098068" type="text/javascript" charset="utf-8"></script>

So each rendered page on my site has only 1 CSS file and 1 JS file included. And those files are minified and cached. All of that is due to Minify. But you’ll notice there’s one piece of the puzzle still missing. The above <link> and <script> tags refer back to my site, and there has to be something that knows how to interpret that and return the appropriate CSS or JavaScript data. It turns out that “includetag” is a CodeIgniter controller that I created. I’ve included the source code here. There’s not a ton to mention about it. The class loads the exact same helper class MY_Includes.php that interfaces with Minify to retrieve the CSS or JS files and return them to the client.

Hopefully there’s enough to get you through to a working version. To summarize the steps:

  1. Download MY_Includes.php (here – see updated version) and put it in your /system/application/libraries directory
  2. Edit the init method inside of MY_Includes.php to include the correct path to your Minify installation
  3. Edit the init method inside of MY_Includes.php to include your CSS and JS files
  4. Edit the compileTags method inside of MY_Includes.php to include the correct files for each controller
  5. Download includetag.php (here) and put it in your /system/application/controllers directory
  6. Add the two code fragments commented with “for globally included header file” above to the appropriate file in your application
  7. Fire it up

Feel free to post a comment if you have trouble and I’ll walk you through it or edit the post to fix any errors as needed.


In Praise of Minify

Having read High Performance Web Sites, I figured I’d take a little time out of the development of new features on my side project to look at some basic performance issues. The first stop was YSlow, the Firefox plugin that works with Firebug to give you a simple report on how you rate on the Yahoo! performance scale. Mine being a tiny site, the report before any optimizations was decent but not great. There was definitely room for improvement so I figured I’d put some of the advice I’ve read recently into practice.

The first optimization was very easy. I made sure my images were sufficiently cached by adding a quick .htaccess file in the directory where my images are stored on the server. I saw 2 different techniques for doing this. One was based on file extension, such as the technique discussed here. The second was based on the file’s content-type, which was discussed here. On the margin the one based on content-type seemed a safer bet. That way if I have a file that’s incorrectly named it will still get cached.

The next step was to try to improve my JavaScript and CSS includes. As mentioned in High Performance Web Sites, the files should be minified in order to save bandwidth. They should have far future expires headers so that the browser doesn’t request them after the first visit. And the number of includes should be limited so that there are fewer requests that need to be made. Luckily someone much smarter than I already developed just about the perfect solution to all those issues and more. The Minify library for PHP is one of those pieces of code that does exactly what I was hoping it would do in exactly the way I was hoping it would. And to boot it required as little effort to integrate into my existing code base as could reasonably be expected. I recommend that anyone running even a small site on their own take a look at Minify. There’s absolutely no reason not to be using this wonderful little library. None. Go out right now and do it.

There was one snag in the process of integrating Minify with my project. As I’ve mentioned, I’m using the CodeIgniter framework. It turns out that Minify and CodeIgniter needed a little bit of coaxing to work together, but nothing that got too messy. I’m going to leave that discussion for my next post, which will hopefully not take 4 months to write :)


Wherein I Question the Usefulness of MVC

I decided to use CodeIgniter for a PHP project that I’m working on. CodeIgniter is an MVC framework, not too unlike CakePHP. At least I imagine they’re very similar, but I can’t say for sure as the reason I chose CodeIgniter over CakePHP was that the CakePHP documentation is a mess and I didn’t have time to wade through it. CodeIgniter has been fairly easy to work with so far. I’m sure there are tons of CodeIgniter reviews by developers like me out there, so I won’t bore you with that just yet (future post!).

This post is about Model-View-Controller (MVC) architecture. Like any developer, I’ve read countless retellings of why patterns and MVC are good for your code. True to form, I think those claims are overblown. I’ve worked with people that do everything “By the Book” and I’ve worked with people that hack everything together as best they can. Seeing both sides of it I honestly can’t say that one made my life any better than the other. Unstructured code, if kept reined in to some degree, can be incredibly flexible and allow you to be agile in the face of rapidly-changing priorities.

For instance, I’m not above having SQL statements in a JSP file. I don’t love it. I try to avoid it if it’s going to get messy. But I don’t think it’s something to be embarrassed about. I can’t tell you how many times I’ve been able to move a change out in minutes rather than weeks because I was able to tweak a query in the JSP. No, it’s not “By the Book”. But it works, and in the end that’s what you get paid for.

My general rule of thumb is that the closer to the end user your code is the more flexible it has to be. Consider the following range of technologies that flow from the user end to the server side: HTML/CSS, JavaScript, PHP/Java/Ruby, PL/SQL, database schema. HTML needs to be more flexible than Java, which needs to be more flexible than the database schema. So for every 1000 times you tweak your HTML or CSS, you might need to make a couple of changes to your backend Java. Sounds reasonable.

So coming back to MVC, one thing I’ve never understood is why the controller is responsible for selecting which view is invoked. This seems fundamentally flawed to me. In a language like Java the controller is a servlet compiled into a jar file somewhere. To change the behavior of that file you have to go through an entire release process: change code, test, promote to QA, test, promote to production, test. At MLB, a change like that took about 2 weeks from start to finish. (Obviously the situation is a little different if you use PHP, which is why I’ve decided to use an MVC framework for the PHP project).

In essence, it’s like the backend developers are saying “Move aside HTML, let the big boys make the call. We know better which file should be displayed”. You know what, they don’t and they shouldn’t. Yes, I know about Front Controllers. Yawn. Yes, I know you could easily write the system such that the flow through the views is configured using XML so it can be changed on the fly, as they did at MLB. Snore. Don’t get me started on XML for configuration. These are all solutions in search of a problem. These things can be done, but no one has really ever convinced me that they need to be done. Agility requires simplicity. Simplicity can’t be configured with XML.


The request_token Pattern

The idea behind the request token is another one of those simple-but-powerful patterns that I’ve come to rely on in various systems. I’ll jump right into an example of a case where I wanted to use it but alas I didn’t get to make the change before I left the job.

The architecture was a simple producer-consumer model. Some piece of the system was responsible for placing a row into a table and another was responsible for finding those rows and processing them. As it turns out, the system required many more consumers than producers, which I realize is not all that uncommon.

(Before you go screaming at me about “enterprise” solutions like Oracle’s Advanced Queuing or JMS, that’s not entirely the point. It’s incidental that this situation looks like a producer-consumer problem, but this pattern is more generally useful. So bear with me and think about how to apply it elsewhere.)

So, applying it to an email system where one piece of the system generates the emails and dumps them into a table and another piece of the system takes them out and sends them, you might have a table that looks like this:


CREATE TABLE email_jobs
(id NUMBER NOT NULL
,email_to VARCHAR2(255) NOT NULL
,email_subject VARCHAR2(255) NOT NULL
,email_body VARCHAR2(255) NOT NULL
,insert_ts DATE DEFAULT SYSDATE NOT NULL
,update_ts DATE
-- left NULL until a consumer has actually sent the email
,processed_ts DATE
);

You can imagine the consumer might wake up, ask for the oldest 10 items in the table, send them off in batch, and then go back to sleep. As you might expect, I had a recurring problem where 2 consumers were both attempting to pull the same item from the table and process it. In the above case, a bug like that might lead to the person getting 2 identical emails, which no one wants. There are other ways to protect against these kinds of things, but in reality you just want to ensure that no 2 consumers get the same item.

Enter the request token. With this, each consumer produces a unique identifier and marks the rows that it wants with that value. It then requests only the rows with that token, making it virtually impossible to have the same row processed by 2 different consumers.


CREATE TABLE email_jobs
(id NUMBER NOT NULL
,email_to VARCHAR2(255) NOT NULL
,email_subject VARCHAR2(255) NOT NULL
,email_body VARCHAR2(255) NOT NULL
,request_token VARCHAR2(255)
,insert_ts DATE DEFAULT SYSDATE NOT NULL
,update_ts DATE
-- left NULL until a consumer has actually sent the email
,processed_ts DATE
);

Notice the addition of the request_token column. On the application side:


//produces a unique number
$token = generate_token()

//mark some rows with the token - only where the request_token is already null - important!
UPDATE email_jobs SET request_token = $token WHERE <….find oldest rows…> AND request_token IS NULL

//do this so other consumers won’t see these rows
COMMIT

//go back and find the ones that you marked
SELECT ej.id FROM email_jobs ej WHERE ej.request_token = $token

Even if you have more than one process hitting that table, only one of them can claim a given row: once the first consumer’s UPDATE commits, the row’s request_token is no longer null, so a second consumer’s UPDATE skips it. Therefore, unless your application is sensitive to the number of rows each consumer processes, this is completely safe in that it won’t lead to multiple consumers processing the same row.
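
For anyone who wants to poke at the pattern without an Oracle instance handy, here is a minimal sketch using Python and an in-memory SQLite database. It is an illustration under assumptions, not the code from the system described above: the table is trimmed down, uuid4 stands in for generate_token(), claim_jobs is a name I made up, and “oldest rows” is assumed to mean ordering by insert_ts and then id.

```python
import sqlite3
import uuid


def claim_jobs(conn, batch_size=10):
    """Mark up to batch_size unclaimed rows with a fresh token and return their ids."""
    token = uuid.uuid4().hex  # stands in for generate_token()
    with conn:  # commits on success so other consumers see the claim
        conn.execute(
            """
            UPDATE email_jobs
               SET request_token = ?
             WHERE request_token IS NULL
               AND id IN (SELECT id FROM email_jobs
                           WHERE request_token IS NULL
                           ORDER BY insert_ts, id
                           LIMIT ?)
            """,
            (token, batch_size),
        )
    # Go back and find only the rows this consumer marked.
    rows = conn.execute(
        "SELECT id FROM email_jobs WHERE request_token = ? ORDER BY id", (token,)
    ).fetchall()
    return token, [r[0] for r in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE email_jobs (
           id INTEGER PRIMARY KEY,
           email_to TEXT NOT NULL,
           request_token TEXT,
           insert_ts TEXT DEFAULT CURRENT_TIMESTAMP NOT NULL)"""
)
conn.executemany(
    "INSERT INTO email_jobs (email_to) VALUES (?)",
    [(f"user{i}@example.com",) for i in range(25)],
)
conn.commit()

token_a, ids_a = claim_jobs(conn)  # first consumer
token_b, ids_b = claim_jobs(conn)  # second consumer gets a different batch
```

The two calls run one after the other here, so this only shows the bookkeeping. How truly concurrent UPDATEs behave depends on the database’s locking and isolation rules.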

In general, the request token pattern pre-marks some data so that it’s easy to find later on. Another example that I’ve used in the past is in account creation. What frequently happens is that you have to insert a row and then update it soon after. The problem is that the insert generates a new unique ID that the update needs to know, but sometimes doesn’t. My solution has been to pass a request token to the code that does the insert and then pass that same value to the code that does the update. As long as the request token is unique they should both be able to address the correct row.

At this point you might have the idea to create the request_token column with a UNIQUE constraint so that no two rows can have the same value. Not so fast. In an even more useful case, there have been times when I’ve had to create a bunch of rows and then manipulate them in bulk. So, for instance, create a bunch of new accounts and set their email address to the same value. Without a column like the request_token, you’d potentially have nothing to group them by except for an insert_ts or similar column. With the request_token, it becomes a very easy thing to do.
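
The bulk case can be sketched the same way. Again this is Python with SQLite purely for illustration; the accounts table, its columns, and the sample values are hypothetical. The point is that several rows share one token, which is exactly what a UNIQUE constraint would forbid.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE accounts (
           id INTEGER PRIMARY KEY,
           username TEXT NOT NULL,
           email TEXT,
           request_token TEXT)"""  # deliberately not UNIQUE
)
conn.execute("INSERT INTO accounts (username) VALUES ('existing_user')")

# Create a batch of accounts that all carry the same token.
batch_token = uuid.uuid4().hex
conn.executemany(
    "INSERT INTO accounts (username, request_token) VALUES (?, ?)",
    [(name, batch_token) for name in ("alice", "bob", "carol")],
)

# Later, address the whole batch by its token without knowing any generated ids.
updated = conn.execute(
    "UPDATE accounts SET email = ? WHERE request_token = ?",
    ("team@example.com", batch_token),
).rowcount
conn.commit()

untouched_email = conn.execute(
    "SELECT email FROM accounts WHERE username = 'existing_user'"
).fetchone()[0]
```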


What’s So Great about PL/SQL

I thought I’d start a loose series on PL/SQL for server-side developers. As a developer who has had to defend my use of PL/SQL in various systems over the years, I have some pretty strong feelings about what it brings to the table. I think of PL/SQL as a first-class language. That’s not to say that it can be used wherever Java or PHP are. What I love most about PL/SQL is that it fills some major gaps that Java and PHP (and most other traditional programming languages) have. When it comes to manipulating the database, anything Java can do PL/SQL can do better. In the context of the modern database application, that means that PL/SQL is an essential piece of any system.

Server-side developers in general have some serious hangups about PL/SQL. For one, it looks weird. What? No braces?!?!? Impossible!

Look, it’s not Java. Heck, it’s not even PHP. PL/SQL is its own beast, and you have to learn how to pet that beast so it doesn’t turn on you (and take your database down with it). If you think of PL/SQL as a simple means to tie some logic around DML operations (select/insert/update/delete) it begins to make a lot more sense. It’s not supposed to be elegant. It’s not supposed to require hours and hours poring over thick books with fancy titles. It’s supposed to help you build better database applications, and at this I believe it excels.

So what’s so great about PL/SQL? Here are my canned responses to that question whenever some upstart developer starts spewing the crap he read out of his textbooks:

  1. PL/SQL is compiled in the database. It always amuses me that a community like Java, which lives and dies by strong compile-time typing, is perfectly willing to let a major component of their application be loosely typed. You know all those JDBC calls/Hibernate mappings/iBatis queries? Little news for you Java dude. They’re completely unchecked. Put in terms you might understand – when you enter in a period, there’s no code assist to help you figure out how to complete the query. If I go in and modify the database in a few discrete ways your app will crash and burn. And you probably won’t realize this until a user sends a nasty email about why they can’t access the product they purchased. Not the case with PL/SQL. Since Oracle keeps them compiled in the database, you (or more likely the DBA) will know immediately if something changes in such a way that breaks the procedure or package.
  2. Since they’re compiled in the database they will run orders of magnitude faster than the corresponding queries requested by a client application. The important concept here is called context switching. In short, it turns out all those trips back and forth to the database tend to slow things down. It’s much much quicker to bundle up related queries in a procedure and make one call to the procedure. I once had an argument with a Java developer about result set sorting. He was convinced that it was much faster to sort a list of objects in Java than it would be to have the database do the ORDER BY and return the results. I like the guy, but that’s just insane. The overhead of fetching each of those rows and then doing some lame bubble sort on them is astronomical. But this is the kind of thinking that infests the server side community. It’s born out of ignorance, sometimes willful, of what a database can do.
  3. Another benefit of being in the database – they can be used by any client, not just ones written in Java (or PHP, etc.). When they talk about code reuse Java developers apparently don’t consider these kinds of issues. I’m sure it’s a wonderful learning experience to write a shipping cost calculator in Java, PHP, and JavaScript, but wouldn’t it make more sense to write it in PL/SQL once and then use it everywhere? Just a thought.
  4. Believe it or not, most of the good DBAs I’ve worked with prefer complex logic to be wrapped up somewhere they can keep an eye on it. Remember, if something breaks at 3am they’re the ones that will get paged. Having all that business logic tucked away in a jar file somewhere makes them nervous. And when things do go bad they can help a lot more when the code is in the database. It’s better for everyone.
  5. It takes about 15 minutes to learn enough PL/SQL to export some logic to the database. Sure, PL/SQL goes deeper than that, but any curly-brace type programmer should be able to absorb the concepts easily.

Now, I’m not going to say it’s all win-win. Moving business logic into the database has a dramatic effect on system design. You’ll find a lot less justification for something like Hibernate, for instance (ok, maybe that’s a win). I’ve been through this a couple of times, I know it’s hard to find the appropriate place to draw the line in terms of what gets moved into the database. Should you go balls to the wall and have the database return cursors for select statements? I usually don’t, but I have in some instances. Should every insert/update/delete be wrapped in a stored proc? Again, not an easy call.

In my most recent fantasy baseball app, I let the client code only insert into temporary tables, and then called a stored proc to validate the data and move it into the destination table. People look at me like I’m crazy when I tell them about this. But you know what? I’d do it again. If you buy the premise that inserts are dangerous because Java code can’t type-check them, then it’s the right way to do it. Temporary tables and stored procedures are much easier to change than Java code at most “serious” companies. It’s a matter of necessity to do it that way.
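
Here is a rough sketch of that staging-table flow. It is written in Python against SQLite, which has no stored procedures, so the load_roster function below is only a stand-in for the stored proc. The table names, columns, and the validation rule are all hypothetical; the real app’s schema isn’t shown in this post.

```python
import sqlite3

# Hypothetical schema and rules, for illustration only.
VALID_POSITIONS = ("P", "C", "1B", "2B", "3B", "SS", "OF")

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE roster_staging (player TEXT, position TEXT);
    CREATE TABLE roster (player TEXT NOT NULL, position TEXT NOT NULL);
    """
)


def stage_rows(conn, rows):
    """What the client code is allowed to do: insert into the staging table only."""
    with conn:
        conn.executemany(
            "INSERT INTO roster_staging (player, position) VALUES (?, ?)", rows
        )


def load_roster(conn):
    """Stand-in for the stored proc: move valid rows, discard the rest."""
    marks = ",".join("?" for _ in VALID_POSITIONS)
    with conn:  # one transaction for the validate-move-clear sequence
        staged = conn.execute("SELECT COUNT(*) FROM roster_staging").fetchone()[0]
        before = conn.execute("SELECT COUNT(*) FROM roster").fetchone()[0]
        conn.execute(
            "INSERT INTO roster (player, position) "
            "SELECT player, position FROM roster_staging "
            f"WHERE player IS NOT NULL AND position IN ({marks})",
            VALID_POSITIONS,
        )
        after = conn.execute("SELECT COUNT(*) FROM roster").fetchone()[0]
        conn.execute("DELETE FROM roster_staging")
    moved = after - before
    return moved, staged - moved


stage_rows(conn, [("player_a", "SS"), ("player_b", "XX"), (None, "P")])
moved, rejected = load_roster(conn)
```

The client never touches the destination table, so every rule about what counts as a valid row lives in one place next to the data.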

Hopefully I’ve covered the “whys” of PL/SQL convincingly enough. In a future post I’ll cover some basics of the “hows”.

2 comments

The Essentials of Obsessive Backups

Rounding out a small diversion down the path of personal data backup, I thought I would document my backup philosophy and scheme. Now granted, most would think I’m absolutely over the top for the intricate plan I’ve devised over the years. Suffice it to say, I’ve thought about these details a lot and finally feel like I’m at the sweet spot between data availability and data security.

That last point is important. Your data could be replicated across every machine on the planet, making it very available but obviously very insecure. I take the challenge of finding the correct balance very seriously.

The first pillar of the philosophy is to isolate the data that should be backed up from the data that doesn’t need to be backed up. Typically the first thing I do when I get a new machine is partition into 3 or 4 drives. The C drive is left to anything that was pre-installed (operating system, shareware, etc.). I leave some extra space as a buffer here because some apps insist on being installed on C or create temporary files that live in the C drive. The D drive is for applications I’ve installed with the exception of games. And all data, regardless of what application it’s from, goes to the E drive. Usually games and pictures (12 gigs and counting) go to the F drive.

Over the years this isolation has worked in my favor a couple of times. There were times that I had to re-install the OS and was thrilled to find my E drive with all data still intact. There were times a bad game install hosed the F drive but left the others untouched. In short, drive partitioning is a must. In ancient times, the process was a little harrowing and not to be done carelessly. It’s gotten a lot easier and safer now, so there’s no excuse.

The next pillar is that backups must be automated. A backup that is not automated is almost useless, as you’ll probably do it for the first couple of weeks and then quickly lose interest. There are a ton of applications that can help with this task. I rely on a mix of SyncBack and rsync, depending on the target of the backups (more on this below).

The third pillar is having a reliable, simple, accessible offsite backup. It must be reliable for obvious reasons. It must be simple because a complicated interface or API (I’m looking at you A3) only makes it less likely that I’ll work through the frustrations when things go wrong. It must be accessible so I can get my data from any machine at any time. And it must be offsite because a fire or theft could easily compromise my home machine. I found all 4 of these with rsync.net. I could write endlessly about the majesty of rsync.net. But I’ll summarize it in these short points:

  • I don’t have to install any proprietary client-side apps, such as the ones iBackup or others make you install. This is one obstacle to data accessibility that is removed.
  • Since it supports SFTP, SCP, rsync, unison, and subversion, it will work on either a PC, Mac, or *nix machine. Another obstacle removed.
  • It’s cheap. Not as cheap as A3, but pretty cheap ($1.60/gig)
  • They have great customer support, with a privacy policy that puts the customer first
  • Since they support rsync (and the others listed above), they are very developer-friendly. Since it supports SFTP, I can use a client like WinSCP if I want a GUI

Obviously this isn’t for everyone. I wouldn’t suggest it for my Aunt Millie, but for me it’s about as good as it gets.

With those pillars in place, I’ve set up the following backup scheme:

  1. Core data, including Quicken files, Word docs, and source code gets backed up to rsync.net every night. Additionally, the Quicken file is encrypted using TrueCrypt for additional security.
  2. Pictures get backed up to a Dreamhost account, which gives me plenty of space to spread out. Additionally, I’ve hacked Plogger to display the photos, making this account double as a photo gallery for friends and family. Since this data isn’t critical, it’s not important to me if it gets compromised for some reason.
  3. Core data from rsync.net is also backed up to a USB key I keep on my keychain. This provides additional data accessibility while incurring no additional security risk since the entire set of data is encrypted with TrueCrypt.
  4. Most recently I purchased a $60 USB hard drive that is connected to my home machine. This backs up all data and photos every hour. The reason for this is that in the case of data loss it would be a lot easier to restore from the USB drive than from downloading from rsync.net or Dreamhost. Also, it provides a clear data transfer path when the time comes to move to a new machine.
  5. All the data on my E drive is also kept in a Subversion repository. Data versioning is a little different than backup. The goal here is to make sure that if some file becomes corrupted I could roll back to a previous state. This is not ensured by most backup schemes, where only 1 version of each file is kept. The Subversion repository also happens to be backed up to rsync.net, the USB key, and the USB hard drive. Again, just in case.

I feel good about the logic here, but I’m constantly thinking about whether I’ve done too much or not enough. Admittedly, that’s obsessive.

0 comments

Automatically Backup Your Data from Online Services (Part II)

In my previous post I advised that if you must use an online service, make sure the service offers a means to export your data so you can back it up. I wrote mydump.pl (read source, download) as a means to automatically extract my data from the web sites I used frequently. The first two candidates were probably Bloglines and Furl, although I don’t use either of those any more.

I designed the script to expect any number of “jobs” as I called them. A job might be to get your bookmarks from del.icio.us, or to get a dump of a local MySQL database, or to send the contents of the script itself in case I updated it during the day (mindblowing….wrap your head around that). The jobs can be seen at the top of the file.

In most cases, I use wget to get remote files. It’s tailor-made for this kind of application. For instance, online services typically require that you be logged in to export your data (a reasonable request). They determine you are logged in by checking the cookies you pass them in the request. So once you figure out what cookies a site sets to determine whether you are logged in you can copy those cookies and pass them to wget with the “--header” parameter. (In the couple of years running the script I’ve never had to update the cookie values, which probably says more about the login policies of large internet sites than anything else.)
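As a rough illustration of the cookie trick, here is a small Python sketch that assembles such a wget command line. It is not taken from mydump.pl; the function name, URL, cookie names, and cookie values are placeholders, and the command is only built and printed, not executed.

```python
import shlex


def build_wget_command(url, cookies, outfile):
    """Build a wget argument list that sends copied login cookies.

    `cookies` maps cookie names to values copied from a logged-in
    browser session (the ones used below are invented).
    """
    cookie_header = "Cookie: " + "; ".join(
        f"{name}={value}" for name, value in cookies.items()
    )
    return [
        "wget",
        "--quiet",
        f"--header={cookie_header}",     # pass the cookies as a raw request header
        f"--output-document={outfile}",  # write the export to a known file name
        url,
    ]


cmd = build_wget_command(
    "https://example.com/export",             # placeholder export URL
    {"session": "abc123", "uid": "42"},       # placeholder cookie values
    "export.xml",
)
# shlex.join quotes the header so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Keeping the command as a list rather than one string avoids quoting problems if it is later handed to something like subprocess.run.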

Once the script has compiled all the data from the disparate services it emails me the updates. Since I only want emails when some of the data has changed, I instituted a quick check on the content of the data retrieved from each service. After I download the data I run a hash algorithm (sha1) on the data. The hash is compared to the sha1 from the previous run, which is stored on the filesystem. If the hash values match I know there haven’t been any changes to the data and it can be ignored (i.e. not emailed). If the values are different I can assume there is a change and mail out the file, writing the new hash value to a file for comparison during the next run. (See the “get_old_digest”, “get_new_digest”, and “write_digest” routines). I chose to do it this way so I wouldn’t need to store a copy of the data itself on my web server, which could potentially be compromised. Since the sha1 reduces a large file to a small hash, it’s efficient in terms of data storage and easy to use in string comparison. And even if there are false positives every once in a while it’s not a huge deal. The worst that will happen is that I get a copy of a file when I really didn’t need to.
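The change check boils down to something like the following Python sketch. It is an approximation of the idea rather than a port of the Perl routines: the function name has_changed is invented, and the one-digest-file-per-job layout is an assumption.

```python
import hashlib
import tempfile
from pathlib import Path


def has_changed(job_name, data, digest_dir):
    """Return True if `data` differs from what the previous run saw.

    Only the sha1 digest is kept on disk, never the data itself.
    """
    digest_file = Path(digest_dir) / f"{job_name}_digest.txt"
    new_digest = hashlib.sha1(data).hexdigest()
    old_digest = digest_file.read_text().strip() if digest_file.exists() else None
    if new_digest == old_digest:
        return False  # nothing new, so no email
    # First run or changed content: remember the digest for next time.
    digest_file.write_text(new_digest)
    return True


with tempfile.TemporaryDirectory() as digest_dir:
    first = has_changed("bookmarks", b"<xml>v1</xml>", digest_dir)
    second = has_changed("bookmarks", b"<xml>v1</xml>", digest_dir)
    third = has_changed("bookmarks", b"<xml>v2</xml>", digest_dir)
    print(first, second, third)
```

A missing digest file is treated the same as a changed one, which is why deleting the digest files forces every job to be sent again.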

Each job must have a unique name. The name is used as a key in a nested hash table (e.g. “bloglines”). Each job can have a number of options associated with it.

  • command – the command that is used to retrieve the data (required). This can be anything that Perl can execute, including system commands (e.g. wget, cat, mysqldump).
  • outfile – what the name of the file should be when it’s attached to the email.
  • zipfile – used in addition to “outfile”, this command tells mydump.pl to zip up the output file before attaching it to the email and specifies what the name of the zipped file should be.
  • filter – Something I had to account for is that the data frequently has timestamps in it that represent when the data was requested. Since this is different each time the data is requested the hash would always determine that the contents had changed. The script will ignore any lines in the data that match the value of the “filter” option before comparing the data from the current run to the data from the previous run.
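To make the options concrete, here is a hedged Python sketch of what a job table and the “filter” step might look like. The real script is Perl and its exact structure may differ; the job names, commands, and patterns below are invented, and treating “filter” as a regular expression is an assumption.

```python
import hashlib
import re

# Hypothetical job table: job name -> options, mirroring the nested hash idea.
JOBS = {
    "bookmarks": {
        "command": "wget -q -O - https://example.com/export",  # placeholder command
        "outfile": "bookmarks.xml",
        "zipfile": "bookmarks.zip",
        "filter": r"^<!-- generated at",  # drop the timestamp line before hashing
    },
    "localdb": {
        "command": "mysqldump mydb",  # placeholder database name
        "outfile": "mydb.sql",
    },
}


def digest_ignoring_filtered_lines(data, pattern=None):
    """Hash `data` after dropping every line that matches `pattern`."""
    lines = data.splitlines(keepends=True)
    if pattern:
        rx = re.compile(pattern)
        lines = [line for line in lines if not rx.search(line)]
    return hashlib.sha1("".join(lines).encode("utf-8")).hexdigest()


monday = "<!-- generated at 2008-01-07 -->\n<bookmark>a</bookmark>\n"
tuesday = "<!-- generated at 2008-01-08 -->\n<bookmark>a</bookmark>\n"
pattern = JOBS["bookmarks"]["filter"]
same = digest_ignoring_filtered_lines(monday, pattern) == digest_ignoring_filtered_lines(tuesday, pattern)
print(same)
```

Without the filter the two exports would hash differently even though the bookmarks themselves are identical.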

The script relies on Digest::SHA1 and MIME::Lite, which should be installed on most hosting accounts. I have the script on my hosting account and use cron to run the script nightly. If your hosting provider doesn’t allow command line access or you’re not sure how to do this, look through the control panel for an equivalent interface.

The “GLOCAL VARIABLE DECLARATION” section has a number of options to customize. For instance, you can set “$test_only” to 1 if you want to see what the run would look like but not send the email. One last trick is that if you delete all the “_digest.txt” files in the $output_path the script will assume you’re running it for the first time and send you the results of all the jobs. This is useful if you lost track of the most recent version of each job and want to catch up in one shot.

I hope you find the script useful.

0 comments
