Archive for the 'intermediate' Category


What’s So Great about PL/SQL

I thought I’d start a loose series on PL/SQL for server-side developers. As a developer who has had to defend my use of PL/SQL in various systems over the years, I have some pretty strong feelings about what it brings to the table. I think of PL/SQL as a first-class language. That’s not to say that it can be used wherever Java or PHP are. What I love most about PL/SQL is that it fills some major gaps that Java and PHP (and most other traditional programming languages) have. When it comes to manipulating the database, anything Java can do PL/SQL can do better. In the context of the modern database application, that means that PL/SQL is an essential piece of any system.

Server-side developers in general have some serious hangups about PL/SQL. For one, it looks weird. What? No braces?!?!? Impossible!

Look, it’s not Java. Heck, it’s not even PHP. PL/SQL is its own beast, and you have to learn how to pet that beast so it doesn’t turn on you (and take your database down with it). If you think of PL/SQL as a simple means to tie some logic around DML operations (select/create/update/delete) it begins to make a lot more sense. It’s not supposed to be elegant. It’s not supposed to require hours and hours poring over thick books with fancy titles. It’s supposed to help you build better database applications, and at this I believe it excels.

So what’s so great about PL/SQL? Here are my canned responses to that question whenever some upstart developer starts spewing the crap he read out of his textbooks:

  1. PL/SQL is compiled in the database. It always amuses me that a community like Java, which lives and dies by strong compile-time typing, is perfectly willing to let a major component of their application be loosely typed. You know all those JDBC calls/Hibernate mappings/iBatis queries? Little news for you Java dude. They’re completely unchecked. Put in terms you might understand – when you enter in a period, there’s no code assist to help you figure out how to complete the query. If I go in and modify the database in a few discrete ways your app will crash and burn. And you probably won’t realize this until a user sends a nasty email about why they can’t access the product they purchased. Not the case with PL/SQL. Since Oracle keeps them compiled in the database, you (or more likely the DBA) will know immediately if something changes in such a way that breaks the procedure or package.
  2. Since they’re compiled in the database they will run orders or magnitude faster than the corresponding queries requested by a client application. The important concept here is called context switching. In short, it turns out all those trips back and forth the database tends to slow things down. It’s much much quicker to bundle up related queries in a procedure and make one call to the procedure. I once had an argument with a Java developer about result set sorting. He was convinced that it was much faster to sort a list of objects in Java than it would be to have the database do the ORDER BY and return the results. I like the guy, but that’s just insane. The overhead of fetching each of those rows and then doing some lame bubble sort on them is astronomical. But this is the kind of thinking that infests the server side community. It’s borne out of ignorance, sometimes willful, of what a database can do.
  3. Another benefit of being in the database – they can be used by any client, not just ones written in Java (or PHP, etc.). When they talk about code reuse Java developers apparently don’t consider these kinds of issues. I’m sure it’s a wonderful learning experience to write a shipping cost calculator in Java, PHP, and JavaScript, but wouldn’t it make more sense to write it in PL/SQL once and then use it everywhere? Just a thought.
  4. Believe it or not, most of the good DBAs I’ve worked with prefer complex logic to be wrapped up somewhere they can keep an eye on it. Remember, if something breaks at 3am they’re the ones that will get paged. Having all that business logic tucked away in a jar file somewhere makes then nervous. And when things do go bad they can help a lot more when the code is in the database. It’s better for everyone.
  5. It takes about 15 minutes to learn enough PL/SQL to export some logic to the database. Sure, PL/SQL goes deeper than that, but any curly-brace type programmer should be able to absorb the concepts easily.

Now, I’m not going to say it’s all win-win. Moving business logic into the database has a dramatic effect on system design. You’ll find a lot less justification for something like Hibernate, for instance (ok, maybe that’s a win). I’ve been through this a couple of times, I know it’s hard to find the appropriate place to draw the line in terms of what gets moved into the database. Should you go balls to the wall and have the database return cursors for select statements? I usually don’t, but I have in some instances. Should every insert/update/delete be wrapped in a stored proc? Again, not an easy call.

In my most recent fantasy baseball app, I let the client code only insert into temporary tables, and then called a stored proc to validate the data and move it into the destination table. People look at me like I’m crazy when I tell them about this. But you know what? I’d do it again. If you buy the premise that inserts are dangerous because Java code can’t type-check them, then it’s the right way to do it. Temporary tables and stored procedures are much easier to change than Java code at most “serious” companies. It’s a matter of necessity to do it that way.

Hopefully I’ve covered the “whys” of PL/SQL convincingly enough. In a future post I’ll cover some basics of the “hows”.

2 comments

The Essentials of Obsessive Backups

Rounding out a small diversion down the path of personal data backup, I thought I would document my backup philosophy and scheme. Now granted, most would think I’m absolutely over the top for the intricate plan I’ve devised over the years. Suffice it to say, I’ve thought about these details a lot and finally feel like I’m at the sweet spot between data availability and data security.

That last point is important. Your data could be replicated across every machine on the planet making it very available, but obviously very insecure. I take the challenge of finding the correct balance very seriously.

The first pillar of the philosophy is to isolate the data that should be backed up from the data that doesn’t need to be backed up. Typically the first thing I do when I get a new machine is partition into 3 or 4 drives. The C drive is left to anything that was pre-installed (operating system, shareware, etc.). I leave some extra space as a buffer here because some apps insist in being installed on C or create temporary files that live in the C drive. The D drive is for applications I’ve installed with the exception of games. And all data, regardless of what application it’s from, goes to the E drive. Usually games and pictures (12 gigs and counting) go to the F drive.

Over the years this isolation has worked in my favor a couple of times. There were times that I had to re-install the OS and was thrilled to find my E drive with all data still intact. There were times a bad game install hosed the F drive but left the other untouched. In short, drive partitioning is a must. In ancient times, the process was a little harrowing and not to be done carelessly. It’s gotten a lot easier and safer now, so there’s no excuse.

The next pillar is that backups must be automated. A backup that is not automated is almost useless, as you’ll probably do it for the first couple of weeks and then quickly lose interest. There are a ton of applications that can help with this task. I rely on a mix of SyncBack and rsync, depending on the target of the backups (more on this below).

The third pillar is having a reliable, simple, accessible offsite backup. It must be reliable for obvious reasons. It must be simple because a complicated interface or API (I’m looking at you A3) only makes it less likely that I’ll work through the frustrations when things go wrong. It must be accessible so I can get my data from any machine at any time. And it must be offsite because a fire or theft could easily compromise my home machine. I found all 4 of these with rsync.net. I could write endlessly about the majesty of rsync.net. But I’ll summarize to these short points:

  • I don’t have to install any proprietary client-side apps, such as the ones iBackup or others make you install. This is one obstacle to data accessibility that is removed.
  • Since it supports SFTP, SCP, rsync, unison, and subversion, it will work on either a PC, Mac, or *nix machine. Another obstacle removed.
  • It’s cheap. Not as cheap as A3, but pretty cheap ($1.60/gig)
  • They have great customer support, with a privacy policy that puts the customer first
  • Since they support rsync (and the others listed above), they are very developer-friendly. Since it supports SFTP, I can use a client like WinSCP if I want a GUI

Obviously this isn’t for everyone. I wouldn’t suggest it for my Aunt Millie, but for me it’s about as good as it gets.

With those pillars in place, I’ve set up the following backup scheme:

  1. Core data, including Quicken files, Word docs, and source code gets backed up to rsync.net every night. Additionally, the Quicken file is encrypted using TrueCrypt for additional security.
  2. Pictures get backed up to a Dreamhost account, which gives me plenty of space to spread out. Additionally, I’ve hacked Plogger to display the photos, making this account double as a photo gallery for friends and family. Since this data isn’t critical, it’s not important to me if it gets compromised for some reason.
  3. Core data from rsync.net is also backed up to a USB key I keep on my keychain. This provides additional data accessibility while incurring no additional security risk since the entire set of data is encrypted with TrueCrypt.
  4. Most recently I purchased a $60 USB hard drive that is connected to my home machine. This backs up all data and photos every hour. The reason for this is that in the case of data loss it would be a lot easier to restore from the USB drive than from downloading from rsync.net or Dreamhost. Also, it provides a clear data transfer path when the time comes to move to a new machine.
  5. All the data on my E drive is also kept in a Subversion repository. Data versioning is a little different than backup. The goal here is to make sure that if some file becomes corrupted I could roll back to a previous state. This is not ensured by most backup schemes, where only 1 version of each file is kept. The subversion repository also happens to be backed up to both rsync.net, the USB key, and the USB harddrive. Again, just in case.

I feel good about the logic here, but I’m constantly thinking about whether I’ve done too much or not enough. Admittedly, that’s obsessive.

0 comments

Automatically Backup Your Data from Online Services (Part II)

In my previous post I advised that if you must use an online service, make sure the service offers a means to export your data so you can back it up. I wrote mydump.pl (read source, download) as a means to automatically extract my data from the web sites I used frequently. The first two candidates were probably Bloglines and Furl, although I don’t use either of those any more.

I designed the script to expect any number of “jobs” as I called them. A job might be to get your bookmarks from del.icio.us, or to get a dump of a local MySQL database, or to send the contents of the script itself in case I updated it during the day (mindblowing….wrap your head around that). The jobs can be seen at the top of the file.

In most cases, I use wget to get remote files. It’s tailor-made for this kind of application. For instance, online services typically require that you be logged in to export your data (a reasonable request). They determine you are logged in by checking the cookies you pass them in the request. So once you figure out what cookies a site sets to determine whether you are logged in you can copy those cookies and pass them to wget&nbspwith the “–header” parameter. (In the couple of years running the script I’ve never had to update the cookie values, which probably says more about the login policies of large internet sites than anything else.)

Once the script has compiled all the data from the disparate services it emails me the updates. Since I only want emails when some of the data has changed, I instituted a quick check on the content of the data retrieved from each service. After I download the data I run a hash algorithm (sha1) on the data. The hash is compared to the sha1 of from the previous run, which is stored on the filesystem. If the hash values match I know there hasn’t been any changes to the data and it can be ignored (i.e. not emailed). If the values are different I can assume there is a change and mail out the file, writing the new hash value to a file for comparison during the next run. (See the “get_old_digest”, “get_new_digest”, and “write_digest” routines). I chose to do it this way so I wouldn’t need to store a copy of the data itself on my web server, which could potentially be compromised. Since the sha1 reduces a large file to a small hash, it’s efficient in terms of data storage and easy to use in string comparison. And even if there are false positives every once in a while t’s not a huge deal. The worst that will happen is that I get a copy of a file when I really didn’t need to.

Each job must have a unique name. The name is used as a key in a nested hash table (e.g. “bloglines”). Each job can can have a number of options associated with it.

  • command – the command that is used to retrieve the data (required). This can be anything that Perl can execute, including system commands (e.g. wget, cat, mysqldump).
  • outfile – what the name of the file should be when it’s attached to the email.
  • zipfile – used in addition to “outfile”, this command tells mydump.pl to zip up the output file before attaching it to the email and specifies what the name of the zipped file should be.
  • filter – Something I had to account for is that the data frequently has timestamps in it that represents when the data was requested. Since this is different each time the data is requested the hash would always determine that the contents had changed. The script will ignore any lines in the data that match the value of the “filter” option before comparing the data from the current run to the data from the previous run.

The script relies on Digest::SHA1 and MIME::Lite, which should be installed on most hosting accounts. I have the script on my hosting account and use cron to run the script nightly. If your hosting provider doesn’t allow command line access or you’re not sure how to do this, look through the control panel for an equivalent interface.

The “GLOCAL VARIABLE DECLARATION” section has a number of options to customize. For instance, you can set “$test_only” to 1 if you want to see what the run would look like but not send the email. One last trick is that if delete all the “_digest.txt” files in the $output_path the script will assume you’re running it for the first time and send you the results of all the jobs. This is useful if you lost track of the most recent version of each job and want to catch up in one shot.

I hope you find the script useful.

0 comments

Automatically Backup Your Data from Online Services (Part I)

I am fanatical about backups. It borders on obsession. It didn’t stem out of any major data loss, it stems out of the fear of a data loss, which I guess is about the same but with more paranoia. This posed a unique issue with the advent of Web 2.0 applications, where the data is frequently stored on somebody else’s server. It took some time to work out a system that worked, but I’ve gotten it down to something of a science now and thought it’d be worthwhile to share.

There are a ton of useful services out there, but keep in mind that it’s your responsibility, not theirs, to make sure you have your data backed up. Services go out of business, change owners, have downtimes, go premium, etc. A little thought up front saves you from a frantic weekend of cutting and pasting screen fulls of data from an old service into the new one.

Below are a few rules I live by. As a preface, if you don’t have a hosted account, get one. They’re dirt cheap in most cases, and quickly becoming nearly essential. My hosted account is bang-for-the-buck the most useful service I pay for monthly. I’ve used A Small Orange for a number of years now and can highly recommend them. (If you happen to decide to use them, please consider thanking me by putting “www.vdibart.com” in the referrer box on the order form :)

  1. Always prefer a cloned or good-enough version of software that can be installed in a hosted web account. For instance, Basecamp is a great application. But did you know there’s a pretty good knock-off called ActiveCollab that was free until version 0.7.1? You can probably scrounge up a copy of that version still (wink wink nudge nudge). Even if you have to pay the $99 for a perpetual license for version 1.0.4, in my mind it’s still better to access and control over all your data.
  2. If you can’t find a hosted version, make sure the online service you select provides a means to export your data. Most of the big players like Google and Yahoo allow you to get backups of your data from inside the web application. If you know what you’re doing, you might want to make sure their service is compatible with something like curl or wget so you can call it from a script, which leads me to…
  3. Create a script to automatically pull all your data from each service. I’m a big believer in the motto that backups should run automatically otherwise they’re probably useless. I have just so happened to create a Perl script to backup my data from the various online applications. The script runs every night on my hosted account and emails me the results. From there the possibilities are endless. For instance, if I sent to my Google account I could keep them indefinitely and have implicit ability to search for a particular version. I choose to just copy them to my hard drive and use Subversion to keep them versioned. The important thing is collating the data from the various services in one place in an automated fashion.

This post turned out a little longer than I expected, so I’ll plan to cover the actual script in the next post.

0 comments

Sets – When PK’s Are Too Much

One design problem that trailed around behind me for years was how to handle the situation where the relationship between two elements doesn’t nicely fit the traditional role of a primary key/foreign key relationship. For example, let’s say you created a cool product like Scrabulous and you want to model the players participating in a game. It would be straightforward to model it as:


CREATE TABLE games
(id NUMBER PRIMARY KEY
,status_id NUMBER NOT NULL REFERENCES lk_status(id)
,create_ts DATE default SYSDATE NOT NULL
,update_ts DATE)


CREATE TABLE game_players
(game_id NUMBER REFERENCES games(id)
,user_id NUMBER REFERENCES users(id)
,create_ts DATE default SYSDATE NOT NULL
,update_ts DATE)

Nothing tricky here – a design like this will do the trick 99% of the time. There will be times when you have to look up a game by its status, or find all games created in the past 30 days (e.g. for reporting purposes), or find all games with a particular user_id in it. A “game” and a “game_player” are concrete, first-class entities in your system.

The optimal design is a little less clear when one part of the relationship is not truly an meaningful, independent entity. For example, let’s say you are building a sports application and need a way to model a matchup of 2 teams. Let me walk you through the evolution of the design. The first attempt might be:


CREATE TABLE matchups
(id NUMBER NOT NULL PRIMARY KEY
,home_team_id NUMBER NOT NULL REFERENCES teams(id)
,away_team_id NUMBER NOT NULL REFERNCES teams(id)
,matchup_date DATE NOT NULL
,create_ts DATE default SYSDATE NOT NULL
,update_ts DATE)

This design sucks. If want to find all the matchups for a particular team you’d have to inspect both the home_team_id column and the away_team_id column. Queries using an OR aren’t as bad as outer joins, but they aren’t great and should be avoided when possible. So, next iteration. Instead of trying to cram it all into 1 row, you decide to explode it so that a single matchup is described by 2 rows such as this:


CREATE TABLE matchups
(id NUMBER NOT NULL PRIMARY KEY
,team_id NUMBER NOT NULL REFERENCES teams(id)
,is_home_team CHAR(1) NOT NULL
,matchup_date DATE NOT NULL
,create_ts DATE default SYSDATE NOT NULL
,update_ts DATE)

A little better, but something’s not quite right. A single matchup between 2 teams now produces 2 rows, but the rows aren’t linked in any way. In other words, it’s near impossible to find both sides of a matchup because nothing (other than possibly matchup_date) associates them. So in the next iteration we introduce the concept of a “matchup set”.


CREATE TABLE matchups
(id NUMBER NOT NULL PRIMARY KEY
,matchup_set NUMBER NOT NULL
,team_id NUMBER NOT NULL REFERENCES teams(id)
,is_home_team CHAR(1) NOT NULL
,matchup_date DATE NOT NULL
,create_ts DATE default SYSDATE NOT NULL
,update_ts DATE)

So we can now identify both teams in a matchup because their matchup_set are the same:


SELECT m.id, m.is_home_team
FROM matchups m
WHERE m.matchup_set = 12345

This would return 2 rows. At this point, your refactoring senses start tingling. Every matchup set would implicitly have the same MATCHUP_DATE. Wouldn’t it be better to have a parent table called…well, not sure really what to call it. Guess you’d have to change the MATCHUPS table to be MATCHUP_DETAILS, and then create a new MATCHUPS table. Extracting out MATCHUP_DATE you are left with this design:


CREATE TABLE matchups
(id NUMBER NOT NULL PRIMARY KEY
,matchup_date DATE NOT NULL
,create_ts DATE default SYSDATE NOT NULL
,update_ts DATE)


CREATE TABLE matchup_details
(matchup_id NUMBER NOT NULL REFERENCES matchups(id)
,team_id NUMBER NOT NULL REFERENCES teams(id)
,is_home_team CHAR(1) NOT NULL
,create_ts DATE default SYSDATE NOT NULL
,update_ts DATE)

Doesn’t seem like a big win though. Now we have 2 tables, which creates an extra join in most situations:


SELECT m.id, m.matchup_date, md.team_id, md.is_home_team
FROM matchups m, matchup_details md
WHERE m.id = 12345
AND m.id = md.matchup_id

Now here’s the big IF: if you will ever need to reference a matchup in another table, the last iteration is the correct design. But, again, in that situation a matchup implicitly represents a first-class entity. For those situations where the matchup is a loose association of teams, though, you should consider the benefits of the previous iteration.

In a sense, this pattern inverts the last iteration’s design. Instead of teams belonging in a matchup, the matchup is a property of 2 different teams. As usual, this flexibility comes with benefits and costs. When used correctly you gain a number of things:

  1. fewer joins – there’s no parent table to have to join to all the time
  2. fewer inserts – again, no parent table to have to maintain
  3. #1 and #2 combine to reduce the possibility for your data to become corrupt
  4. there’s less meaningless data in the system – the parent MATCHUP table is mostly cruft

For this you trade the ability to use the matchup set as a foreign key. In some situations, that’s not only perfectly acceptable, it’s preferable.

0 comments

Outer Joins are Evil

I hate outer joins. Hate them. Hate writing them, hate tuning them, hate jumping through hoops to avoid them. They suck up processing power, they turn otherwise efficient queries into performance pigs, they look ugly, they’re hard to read, and they generally make life more difficult than it needs to be.

Contrary to previous plans, I won’t rehash what an outer join is. The internet being what it is, there are plenty of resources to explain the concept to you, most of which will do a better job than I could. But I would like to present some design techniques I’ve used to avoid outer joins. But before I get into those, I want to make this point crystal clear. Outer joins are evil and you should do whatever is within your power to avoid them. I have gone to some extreme lengths in my designs to avoid outer joins, and every single time the effort has paid off. I recommend you do the same.

Technique #1: Always Populate Column Values

Sometimes avoiding an outer join is as simple as making sure the column always has a value. Might seem straightforward, but I’ve seen many designs that don’t make use of this. Let’s assume you have a table that holds information about some pretty pictures that your users have uploaded into your system. One of the fields you want to track is the status of the picture – whether it has been approved for display (e.g. it’s not naughty). Your schema might look like this:

CREATE TABLE lk_picture_status
(id NUMBER NOT NULL PRIMARY KEY
,status_name VARCHAR2(20) NOT NULL
,create_ts DATE default sysdate NOT NULL
,update_ts DATE)

CREATE TABLE pretty_pictures
(id NUMBER NOT NULL PRIMARY KEY
,picture_status_id NUMBER NOT NULL REFERENCES lk_picture_status(id)
,create_ts DATE default sysdate NOT NULL
,update_ts DATE)

The goal is to ensure that picture_status_id is never null (done here by making the column NOT NULL), otherwise you will have to use an outer join in any queries that fetch the pictures based on status. This could be done in the server-side code that inserts the row or (more preferably) via a column default value or before insert trigger. The latter two have the additional benefit of automatically filling in the column’s value when the row is inserted regardless of what code performs the insert.

Technique #2: Avoid 1:1 Relationship Tables

Generally it’s considered bad practice to design tables with a 1:1 relationship. I’ve seen some variations on the following theme in various places:

CREATE TABLE users
(id NUMBER NOT NULL PRIMARY KEY
,name VARCHAR2(200) NOT NULL
,create_ts DATE default sysdate NOT NULL
,update_ts DATE)

CREATE TABLE user_details
(user_id NUMBER NOT NULL REFERENCES users(id)
,email VARCHAR2(200) NOT NULL
,city VARCHAR2(200)
,state VARCHAR2(20)
,zip VARCHAR2(20)
,create_ts DATE default sysdate NOT NULL
,update_ts DATE)

Where for every row in users there can be 0 or 1 row in user_details. Leaving behind the logic of why you would do that for now, this situation can easily lead to the use of outer joins if user_details does not contain a row for every row in users.

In truth, developers will usually make the the inserts into those tables part of the same transaction, and in most cases this will keep the tables sufficiently in sync. But every once in a while an exception/bug occurs and the child record never gets written. Frequently this become quite a headache to track down and fix because developers never stop to consider that the child record might not have been inserted. So keeping in mind my mantra that data will alway find a way to get corrupted, the best design is to avoid tables with a 1:1 relationship. This has the dual benefit of avoiding the possible data corruption issue while removing any need for an outer join. In the above example, there’s no reason why the relevant columns in the child table can’t be included in the parent table.

An Acceptable (if Unavoidable) Use

Sometimes the lengths required to remove an outer join isn’t worth the effort. Here’s a set of tables that illustrates that point:

CREATE TABLE admins
(id NUMBER NOT NULL PRIMARY KEY
,name VARCHAR2(200) NOT NULL
,is_active CHAR(1) default ‘y’ NOT NULL
,create_ts DATE default sysdate NOT NULL
,update_ts DATE)

CREATE TABLE pretty_pictures_2
(id NUMBER NOT NULL PRIMARY KEY
,was_banned CHAR(1) default ‘n’ NOT NULL
,banning_admin_id NUMBER REFERENCES admins(id)
,create_ts DATE default sysdate NOT NULL
,update_ts DATE)

So a row in pretty_pictures_2 can be banned by an admin, at which point the admin’s admins.id value is placed on the appropriate record in the pretty_pictures table. So if you wanted to report all pictures uploaded within the current day and show the banning admin (if it applicable) you would use:

SELECT pp.id, pp.message, pp.was_banned, nvl(a.name, ”)
FROM pretty_pictures_2 pp, admins a
WHERE pp.banning_admin_id (+) = a.id
AND trunc(pp.create_ts) = trunc(sysdate)

It’s kind of hard to avoid this situation cleanly. You could add a bogus admin in the admins table (e.g. admin_id = 0) that is interpreted to mean that no admin is associated with the record, and thus it’s not actually banned. Then you could safely set a default value on the column as described in Technique #1 above. This is undeniably hackish and probably not worth the effort. Similarly, there’s no meaningful way to adapt Technique #2 in this situation. So you’ll have to let the outer join stand. Luckily I’ve found that this happens infrequently, and when it does it’s on relatively simple queries that don’t get executed often enough to make it on the DBA’s list of naughty queries.

2 comments

Reduce Complexity with Self-Expiring Data Columns

I thought I’d start off with a cool database design trick that I find use for frequently. Here’s the setup. You have a piece of information that you need to expire after some period of time. For example, you might send someone a message that they have to act on in 10 days or the message disappears. Here’s what relevant table might look like (Oracle syntax):

CREATE TABLE messages
(id NUMBER NOT NULL PRIMARY KEY,
,from_user_id NUMBER NOT NULL REFERENCES users(id)
,to_user_id NUMBER NOT NULL REFERENCES users(id)
,message VARCHAR2(2000) NOT NULL
,is_expired CHAR(1) default ‘n’ NOT NULL
,create_ts DATE default sysdate NOT NULL
,update_ts DATE)

The brute force method is to have a separate backend process that goes through every night and expires all messages that meet the following criteria:

SELECT m.id
FROM messages m
WHERE m.create_ts <= sysdate-10

Then if you want to find all unexpired messages for a particular user you would use

SELECT m.id
FROM messages m
WHERE m.to_user_id = 1234
AND m.is_expired = ‘n’

All is great, until one night your process fails, and the developer who wrote the process is on vacation, and chaos ensues. You have to rerun the process, but no one’s sure how. You know how it goes. Moving parts have a tendency to break.

My preferred solution for this reworks the data model and the query to put the burden of work on the query, removing the need for the backend process. It’s not as bad as it sounds. Here’s the reworked table definition:

CREATE TABLE messages
(id NUMBER NOT NULL PRIMARY KEY,
,from_user_id NUMBER NOT NULL REFERENCES users(id)
,to_user_id NUMBER NOT NULL REFERENCES users(id)
,message VARCHAR2(2000) NOT NULL
,expires_ts DATE NOT NULL
,create_ts DATE default sysdate NOT NULL
,update_ts DATE)

Notice that is_expired changes to expires_ts. So instead of having a flag to say whether the message is expired, we get a little less literal about it – a message is expired when expires_ts is in the past. One goal is met – there’s no need for a nightly process to expire the messages. If you want to find all the unexpired messages for a particular user you would use

SELECT m.id
FROM messages m
WHERE m.to_user_id = 1234
AND m.expires_ts < sysdate

Not significantly heavier for the database to process as long as there’s an index on to_user_id. Here’s a quick summary of the pros and cons at work here:

Pros:

  • Reduction of moving parts – no separate process needed to expire items manually
  • We gain system meta data – previously we only knew whether an item was expired or not. Now we know when it was expired. If you throw in an index on expires_ts, you can easily track expired items across the entire system. For the price of changing the data type of one column and throwing in an index we gain more insight into our application – I love that kind of thing
  • The code entering the data can change the expiration date according to business rules built into the system. Here’s some homework – sketch out how the backend process (and the system in general) would have to change if the data in the MESSAGES table was intended to have different expiration dates – for example if some messages were supposed to expire in 10 days and other in 15. The self-expiring data pattern avoids that mess elegantly
  • Remove the forced and unsavory reliance on create_ts. For example, think about what must happen if the business folks request that a message’s expiration time can be extended so that it will expire in 20 days instead of 10. How does the brute force method handle this? By changing the create_ts value? By relying on update_ts instead? By entering a new row? None of those options are great.

Cons:

  • All queries against the MESSAGES table have to check the expires_ts column. This can lead to a lot of bugs for those unfamiliar with the data model.

Something to Consider:

I always consider the case of what would happen if a batch of rows would have to be expired in bulk (say 100,000 rows). In the brute force model it’s easy – set is_expired to “y” for those rows. For the self-expiring data pattern you have to find the rows and update the expired_ts to some date more than 10 days ago. I have some objections to this since you are altering data in a way that makes it invalid. It didn’t really expire 11 days ago, you are just saying it did so it doesn’t show up on the live site any more. Reports on that data might now include it erroneously, etc. This could be a minor point or a large problem depending on your system requirements.

Something Not To Consider:

More isn’t better. I do not recommend putting both an is_expired and a expires_ts column on the table because it’s asking for trouble when the data inevitably gets corrupted. In other words, what happens if somehow the is_expired gets set to “y” but the expires_ts is in the future? What should the code do? What about the guy who’s trying to track down a bug and isn’t sure if that’s normal? It’s bad news. Avoid it.

I think it’s clear that the trade off is worth it in almost all cases – the self-expiring data pattern is incredibly useful and certainly a huge improvement over the brute force method.

4 comments

« Previous Page