Reduce Complexity with Self-Expiring Data Columns

I thought I’d start off with a cool database design trick that I find use for frequently. Here’s the setup. You have a piece of information that you need to expire after some period of time. For example, you might send someone a message that they have to act on in 10 days or the message disappears. Here’s what relevant table might look like (Oracle syntax):

CREATE TABLE messages
(id NUMBER NOT NULL PRIMARY KEY,
,from_user_id NUMBER NOT NULL REFERENCES users(id)
,to_user_id NUMBER NOT NULL REFERENCES users(id)
,message VARCHAR2(2000) NOT NULL
,is_expired CHAR(1) default ‘n’ NOT NULL
,create_ts DATE default sysdate NOT NULL
,update_ts DATE)

The brute force method is to have a separate backend process that goes through every night and expires all messages that meet the following criteria:

SELECT m.id
FROM messages m
WHERE m.create_ts <= sysdate-10

Then if you want to find all unexpired messages for a particular user you would use

SELECT m.id
FROM messages m
WHERE m.to_user_id = 1234
AND m.is_expired = ‘n’

All is great, until one night your process fails, and the developer who wrote the process is on vacation, and chaos ensues. You have to rerun the process, but no one’s sure how. You know how it goes. Moving parts have a tendency to break.

My preferred solution for this reworks the data model and the query to put the burden of work on the query, removing the need for the backend process. It’s not as bad as it sounds. Here’s the reworked table definition:

CREATE TABLE messages
(id NUMBER NOT NULL PRIMARY KEY,
,from_user_id NUMBER NOT NULL REFERENCES users(id)
,to_user_id NUMBER NOT NULL REFERENCES users(id)
,message VARCHAR2(2000) NOT NULL
,expires_ts DATE NOT NULL
,create_ts DATE default sysdate NOT NULL
,update_ts DATE)

Notice that is_expired changes to expires_ts. So instead of having a flag to say whether the message is expired, we get a little less literal about it – a message is expired when expires_ts is in the past. One goal is met – there’s no need for a nightly process to expire the messages. If you want to find all the unexpired messages for a particular user you would use

SELECT m.id
FROM messages m
WHERE m.to_user_id = 1234
AND m.expires_ts < sysdate

Not significantly heavier for the database to process as long as there’s an index on to_user_id. Here’s a quick summary of the pros and cons at work here:

Pros:

  • Reduction of moving parts – no separate process needed to expire items manually
  • We gain system meta data – previously we only knew whether an item was expired or not. Now we know when it was expired. If you throw in an index on expires_ts, you can easily track expired items across the entire system. For the price of changing the data type of one column and throwing in an index we gain more insight into our application – I love that kind of thing
  • The code entering the data can change the expiration date according to business rules built into the system. Here’s some homework – sketch out how the backend process (and the system in general) would have to change if the data in the MESSAGES table was intended to have different expiration dates – for example if some messages were supposed to expire in 10 days and other in 15. The self-expiring data pattern avoids that mess elegantly
  • Remove the forced and unsavory reliance on create_ts. For example, think about what must happen if the business folks request that a message’s expiration time can be extended so that it will expire in 20 days instead of 10. How does the brute force method handle this? By changing the create_ts value? By relying on update_ts instead? By entering a new row? None of those options are great.

Cons:

  • All queries against the MESSAGES table have to check the expires_ts column. This can lead to a lot of bugs for those unfamiliar with the data model.

Something to Consider:

I always consider the case of what would happen if a batch of rows would have to be expired in bulk (say 100,000 rows). In the brute force model it’s easy – set is_expired to “y” for those rows. For the self-expiring data pattern you have to find the rows and update the expired_ts to some date more than 10 days ago. I have some objections to this since you are altering data in a way that makes it invalid. It didn’t really expire 11 days ago, you are just saying it did so it doesn’t show up on the live site any more. Reports on that data might now include it erroneously, etc. This could be a minor point or a large problem depending on your system requirements.

Something Not To Consider:

More isn’t better. I do not recommend putting both an is_expired and a expires_ts column on the table because it’s asking for trouble when the data inevitably gets corrupted. In other words, what happens if somehow the is_expired gets set to “y” but the expires_ts is in the future? What should the code do? What about the guy who’s trying to track down a bug and isn’t sure if that’s normal? It’s bad news. Avoid it.

I think it’s clear that the trade off is worth it in almost all cases – the self-expiring data pattern is incredibly useful and certainly a huge improvement over the brute force method.

Be Sociable, Share!
4 comments

4 Comments so far

  1. Jason on December 6th, 2007

    Dog,

    What about a TTL column in conjunction with the CREATE_TS, instead of EXPIRED_TS? That would answer your homework problem of messages differing expiration times. If you wanted to change all messages that lived for ten days to twenty you wouldn’t have to look at CREATE_TS to do so. Display code would add TTL to CREATE_TS and check against sysdate. would also improve performance the bulk update. There are pros and cons to this but I guess the only major advantage to the TTL way would be for clarity in the varying expiration logic.

    – Jason

  2. vdibart on December 6th, 2007

    If I understand you properly…..logically, I think this is the same as using just EXPIRES_TS. In other words:

    TTL + CREATE_TS = EXPIRES_TS

    So in effect this is an alternate formulation of the idea, and I think just as valid. One benefit I can think of is that you probably don’t have to index TTL, so if you already have CREATE_TS indexed you’re saving yourself on index space and upkeep. One drawback is that it’s not very self-descriptive – the developer would have to know that TTL is applied to CREATE_TS, and not some other column like UPDATE_TS (which I always have in my tables….more on that later)

    Thanks for the suggestion though.

  3. charles on December 19th, 2007

    Hey Vinny, thanks for the tip. This definitely sounds like the right method for self-expiring data. But I think you would still need a background process if you want to take action at the time of expiration (e.g. sending an email notification). Are there any other ways to do that?

  4. vdibart on December 20th, 2007

    You are right – you will always need a background process if you want to perform some other action when it expires. This pattern doesn’t address that issue, only the issue of whether you need a background process to actually do the expiring. But keep in mind that the background process itself can construct the query as outlined above. If you think about this for a while, you will realize that the pattern naturally ensures that you can rerun that process any number of times without harming data integrity. But that might be a topic for another post 🙂

Leave a Reply