Archive for December, 2007

Dangerous Style

Slow week, so I thought I’d vent. Here are the top 3 things that seem like good ideas at the time:

  1. Using IDs in WHERE clauses when a CHAR or VARCHAR column could be used. This is another of those pesky “proper design” practices. As a developer, you should always assume that a row’s ID can change at any time. This is most tempting when joining to lookup tables, where you think the IDs will never change. That might be true in the development environment, but the IDs can turn out to be different in the production environment for any number of reasons, and then all your SQL breaks. It’s always safer to use a unique identifying string value in joins where possible. And if your lookup tables weren’t designed with a unique key on a string column, you should create one. It’s much, much safer to do:

    SELECT p.name FROM people p, groups g WHERE p.group_id = g.id AND g.name = 'friends'

    than it is to do:

    SELECT p.name FROM people p WHERE p.group_id = 1

    At that rate, why have a lookup table there in the first place? (No, I’m not really suggesting this, although I have worked on systems where this was commonplace.)

  2. Misuse/overuse of IN clauses. When I’m writing ad-hoc queries I tend to throw the IN clause around liberally because it’s usually quicker than writing a join. In production code, you should severely restrict your use of it. For one, it’s not the most efficient clause for the database to execute. For another, it’s error-prone. Here’s the wrong way to use an IN clause:

    SELECT p.name FROM people p WHERE p.group_id in (SELECT id FROM groups WHERE strangers = 'n')

    I guess some developers have the impression that it’s easier for the DBMS to optimize that because it looks like 2 separate queries squashed together. In most non-trivial cases it’s not easier to optimize. Here’s the right way to use an IN clause:

    SELECT p.name FROM people p, groups g WHERE p.group_id = g.id AND g.strangers in ('n', 'notsure')

    In most correct uses, it could be conceptually replaced with one or more OR clauses, as in the sketch below. So lay off the IN clauses, please.
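
    For illustration, here’s that same filter written out with OR clauses – the two forms are logically equivalent:

    SELECT p.name FROM people p, groups g WHERE p.group_id = g.id AND (g.strangers = 'n' OR g.strangers = 'notsure')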

  3. SELECT * FROM … queries. There is no reason to ever SELECT * from any table. No reason. It’s a lazy, horrible practice. Spend another 3 minutes and protect yourself from the bugs that will eventually arise when someone adds a column to that table or modifies one. If you’re really too lazy to type out the column names, you can do something like this:

    SELECT column_name FROM user_tab_columns WHERE table_name = 'MESSAGES'

    The table USER_TAB_COLUMNS is part of Oracle’s data dictionary, which I’m hoping to cover more in future posts. That’s just one way to be lazy and productive at the same time.


Outer Joins are Evil

I hate outer joins. Hate them. Hate writing them, hate tuning them, hate jumping through hoops to avoid them. They suck up processing power, they turn otherwise efficient queries into performance pigs, they look ugly, they’re hard to read, and they generally make life more difficult than it needs to be.

Contrary to previous plans, I won’t rehash what an outer join is. The internet being what it is, there are plenty of resources to explain the concept to you, most of which will do a better job than I could. Instead, I’d like to present some design techniques I’ve used to avoid outer joins. Before I get into those, I want to make this point crystal clear: outer joins are evil, and you should do whatever is within your power to avoid them. I have gone to some extreme lengths in my designs to avoid outer joins, and every single time the effort has paid off. I recommend you do the same.

Technique #1: Always Populate Column Values

Sometimes avoiding an outer join is as simple as making sure the column always has a value. It might seem straightforward, but I’ve seen many designs that don’t take advantage of this. Let’s assume you have a table that holds information about the pretty pictures your users have uploaded into your system. One of the fields you want to track is the status of the picture – whether it has been approved for display (e.g. it’s not naughty). Your schema might look like this:

CREATE TABLE lk_picture_status
(id NUMBER NOT NULL PRIMARY KEY
,status_name VARCHAR2(20) NOT NULL
,create_ts DATE default sysdate NOT NULL
,update_ts DATE)

CREATE TABLE pretty_pictures
(id NUMBER NOT NULL PRIMARY KEY
,picture_status_id NUMBER NOT NULL REFERENCES lk_picture_status(id)
,create_ts DATE default sysdate NOT NULL
,update_ts DATE)

The goal is to ensure that picture_status_id is never null (done here by making the column NOT NULL), otherwise you will have to use an outer join in any queries that fetch the pictures based on status. This could be done in the server-side code that inserts the row or (more preferably) via a column default value or before insert trigger. The latter two have the additional benefit of automatically filling in the column’s value when the row is inserted regardless of what code performs the insert.
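
As a minimal sketch – assuming a row with id 1 in lk_picture_status that means something like “pending review” (both the id and the trigger name here are illustrative) – either of these fills in the column automatically:

-- Option 1: a column default
ALTER TABLE pretty_pictures MODIFY (picture_status_id DEFAULT 1)

-- Option 2: a before-insert trigger (fires before the NOT NULL check)
CREATE OR REPLACE TRIGGER trg_pretty_pictures_bi
BEFORE INSERT ON pretty_pictures
FOR EACH ROW
BEGIN
  IF :new.picture_status_id IS NULL THEN
    :new.picture_status_id := 1;
  END IF;
END;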

Technique #2: Avoid 1:1 Relationship Tables

Generally it’s considered bad practice to design tables with a 1:1 relationship. I’ve seen some variations on the following theme in various places:

CREATE TABLE users
(id NUMBER NOT NULL PRIMARY KEY
,name VARCHAR2(200) NOT NULL
,create_ts DATE default sysdate NOT NULL
,update_ts DATE)

CREATE TABLE user_details
(user_id NUMBER NOT NULL REFERENCES users(id)
,email VARCHAR2(200) NOT NULL
,city VARCHAR2(200)
,state VARCHAR2(20)
,zip VARCHAR2(20)
,create_ts DATE default sysdate NOT NULL
,update_ts DATE)

Where for every row in users there can be 0 or 1 rows in user_details. Leaving aside for now the question of why you would do that, this situation easily leads to outer joins whenever user_details does not contain a row for every row in users.
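
To fetch every user along with their details when present, you’re forced into something like this (Oracle outer join syntax, using the tables above):

SELECT u.name, ud.email
FROM users u, user_details ud
WHERE u.id = ud.user_id (+)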

In truth, developers will usually make the inserts into those tables part of the same transaction, and in most cases this will keep the tables sufficiently in sync. But every once in a while an exception/bug occurs and the child record never gets written. Frequently this becomes quite a headache to track down and fix, because developers never stop to consider that the child record might not have been inserted. So, keeping in mind my mantra that data will always find a way to get corrupted, the best design is to avoid tables with a 1:1 relationship. This has the dual benefit of avoiding the possible data corruption issue while removing any need for an outer join. In the above example, there’s no reason why the relevant columns in the child table can’t be included in the parent table.
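
A minimal sketch of the merged design – the detail columns simply move into the parent, so there’s no child row to go missing and no outer join to write:

CREATE TABLE users
(id NUMBER NOT NULL PRIMARY KEY
,name VARCHAR2(200) NOT NULL
,email VARCHAR2(200) -- nullable now, since not every user had a detail row
,city VARCHAR2(200)
,state VARCHAR2(20)
,zip VARCHAR2(20)
,create_ts DATE default sysdate NOT NULL
,update_ts DATE)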

An Acceptable (if Unavoidable) Use

Sometimes the lengths required to remove an outer join aren’t worth the effort. Here’s a set of tables that illustrates the point:

CREATE TABLE admins
(id NUMBER NOT NULL PRIMARY KEY
,name VARCHAR2(200) NOT NULL
,is_active CHAR(1) default 'y' NOT NULL
,create_ts DATE default sysdate NOT NULL
,update_ts DATE)

CREATE TABLE pretty_pictures_2
(id NUMBER NOT NULL PRIMARY KEY
,was_banned CHAR(1) default 'n' NOT NULL
,banning_admin_id NUMBER REFERENCES admins(id)
,create_ts DATE default sysdate NOT NULL
,update_ts DATE)

So a row in pretty_pictures_2 can be banned by an admin, at which point the admin’s admins.id value is placed on the appropriate record in the pretty_pictures_2 table. So if you wanted to report all pictures uploaded within the current day and show the banning admin (if applicable), you would use:

SELECT pp.id, pp.was_banned, nvl(a.name, 'n/a')
FROM pretty_pictures_2 pp, admins a
WHERE pp.banning_admin_id = a.id (+)
AND trunc(pp.create_ts) = trunc(sysdate)

It’s kind of hard to avoid this situation cleanly. You could add a bogus admin to the admins table (e.g. admin_id = 0) that is interpreted to mean that no admin is associated with the record, and thus it’s not actually banned. Then you could safely set a default value on the column as described in Technique #1 above. This is undeniably hackish and probably not worth the effort. Similarly, there’s no meaningful way to adapt Technique #2 to this situation. So you’ll have to let the outer join stand. Luckily I’ve found that this happens infrequently, and when it does it’s on relatively simple queries that don’t get executed often enough to make it onto the DBA’s list of naughty queries.


Proper Use of Check Constraints and Enums

Something that comes up from time to time is the proper use of Oracle column check constraints or MySQL enum type columns. Typically the question is when it’s appropriate to use those features to enforce data integrity as opposed to having foreign keys to a lookup table.

For example, here is a table that uses MySQL enumeration values to enforce that the log message level is either “debug”, “warning”, or “error”:

CREATE TABLE log_messages
(id INTEGER NOT NULL
,message_level ENUM('debug', 'warning', 'error')
,message VARCHAR(255)
,create_ts TIMESTAMP default CURRENT_TIMESTAMP NOT NULL
,update_ts TIMESTAMP)

The first consideration is whether the enumeration of message levels is locked down to those three for the lifetime of the application. In the above case, that seems perfectly within the bounds of solid design practice. There might be one or two conceivable levels added, but that’s what keeps developers and DBAs employed. On the other hand, if the list of values will grow gradually, there’s a problem, as in the table below:

CREATE TABLE log_messages
(id INTEGER NOT NULL
,message_level ENUM('debug', 'warning', 'error')
,job_name ENUM('send_emails', 'clean_orders', 'calc_revenue')
,message VARCHAR(255)
,create_ts TIMESTAMP default CURRENT_TIMESTAMP NOT NULL
,update_ts TIMESTAMP)

It should be clear that the list of jobs will gradually grow to include ones that you couldn’t conceive of when you designed the table. It’s certainly quicker to use an enumeration here, but in the long run it’s going to cause headaches. The preferred redesign involves 2 tables:

CREATE TABLE log_messages
(id INTEGER NOT NULL
,message_level ENUM('debug', 'warning', 'error')
,job_id INTEGER NOT NULL
,message VARCHAR(255)
,create_ts TIMESTAMP default CURRENT_TIMESTAMP NOT NULL
,update_ts TIMESTAMP)

CREATE TABLE jobs
(id INTEGER NOT NULL
,name VARCHAR(20)
,create_ts TIMESTAMP default CURRENT_TIMESTAMP NOT NULL
,update_ts TIMESTAMP)

Where log_messages.job_id = jobs.id. Of course, this points out one more problem with the overuse of check constraints or enumerations: what happens if you want to reuse that column’s values in another table as a foreign key? For instance, you might have another table that stores the total run time for each job. You would have to add the same constraint/enum to that other table and make sure the two stay in sync as the system evolves. This duplication should raise the hackles of any experienced developer.
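
With the lookup table in place, that second table simply reuses the key – a quick sketch, with job_run_times as a hypothetical name:

CREATE TABLE job_run_times
(id INTEGER NOT NULL
,job_id INTEGER NOT NULL -- the same jobs.id value that log_messages uses
,run_seconds INTEGER NOT NULL
,create_ts TIMESTAMP default CURRENT_TIMESTAMP NOT NULL
,update_ts TIMESTAMP)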

So the rule of thumb is that check constraints and enums should be used sparingly: in cases where the enumeration values are unlikely to change frequently, and in places where you are sure you’ll never need to reuse the values as foreign keys in other tables. One of my favorite places to use them is for boolean yes/no columns such as:

CREATE TABLE admins
(id INTEGER
,name VARCHAR(255)
,is_active ENUM('y', 'n') default 'y' NOT NULL
,create_ts TIMESTAMP default CURRENT_TIMESTAMP NOT NULL
,update_ts TIMESTAMP)

Obviously there would be little reason to extend the ‘y’ and ‘n’ values. And there would not be a meaningful reason to externalize those values so they could be reused in other tables.
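
For completeness, since the question covers Oracle check constraints too, here’s a sketch of the same boolean column enforced with a column check constraint in Oracle syntax:

CREATE TABLE admins
(id NUMBER NOT NULL PRIMARY KEY
,name VARCHAR2(255)
,is_active CHAR(1) default 'y' NOT NULL CHECK (is_active IN ('y', 'n'))
,create_ts DATE default sysdate NOT NULL
,update_ts DATE)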

Something to Consider

Keep in mind that the MySQL enum type has some other features that might come in handy, such as the fact that each enumeration value is stored as, and can be referenced by, a number. That may also influence your decision to use it in places where you otherwise shouldn’t, depending on the application design.
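
For example, an enum column used in a numeric context yields its 1-based index, so a quick sketch like this sorts messages by declared severity order rather than alphabetically:

SELECT message_level + 0 AS level_index, message
FROM log_messages
ORDER BY message_level + 0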


Reduce Complexity with Self-Expiring Data Columns

I thought I’d start off with a cool database design trick that I find a use for frequently. Here’s the setup: you have a piece of information that needs to expire after some period of time. For example, you might send someone a message that they have to act on in 10 days or the message disappears. Here’s what the relevant table might look like (Oracle syntax):

CREATE TABLE messages
(id NUMBER NOT NULL PRIMARY KEY
,from_user_id NUMBER NOT NULL REFERENCES users(id)
,to_user_id NUMBER NOT NULL REFERENCES users(id)
,message VARCHAR2(2000) NOT NULL
,is_expired CHAR(1) default 'n' NOT NULL
,create_ts DATE default sysdate NOT NULL
,update_ts DATE)

The brute force method is to have a separate backend process that goes through every night and expires all messages that meet the following criterion:

SELECT m.id
FROM messages m
WHERE m.create_ts <= sysdate-10
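
The nightly process would then flip the flag on those rows, with something along these lines:

UPDATE messages
SET is_expired = 'y', update_ts = sysdate
WHERE create_ts <= sysdate-10
AND is_expired = 'n'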

Then if you want to find all unexpired messages for a particular user, you would use:

SELECT m.id
FROM messages m
WHERE m.to_user_id = 1234
AND m.is_expired = 'n'

All is great, until one night your process fails, and the developer who wrote the process is on vacation, and chaos ensues. You have to rerun the process, but no one’s sure how. You know how it goes. Moving parts have a tendency to break.

My preferred solution for this reworks the data model and the query to put the burden of work on the query, removing the need for the backend process. It’s not as bad as it sounds. Here’s the reworked table definition:

CREATE TABLE messages
(id NUMBER NOT NULL PRIMARY KEY
,from_user_id NUMBER NOT NULL REFERENCES users(id)
,to_user_id NUMBER NOT NULL REFERENCES users(id)
,message VARCHAR2(2000) NOT NULL
,expires_ts DATE NOT NULL
,create_ts DATE default sysdate NOT NULL
,update_ts DATE)

Notice that is_expired changes to expires_ts. So instead of having a flag that says whether the message is expired, we get a little less literal about it – a message is expired when expires_ts is in the past, and the expiration date is simply set when the row is inserted. One goal is met – there’s no need for a nightly process to expire the messages. If you want to find all the unexpired messages for a particular user, you would use:

SELECT m.id
FROM messages m
WHERE m.to_user_id = 1234
AND m.expires_ts > sysdate

Not significantly heavier for the database to process as long as there’s an index on to_user_id. Here’s a quick summary of the pros and cons at work here:

Pros:

  • Reduction of moving parts – no separate process needed to expire items manually
  • We gain system metadata – previously we only knew whether an item was expired or not; now we know when it expired. If you throw in an index on expires_ts, you can easily track expired items across the entire system. For the price of changing the data type of one column and throwing in an index, we gain more insight into our application – I love that kind of thing
  • The code entering the data can set the expiration date according to business rules built into the system. Here’s some homework – sketch out how the backend process (and the system in general) would have to change if the data in the MESSAGES table was intended to have different expiration dates – for example, if some messages were supposed to expire in 10 days and others in 15. The self-expiring data pattern avoids that mess elegantly (see the sketch after this list)
  • We remove the forced and unsavory reliance on create_ts. For example, think about what must happen if the business folks request that a message’s expiration time be extended so that it expires in 20 days instead of 10. How does the brute force method handle this? By changing the create_ts value? By relying on update_ts instead? By entering a new row? None of those options are great.
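
Here’s the sketch promised above – a hypothetical insert showing the expiration rule living with the code that writes the row (messages_seq is an assumed Oracle sequence):

INSERT INTO messages (id, from_user_id, to_user_id, message, expires_ts)
VALUES (messages_seq.nextval, 42, 1234, 'please respond', sysdate + 10)

A different business rule just means a different value – sysdate + 15, say – with no change to any query or backend process.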

Cons:

  • All queries against the MESSAGES table have to check the expires_ts column. This can lead to a lot of bugs for those unfamiliar with the data model.

Something to Consider:

I always consider the case of what would happen if a batch of rows had to be expired in bulk (say 100,000 rows). In the brute force model it’s easy – set is_expired to “y” for those rows. With the self-expiring data pattern you have to find the rows and update expires_ts to some date in the past. I have some objections to this, since you are altering data in a way that makes it invalid: it didn’t really expire yesterday, you are just saying it did so it doesn’t show up on the live site any more. Reports on that data might now include it erroneously, etc. This could be a minor point or a large problem depending on your system requirements.
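
A sketch of what that bulk back-dating looks like (the WHERE clause here is hypothetical – whatever identifies your batch):

UPDATE messages
SET expires_ts = sysdate - 1, update_ts = sysdate
WHERE from_user_id = 42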

Something Not To Consider:

More isn’t better. I do not recommend putting both an is_expired and an expires_ts column on the table, because it’s asking for trouble when the data inevitably gets corrupted. In other words, what happens if is_expired somehow gets set to “y” but expires_ts is in the future? What should the code do? What about the guy who’s trying to track down a bug and isn’t sure if that’s normal? It’s bad news. Avoid it.

I think it’s clear that the trade-off is worth it in almost all cases – the self-expiring data pattern is incredibly useful and certainly a huge improvement over the brute force method.
