Archive for the 'pattern' Category


The Double-Decker Train Conductor Problem

One of the things I love about being a software developer is the fractal nature of our work. When we design a system we are almost always taking some piece of the universe and attempting to deconstruct it and model it so that it can run inside a computer. Examples of good (or bad) design are all around us, and our work demands that we draw on these examples to create a working piece of software. And software itself is nothing more than a bunch of bits and registers and some electricity that’s pretending to be more than the sum of its parts.

So I found myself reading Coders at Work on the 8:06 train the other day. I don’t usually catch the 8:06. The 8:06 is a double-decker train. And watching the conductor come through to collect our tickets I realized he represented a real-world example of a mutex.

This day there was only 1 conductor for both levels of the double-decker. It dawned on me that it would be very easy for someone to avoid having to pay by hanging out in the upper level and waiting for the conductor to collect the tickets from the lower level, then sneaking down to the lower level while the conductor moves to the upper level.

The lone conductor represented a flawed algorithm. There was no lock on the resource (exit door = I/O stream?). Adding another conductor could solve the leakage problem and lock the resource. But that would limit (or serialize) the free flow of passengers to and from the car.

I could probably go on exaggerating this example for a while, but I think you get the point.


Wherein I Question the Usefulness of MVC

I decided to use CodeIgniter for a PHP project that I’m working on. CodeIgniter is an MVC framework, not too unlike CakePHP. At least I imagine they’re very similar, but I can’t say for sure, as the reason I chose CodeIgniter over CakePHP was that the CakePHP documentation is a mess and I didn’t have time to wade through it. CodeIgniter has been fairly easy to work with so far. I’m sure there are tons of CodeIgniter reviews by developers like me out there, so I won’t bore you with that just yet (future post!).

This post is about Model-View-Controller (MVC) architecture. Like any developer, I’ve read countless retellings of why patterns and MVC are good for your code. True to form, I think those claims are overblown. I’ve worked with people that do everything “By the Book” and I’ve worked with people that hack everything together as best they can. Having seen both sides of it, I honestly can’t say that one made my life any better than the other. Unstructured code, if kept reined in to some degree, can be incredibly flexible and allow you to be agile in the face of rapidly-changing priorities.

For instance, I’m not above having SQL statements in a JSP file. I don’t love it. I try to avoid it if it’s going to get messy. But I don’t think it’s something to be embarrassed about. I can’t tell you how many times I’ve been able to move a change out in minutes rather than weeks because I was able to tweak a query in the JSP. No, it’s not “By the Book”. But it works, and in the end that’s what you get paid for.

My general rule of thumb is that the closer to the end user your code is the more flexible it has to be. Consider the following range of technologies that flow from the user end to the server side: HTML/CSS, Javascript, PHP/Java/Ruby, PL/SQL, database schema. HTML needs to be more flexible than Java, which needs to be more flexible than the database schema. So for every 1000 times you tweak your HTML or CSS, you might need to make a couple of changes to your backend Java. Sounds reasonable.

So coming back to MVC, one thing I’ve never understood is why the controller is responsible for selecting which view is invoked. This seems fundamentally flawed to me. In a language like Java the controller is a servlet compiled into a jar file somewhere. To change the behavior of that file you have to go through an entire release process: change code, test, promote to QA, test, promote to production, test. At MLB, a change like that took about 2 weeks from start to finish. (Obviously the situation is a little different if you use PHP, which is why I’ve decided to use an MVC framework for the PHP project).

In essence, it’s like the backend developers are saying “Move aside HTML, let the big boys make the call. We know better which file should be displayed”. You know what, they don’t and they shouldn’t. Yes, I know about Front Controllers. Yawn. Yes, I know you could easily write the system such that the flow through the views is configured using XML so it can be changed on the fly, as they did at MLB. Snore. Don’t get me started on XML for configuration. These are all solutions in search of a problem. These things can be done, but no one has really ever convinced me that they need to be done. Agility requires simplicity. Simplicity can’t be configured with XML.


The request_token Pattern

The idea behind the request token is another one of those simple-but-powerful patterns that I’ve come to rely on in various systems. I’ll jump right into an example of a case where I wanted to use it but alas I didn’t get to make the change before I left the job.

The architecture was a simple producer-consumer model. Some piece of the system was responsible for placing a row into a table and another was responsible for finding those rows and processing them. As it turns out, the system required many more consumers than producers, which I realize is not all that uncommon.

(Before you go screaming at me about “enterprise” solutions like Oracle’s Advanced Queuing or JMS, that’s not entirely the point. It’s incidental that this situation looks like a producer-consumer problem, but this pattern is more generally useful. So bear with me and think about how to apply it elsewhere.)

So, applying it to an email system where one piece of the system generates the emails and dumps them into a table and another piece of the system takes them out and sends them, you might have a table that looks like this:


CREATE TABLE email_jobs
(id NUMBER NOT NULL
,email_to VARCHAR2(255) NOT NULL
,email_subject VARCHAR2(255) NOT NULL
,email_body VARCHAR2(255) NOT NULL
,insert_ts DATE DEFAULT SYSDATE NOT NULL
,update_ts DATE
,processed_ts DATE)

You can imagine the consumer might wake up, ask for the oldest 10 items in the table, send them off in batch, and then go back to sleep. As you might expect, I had a recurring problem where 2 consumers were both attempting to pull the same item from the table and process it. In the above case, a bug like that might lead to a person getting 2 identical emails, which no one wants. There are ways to protect against these kinds of things at the database level, but in reality you just want to ensure that no 2 consumers get the same item.
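
To make the race concrete, here is a minimal sketch of the naive consumer query, assuming (per the table above) that unprocessed rows have a null processed_ts. Two consumers running this at the same moment can easily pull the same 10 rows:

SELECT *
FROM (SELECT ej.id, ej.email_to, ej.email_subject, ej.email_body
FROM email_jobs ej
WHERE ej.processed_ts IS NULL
ORDER BY ej.insert_ts)
WHERE ROWNUM <= 10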

Enter the request token. With this, each consumer produces a unique identifier and marks the rows that it wants with that value. It then requests only the rows carrying its token, making it virtually impossible for the same row to be processed by 2 different consumers.


CREATE TABLE email_jobs
(id NUMBER NOT NULL
,email_to VARCHAR2(255) NOT NULL
,email_subject VARCHAR2(255) NOT NULL
,email_body VARCHAR2(255) NOT NULL
,request_token VARCHAR2(255)
,insert_ts DATE DEFAULT SYSDATE NOT NULL
,update_ts DATE
,processed_ts DATE)

Notice the addition of the request_token column. On the application side:


//produce a unique identifier
$token = generate_token()

//mark some rows with the token, and only where request_token is already null (important!)
UPDATE email_jobs SET request_token = $token WHERE <….find oldest rows…> AND request_token IS NULL

//commit so other consumers won't see these rows as available
COMMIT

//go back and find the rows that you marked
SELECT ej.id FROM email_jobs ej WHERE ej.request_token = $token

Even if you have more than one consumer hitting that table at once, the request_token IS NULL guard means each row gets claimed by exactly one of them; the consumer that loses the race simply marks fewer rows. Therefore, unless your application is sensitive to the number of rows each consumer processes, this is completely safe in that it won’t lead to multiple consumers processing the same row.

In general, the request token pattern pre-marks some data so that it’s easy to find later on. Another example that I’ve used in the past is in account creation. What frequently happens is that you have to insert a row and then update it soon after. The problem is that the insert generates a new unique ID that the update needs to know but sometimes doesn’t have. My solution has been to pass a request token to the code that does the insert and then pass that same value to the code that does the update. As long as the request token is unique, they should both be able to address the correct row.
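
A minimal sketch of that flow; the accounts table, its columns, and accounts_seq are hypothetical, purely for illustration:

//insert a row, stamped with the token
INSERT INTO accounts (id, email, request_token)
VALUES (accounts_seq.NEXTVAL, 'someone@example.com', $token)

//later, a separate piece of code updates the same row
//without ever having seen the generated id
UPDATE accounts SET status = 'active', update_ts = SYSDATE WHERE request_token = $token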

At this point you might have the idea to create the request_token column with a UNIQUE constraint so that no two rows can have the same value. Not so fast. In an even more useful case, there have been times when I’ve had to create a bunch of rows and then manipulate them in bulk. So, for instance, create a bunch of new accounts and set their email addresses to the same value. Without a column like request_token, you’d potentially have nothing to group them by except an insert_ts or similar column. With the request_token, it becomes a very easy thing to do.
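
A quick sketch of that bulk case, again using the hypothetical accounts table from above:

//create a batch of rows that all share one token
INSERT INTO accounts (id, email, request_token)
VALUES (accounts_seq.NEXTVAL, 'temp@example.com', $token)
//…repeat for each new account in the batch…

//then address the whole batch in one statement
UPDATE accounts SET email = 'everyone@example.com' WHERE request_token = $token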


Sets – When PK’s Are Too Much

One design problem that trailed around behind me for years was how to handle the situation where the relationship between two elements doesn’t fit neatly into a traditional primary key/foreign key relationship. For example, let’s say you created a cool product like Scrabulous and you want to model the players participating in a game. It would be straightforward to model it as:


CREATE TABLE games
(id NUMBER PRIMARY KEY
,status_id NUMBER NOT NULL REFERENCES lk_status(id)
,create_ts DATE default SYSDATE NOT NULL
,update_ts DATE)


CREATE TABLE game_players
(game_id NUMBER REFERENCES games(id)
,user_id NUMBER REFERENCES users(id)
,create_ts DATE default SYSDATE NOT NULL
,update_ts DATE)

Nothing tricky here – a design like this will do the trick 99% of the time. There will be times when you have to look up a game by its status, or find all games created in the past 30 days (e.g. for reporting purposes), or find all games with a particular user_id in it, and this design handles all of those directly. A “game” and a “game_player” are concrete, first-class entities in your system.
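
For instance, sketches of those lookups (the literal values are placeholders):

SELECT g.id
FROM games g
WHERE g.status_id = 2

SELECT g.id
FROM games g
WHERE g.create_ts >= SYSDATE - 30

SELECT gp.game_id
FROM game_players gp
WHERE gp.user_id = 42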

The optimal design is a little less clear when one part of the relationship is not truly a meaningful, independent entity. For example, let’s say you are building a sports application and need a way to model a matchup of 2 teams. Let me walk you through the evolution of the design. The first attempt might be:


CREATE TABLE matchups
(id NUMBER NOT NULL PRIMARY KEY
,home_team_id NUMBER NOT NULL REFERENCES teams(id)
,away_team_id NUMBER NOT NULL REFERENCES teams(id)
,matchup_date DATE NOT NULL
,create_ts DATE default SYSDATE NOT NULL
,update_ts DATE)
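
A sketch of looking up one team’s matchups under this design (team 42 is a placeholder):

SELECT m.id
FROM matchups m
WHERE m.home_team_id = 42
OR m.away_team_id = 42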

This design sucks. If you want to find all the matchups for a particular team, as in the sketch above, you have to inspect both the home_team_id column and the away_team_id column. Queries using an OR aren’t as bad as outer joins, but they aren’t great and should be avoided when possible. So, next iteration. Instead of trying to cram it all into 1 row, you decide to explode it so that a single matchup is described by 2 rows, like this:


CREATE TABLE matchups
(id NUMBER NOT NULL PRIMARY KEY
,team_id NUMBER NOT NULL REFERENCES teams(id)
,is_home_team CHAR(1) NOT NULL
,matchup_date DATE NOT NULL
,create_ts DATE default SYSDATE NOT NULL
,update_ts DATE)

A little better, but something’s not quite right. A single matchup between 2 teams now produces 2 rows, but the rows aren’t linked in any way. In other words, it’s nearly impossible to find both sides of a matchup because nothing (other than possibly matchup_date) associates them. So in the next iteration we introduce the concept of a “matchup set”.


CREATE TABLE matchups
(id NUMBER NOT NULL PRIMARY KEY
,matchup_set NUMBER NOT NULL
,team_id NUMBER NOT NULL REFERENCES teams(id)
,is_home_team CHAR(1) NOT NULL
,matchup_date DATE NOT NULL
,create_ts DATE default SYSDATE NOT NULL
,update_ts DATE)

So we can now identify both teams in a matchup because their matchup_set values are the same:


SELECT m.id, m.is_home_team
FROM matchups m
WHERE m.matchup_set = 12345

This would return 2 rows. At this point, your refactoring senses start tingling. Both rows in a matchup set implicitly carry the same MATCHUP_DATE. Wouldn’t it be better to have a parent table called…well, it’s not really clear what to call it. You’d have to rename the MATCHUPS table to MATCHUP_DETAILS and then create a new MATCHUPS table as the parent. Extracting out MATCHUP_DATE, you are left with this design:


CREATE TABLE matchups
(id NUMBER NOT NULL PRIMARY KEY
,matchup_date DATE NOT NULL
,create_ts DATE default SYSDATE NOT NULL
,update_ts DATE)


CREATE TABLE matchup_details
(matchup_id NUMBER NOT NULL REFERENCES matchups(id)
,team_id NUMBER NOT NULL REFERENCES teams(id)
,is_home_team CHAR(1) NOT NULL
,create_ts DATE default SYSDATE NOT NULL
,update_ts DATE)

Doesn’t seem like a big win though. Now we have 2 tables, which creates an extra join in most situations:


SELECT m.id, m.matchup_date, md.team_id, md.is_home_team
FROM matchups m, matchup_details md
WHERE m.id = 12345
AND m.id = md.matchup_id

Now here’s the big IF: if you will ever need to reference a matchup in another table, the last iteration is the correct design. But, again, in that situation a matchup implicitly represents a first-class entity. For those situations where the matchup is a loose association of teams, though, you should consider the benefits of the previous iteration.

In a sense, this pattern inverts the last iteration’s design. Instead of teams belonging to a matchup, the matchup is a property of 2 different teams. As usual, this flexibility comes with benefits and costs. When used correctly you gain a number of things:

  1. fewer joins – there’s no parent table to have to join to all the time
  2. fewer inserts – again, no parent table to have to maintain
  3. #1 and #2 combine to reduce the chances of your data becoming corrupt
  4. there’s less meaningless data in the system – the parent MATCHUP table is mostly cruft

For this you trade the ability to use the matchup set as a foreign key. In some situations, that’s not only perfectly acceptable, it’s preferable.
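
To make the mechanics concrete, here is a sketch of creating one matchup under the set-based design; matchups_seq and matchup_set_seq are hypothetical sequences and the team ids are placeholders:

INSERT INTO matchups (id, matchup_set, team_id, is_home_team, matchup_date)
VALUES (matchups_seq.NEXTVAL, matchup_set_seq.NEXTVAL, 10, 'y', SYSDATE)

INSERT INTO matchups (id, matchup_set, team_id, is_home_team, matchup_date)
VALUES (matchups_seq.NEXTVAL, matchup_set_seq.CURRVAL, 20, 'n', SYSDATE)

The second insert reuses CURRVAL, so both rows land in the same matchup set.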


Reduce Complexity with Self-Expiring Data Columns

I thought I’d start off with a cool database design trick that I find use for frequently. Here’s the setup. You have a piece of information that needs to expire after some period of time. For example, you might send someone a message that they have to act on within 10 days or the message disappears. Here’s what the relevant table might look like (Oracle syntax):

CREATE TABLE messages
(id NUMBER NOT NULL PRIMARY KEY
,from_user_id NUMBER NOT NULL REFERENCES users(id)
,to_user_id NUMBER NOT NULL REFERENCES users(id)
,message VARCHAR2(2000) NOT NULL
,is_expired CHAR(1) default 'n' NOT NULL
,create_ts DATE default sysdate NOT NULL
,update_ts DATE)

The brute force method is to have a separate backend process that goes through every night and expires all messages that meet the following criteria:

SELECT m.id
FROM messages m
WHERE m.create_ts <= sysdate-10

Then if you want to find all unexpired messages for a particular user you would use

SELECT m.id
FROM messages m
WHERE m.to_user_id = 1234
AND m.is_expired = 'n'

All is great, until one night your process fails, and the developer who wrote the process is on vacation, and chaos ensues. You have to rerun the process, but no one’s sure how. You know how it goes. Moving parts have a tendency to break.

My preferred solution for this reworks the data model and the query to put the burden of work on the query, removing the need for the backend process. It’s not as bad as it sounds. Here’s the reworked table definition:

CREATE TABLE messages
(id NUMBER NOT NULL PRIMARY KEY
,from_user_id NUMBER NOT NULL REFERENCES users(id)
,to_user_id NUMBER NOT NULL REFERENCES users(id)
,message VARCHAR2(2000) NOT NULL
,expires_ts DATE NOT NULL
,create_ts DATE default sysdate NOT NULL
,update_ts DATE)

Notice that is_expired changes to expires_ts. So instead of having a flag to say whether the message is expired, we get a little less literal about it – a message is expired when expires_ts is in the past. One goal is met – there’s no need for a nightly process to expire the messages. If you want to find all the unexpired messages for a particular user you would use

SELECT m.id
FROM messages m
WHERE m.to_user_id = 1234
AND m.expires_ts > sysdate

Not significantly heavier for the database to process as long as there’s an index on to_user_id. Here’s a quick summary of the pros and cons at work here:

Pros:

  • Reduction of moving parts – no separate process needed to expire items manually
  • We gain system meta data – previously we only knew whether an item was expired or not. Now we know when it expired. If you throw in an index on expires_ts, you can easily track expired items across the entire system. For the price of changing the data type of one column and throwing in an index we gain more insight into our application – I love that kind of thing.
  • The code entering the data can change the expiration date according to business rules built into the system (see the sketch after this list). Here’s some homework – sketch out how the backend process (and the system in general) would have to change if the data in the MESSAGES table was intended to have different expiration dates – for example if some messages were supposed to expire in 10 days and others in 15. The self-expiring data pattern avoids that mess elegantly.
  • Removes the forced and unsavory reliance on create_ts. For example, think about what must happen if the business folks request that a message’s expiration time be extended so that it expires in 20 days instead of 10. How does the brute force method handle this? By changing the create_ts value? By relying on update_ts instead? By entering a new row? None of those options are great.
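
As promised, a minimal sketch of rule-driven expiration at insert time. The messages_seq sequence, the literal ids, and the 10-day interval are all assumptions for illustration:

INSERT INTO messages (id, from_user_id, to_user_id, message, expires_ts)
VALUES (messages_seq.NEXTVAL, 1234, 5678, 'Your move!', SYSDATE + 10)

A message that should live 15 days instead would simply be inserted with SYSDATE + 15; nothing else in the system has to change.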

Cons:

  • All queries against the MESSAGES table have to check the expires_ts column. This can lead to a lot of bugs for those unfamiliar with the data model.

Something to Consider:

I always consider the case of what would happen if a batch of rows had to be expired in bulk (say 100,000 rows). In the brute force model it’s easy – set is_expired to “y” for those rows. With the self-expiring data pattern you have to find the rows and backdate expires_ts to some point in the past. I have some objections to this, since you are altering data in a way that makes it invalid. The message didn’t really expire yesterday; you are just saying it did so it doesn’t show up on the live site any more. Reports on that data might now include it erroneously, etc. This could be a minor point or a large problem depending on your system requirements.
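
To illustrate, the bulk backdate might look something like this, with the batch criterion elided:

UPDATE messages
SET expires_ts = SYSDATE - 1
,update_ts = SYSDATE
WHERE <….criterion for the batch…>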

Something Not To Consider:

More isn’t better. I do not recommend putting both an is_expired and an expires_ts column on the table, because it’s asking for trouble when the data inevitably gets corrupted. In other words, what happens if somehow is_expired gets set to “y” but expires_ts is in the future? What should the code do? What about the guy who’s trying to track down a bug and isn’t sure if that’s normal? It’s bad news. Avoid it.

I think it’s clear that the trade off is worth it in almost all cases – the self-expiring data pattern is incredibly useful and certainly a huge improvement over the brute force method.
