Archive for the 'beginner' Category

Be A Data Integrity Watchdog

Funny thing happens when you start to put data into a database. It becomes important. At one point it might have seemed like a nice idea to save the visitor’s IP address. Slowly, as the system evolves, little branches of code pop up around the fact that the IP address is populated. Suddenly you find yourself in a position where you have to protect that piece of data. You can’t sit by idly and let that improperly formatted IP address bring down the whole system. You have to guard your system against these intrusions.

And the intrusions will happen. I’ve designed a number of large systems, and the only common denominator is that somehow, at one point or another, at least some of the data will get corrupted. Transactions fail, databases crash, bugs show up in the margins, users enter in stupid information, or hackers attack. I had been spending time thinking about this issue at The Sporting News, but it wasn’t until MLB that it really congealed into something useful.

In the fantasy baseball domain, a typical roster transaction leads to the addition or removal of a player from a particular manager’s roster of players. If the player is added to player P’s roster he should not be available for any other manager in that league. If he’s removed from P’s roster he should be available to all other managers (including P). A player can’t be on more than one manager’s roster at the same time.

It turns out that every once in a while something hiccups and one of these rules is violated. Over the years I learned that the single most important step to fixing the problem is to make sure it doesn’t get worse. So, for instance, imagine a manager attempts to drop a player from his roster but something goes wrong. The system shows that the player is still on the roster, but he’s also technically available to others. Now that the data is corrupt it’s crucial that the system not allow the player’s status to be modified any further. It can very quickly become an impossible problem to solve if the player is picked up by another manager, then traded to another team, then dropped, etc.

I’ve spent many, many hours fixing transactions by hand. I worked on fantasy applications for over 7 years continuously, and in that time I can’t remember a single year where I didn’t have to fix at least some transactions by hand. Let me tell you, it suuuuucks. It really suuuuucks. Sucks and blows.

With that in mind, I developed a scheme where instead of waiting to hear that some player is on two teams via the message boards I take matter into my own hands. I designed the system to check each player involved in a transaction for corruption immediately after the transaction is committed. If I detect that one of the fundamental rules were broken (e.g. owned by more than one manager), the player is immediately frozen. No further transactions on that player would be allowed until an administrator can come in and fix the issue.

So for a small incremental cost I’ve bought myself some peace of mind. And I can absolutely tell you that it paid off, time and time again. It just took a different way of thinking about the problem – being proactive versus reactive, protecting the integrity of all that data.


Wipe Your Feet Before You Come Into My House

My code is my house. I spend a lot of time in it. I fix it up, take care of it lovingly. I indent appropriately and actually spend time spacing out sections so they’re pleasing to the eye. I do this only partly because I’m obsessive compulsive. My greater motivation is that I really feel that these things matter.

Think about the word “code” for a minute. I love the word. I am a coder. I write code. What is code? Code is something that means something to the person that writes it, means something to some people/machines that read it, but means nothing to people who don’t know how to read it. Code is inherently cryptic. So the act of writing code is a struggle against entropy. Over time the code’s intent will change, its implementation will be less clear, or its documentation will drift out of sync with the actual representation.

As when you move into a house, code will never be as nice as it is on day one. Something breaks and you have to fix it quickly, leaving a hole in the wall. People come to visit and leave their shit around. Perfect code never stays perfect. So it’s critical that on day one the code is as clean and clear as it can be. And you should expect to do periodic improvements to keep entropy at bay.

Speaking practically, this implies a number of things. First and foremost, formatting matters. Spacing matters. These things help someone else determine the intention of the code you are writing. Related sections should be grouped together with spacing so someone reading knows what can be moved around and what should stay together. The goal is to make the code as pleasing to someone else’s eye as possible. We all know you are very clever, but a single line that chains together 50 method calls is impossible to decipher. Break it up and I’ll respect you more because you did it for me, not for you.

Everyone has a favorite format. The religious wars about curly braces probably consume half the storage space on slashdot’s servers. I’m not entirely above it – I’m infamous for reformatting code when I take control of it. But if I’m just visiting someone else’s code I have a strict policy that the code I write should be indistinguisable (as much as possible) from theirs. This means formatting it the way they do. Using the same naming conventions. Following their capitalization scheme. The point isn’t to show others how superior my formatting is. It’s to make sure that someone else reading the code doesn’t have an anuerism.


The Beauty of is_active and display_order

There are a couple of columns that I always include on any tables that contains that that will be displayed to front end users. Over time I’ve developed a preference that these be declared as CHAR(1) with a check constraint to limit them to y and n (or Enums in MySQL). The columns are:

  • is_active
  • display_order

The is_active column is used to determine whether that row should be included in the result set.

FROM users u
WHERE u.employer = ‘R/GA’
AND u.is_active = ‘y’

This is cheap way to support a “safe” delete. I’m very wary of ever removing non-corrupted data from a database. Besides the fact that it’s the database’s job to keep data, it’s a dangerous thing. All it takes is a simple bug in the routine that deletes the rows and you get to spend all afternoon with a DBA recovering data from tape. Or worse yet explaining to an end user why they have to re-enter their data. It’s usually not worth the “purity” of the data. The is_active column helps avoid that mess, and keeps the data around so that it can be used later for reporting, etc. Think about it – you have a business owner breathing down your neck because there’s some invalid data showing up on the front end. Which would you rather do? Update a column from “y” to “n” knowing you can change it back if you made a mistake, or delete the rows permanently and hope you didn’t mess up?

The display_order column is a acknowledgement that everyone will have an opinion on how data should be displayed on the front end and the opinion always changes as you go up the corporate ladder.

SELECT c.color as featured_colors
FROM couches c
WHERE = ‘lazy-boy-recliner’
ORDER BY c.display_order

It’s never a good idea to rely on row IDs unless you truly have to. So by having a built in display_order column we allow the business to change their mind as frequently as they want with minimal impact to our pretty code.

Both these columns embody my personal philosophy of letting the database do as much of the work as possible. It takes a while to learn to use the database as more than just a data store – to learn that it can have its own inherent, hidden logic as much as your client code can. In the end it leads to more robust data models that can stand up to changing requirements and emergencies.


Dangerous Style

Slow week, so thought I’d vent. The top 3 things that seem like good ideas at the time:

  1. Using IDs in WHERE clauses when a CHAR or VARCHAR column could be used. This is another of those pesky “proper design” practices. As a developer, you should always assume that a row’s ID can change at any time. This is most tempting when joining to lookup tables, where you think the IDs will never change. That might be true in the development environment, but maybe the IDs turn out to be different in the production environment for any number of reasons. All your SQL breaks. It’s always safer to use a unique identifying string value in joins where possible. And if your lookup tables weren’t designed with a unique key on a string column, you should create one. It’s much much safer to do:

    SELECT FROM people p, groups g WHERE p.group_id = AND = ‘friends’

    than it is to do:

    SELECT FROM people p WHERE p.group_id = 1

    At that rate, why have a lookup tables there in the first place? (No, I’m not really suggesting this, although I have worked on systems where this was commonplace).

  2. Misuse/Overuse of IN clauses. When I’m writing ad-hoc queries I tend to throw around the IN clause liberally because it’s usually quicker than a join. In production code, you should severely restrict your use of it. For one, it’s not the most effiecient clause for the database to execute. For another, it’s error prone.Here’s the wrong way to use an IN clause:

    SELECT FROM people p WHERE p.group_id in (SELECT id FROM groups WHERE strangers = ‘n’)

    I guess some developers have the impression that it’s easier for the DBMS to optimize that because it looks like 2 separate queries squashed together. In most non-trivial cases it’s not easier to optimize. Here’s the right way to use an IN clause:

    SELECT FROM people p, groups g WHERE p.group_id = AND g.strangers in (‘n’, ‘notsure’)

    In most correct uses, it could be conceptually replaced with one or more OR clauses. So lay off the IN clauses please.

  3. SELECT * FROM …. queries. There is no reason to ever SELECT * from any table. No reason. It’s a lazy, horrible practice. Spend another 3 minutes and protect yourself from the bugs that will eventually arise when someone adds or modifies a column to that table. If you’re really too lazy to type out the column names, you can do something like this:

    SELECT column_name FROM user_tab_columns WHERE table_name = ‘MESSAGES’

    The table USER_TAB_COLUMNS is part of Oracle’s data dictionary, which I’m hoping to cover more in future posts. That’s just one way to be lazy and productive at the same time.


Proper Use of Check Constraints and Enums

Something that comes up from time to time is the proper use of Oracle column check constraints or MySQL enum type columns. Typically the question is when it’s appropriate to use those features to enforce data integrity as opposed to having foreign keys to a lookup table.

For example, here is a table that uses MySQL enumeration values to enforce that the the log message level is either “debug”, “warning”, or “error”:

CREATE TABLE log_messages
,message_level ENUM(‘debug’, ‘warning’, ‘error’)
,message VARCHAR(255)
,create_ts TIMESTAMP default sysdate() NOT NULL
,update_ts TIMESTAMP)

The first consideration is whether the enumeration of message types is locked down to those three for the lifetime of the application. In the above case, that seems perfectly within the bounds of solid design practices. There might be one or two conceivable states added, but that’s what keeps developers and DBAs employed. On the other hand, if the list of types will grow gradually there’s a problem, as in the below table:

CREATE TABLE log_messages
,message_level ENUM(‘debug’, ‘warning’, ‘error’)
,job_name enum(‘send_emails’, ‘clean_orders’, ‘calc_revenue’)
,message VARCHAR(255)
,create_ts TIMESTAMP default sysdate() NOT NULL
,update_ts TIMESTAMP)

It should be clear that the list of jobs will gradually grow to include ones that you couldn’t conceive of when you designed the table. It’s certainly quicker to use an enumeration here, but in the long run it’s going to cause headaches. The preferred redesign involves 2 tables:

CREATE TABLE log_messages
,message_level ENUM(‘debug’, ‘warning’, ‘error’)
,message VARCHAR(255)
,create_ts TIMESTAMP default sysdate() NOT NULL
,update_ts TIMESTAMP)

,name VARCHAR(20)
,create_ts TIMESTAMP default sysdate() NOT NULL
,update_ts TIMESTAMP)

Where log_messages.job_id = Of course, this points out one more problem with the overuse of check constraints or enumerations. What happens if you want to reuse that column value in another table as a foreign key? For instance, you might have another table that stores the total run time for each job. You would have to add the same constraint/enum on that other table, and make sure they stay in sync as the system evolves. This duplication should raise the hackles of any experienced developer.

So the rule of thumb is that it should be used sparingly, in cases where the enumeration values are unlikely to change frequently, or in places where you are sure you’ll never need to reuse the values as foreign keys in other tables. One of my favorite places to use this is for boolean yes/no columns such as:

,name VARCHAR(255)
,is_active ENUM(‘y’, ‘n’) default ‘y’ NOT NULL
,create_ts TIMESTAMP default sysdate() NOT NULL
,update_ts TIMESTAMP)

Obviously there would be little reason to extend the ‘y’ and ‘n’ values. And there would not be a meaningful reason to externalize those values so they could be reused in other tables.

Something to Consider

Keep in mind that the MySQL enum type has some other features that might come in handy, such as the ability to index each enumeration value using a number. That may also influence your decision to use it in places where you otherwise shouldn’t depending on the application


« Previous Page