Top 10 (or 11) Data Mistakes

Over my career I’ve seen many creative technical solutions in databases. Now I can share some of the worst of them with you!

Disclaimer: I may have been the architect of some of these. Sorry DBA team!

“Always left join, otherwise you lose data.”

A developer told me this when I was just starting out in the IT field. Yes, I’m pretty sure he was serious. I’ve seen his code.
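For the record, a left join keeps every row from the left table, padding the right side with NULLs where there is no match; an inner join returns only the rows that match. Neither one “loses data”; they answer different questions. A quick sketch with hypothetical tables:

-- every customer, including those with no orders (a left join is correct here)
select c.CustomerId, count(o.OrderId) as OrderCount
from dbo.Customer as c
left join dbo.CustomerOrder as o
	on o.CustomerId = c.CustomerId
group by c.CustomerId;

-- only customers that actually have orders (an inner join is correct here)
select distinct c.CustomerId
from dbo.Customer as c
inner join dbo.CustomerOrder as o
	on o.CustomerId = c.CustomerId;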

Using CHAR(1) for Boolean values instead of bit or Boolean types.

A boolean has two possible values. A char(1) has 256 possible values (257 if you count NULL, though NULL technically isn’t a value). If your code expects only two possible values, this opens the door to potential bugs. If you’re expecting either a ‘T’ or an ‘F’, what happens when you get an ‘X’? Does your code fail, or does it silently ignore the row?

I have also seen developers take advantage of a single character True/False field by giving it additional meanings over time, such as adding ‘P’ for ‘processing’ or ‘N’ for ‘not applicable’. The meaning of a column should not change over time.

SQL Server has a bit type. PostgreSQL has an actual boolean type. Use them.

Oracle, for whatever reason, doesn’t have bit or boolean, so you are stuck with char(1), but you can enforce restricted values through a check constraint.
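A minimal sketch of that constraint (table and column names are hypothetical):

create table Supplier
	( SupplierId	int	not null primary key,
	  IsActive	char(1)	not null check (IsActive in ('T', 'F')) );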

Using NVARCHAR when you don’t need it

varchar uses 1 byte per character. nvarchar uses 2 bytes per character. If your system is only going to be using one language’s characters, then you only need varchar (for most languages). If you need to support multiple languages, use nvarchar.

You don’t need to waste the space if you’re operating in only one language.
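In SQL Server you can see the difference directly; DATALENGTH returns the storage size in bytes (assuming a typical single-byte collation for the varchar):

declare @v varchar(10) = 'hello';
declare @n nvarchar(10) = N'hello';

select	datalength(@v) as VarcharBytes,		-- 5
	datalength(@n) as NvarcharBytes;	-- 10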

Using other systems’ surrogate keys as your keys

We often copy tables from one system to another. These are usually some sort of reference data, such as lists of stores, items, or suppliers. These tables often have their own surrogate keys, and it is sometimes tempting to mix those surrogate keys into your own system’s tables, so that your system’s SupplierId column holds the same values as the source system’s.

This is very dangerous, because the surrogate keys can change!

I have seen it many times: at some point the users of the source system need to modify one of their entities, and they decide that the easiest and cleanest way to do it is to delete it and recreate it. When that happens, your data has just become orphaned!

Or they may repurpose one. Say SupplierId 12 was once ‘ACME’, but your company stopped doing business with ACME, and the users of the source system decided to rename ‘ACME’ to your new supplier ‘A-1 Stuff’. Now every row where your system used ‘12’ to refer to ‘ACME’ refers to ‘A-1 Stuff’ instead, just as if someone had performed a mass update on your database.

Even in systems where the developers and users raise their right hand and take an oath that this will never happen, I have seen it happen.

It is much better to copy the data locally and use your own surrogate keys, and join to the source system on a natural key to update or append your reference data occasionally. Then, even if they delete an entry forever from the source system, your data’s integrity will be preserved.
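A sketch of that occasional refresh (every name here is hypothetical), joining on the natural key instead of the source system’s surrogate key:

-- our local dbo.Supplier keeps its own surrogate key (an identity column);
-- we match the source system only on the natural key, SupplierCode
merge dbo.Supplier as target
using SourceSystem.dbo.Supplier as source
	on target.SupplierCode = source.SupplierCode
when matched then
	update set target.SupplierName = source.SupplierName
when not matched by target then
	insert (SupplierCode, SupplierName)
	values (source.SupplierCode, source.SupplierName);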

Columns with double meanings

[Image: a table of matches and scores, with RankedGoals and UnRankedGoals columns]
This table has a list of matches and scores. If it was a ranked match, the goals are counted in the ‘RankedGoals’ column. Otherwise, they are counted in the ‘UnRankedGoals’ column. Make sense? You can tell whether it was ranked or unranked based on which field is populated.

Now, was match 102 a ranked match, or not?
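If neither column distinguishes the two cases (say, a scoreless match), you simply can’t tell. One fix, with hypothetical names, is to record the match type explicitly instead of inferring it from which column is populated:

create table MatchScore
	( MatchId	int	not null primary key,
	  IsRanked	bit	not null,	-- explicit flag, no guessing
	  Goals		int	not null );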

Using with (nolock) in SQL Server

I was guilty of this myself until someone explained the problems it presents, and since then I’ve spent a lot of time convincing others that it shouldn’t be used. The hard part is that it really does return results fast on systems that are very busy, which makes it a tempting habit!

But when rows are being inserted or updated, SQL Server is moving things around in the background before the transaction completes. For example, if a page split occurs while you’re reading with nolock, you might get the rows on that page, you might miss them entirely, or you might get them twice because they exist on two pages at that exact moment.

Your results can be way off the mark because you’re reading the data in whatever state it happens to be in at that moment.
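For reference, the hint looks like this (table name hypothetical), and it has the same effect as reading at the READ UNCOMMITTED isolation level:

select OrderId, OrderTotal
from dbo.CustomerOrder with (nolock);	-- dirty reads: rows may be missing, doubled, or uncommitted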

Putting tables for multiple systems in a single database

It doesn’t take much space to create a new database. Yet many times I’ve seen someone put a table for their process in an existing database that belongs to a specific existing system.

They say, “But it’s only one table, and it’s kind of related. It uses data from that system.”

But this creates several issues:

  1. It’s harder for the developer of the existing system to make major database modifications. What if they want to move to a new server? Or a new RDBMS platform? Or enforce some specialized indexing scheme on all their tables? Now they have to work around your table.
  2. It’s harder for the DBAs to maintain. “We need to take the database offline for an upgrade. What do you mean it will take this other system down? Why is it also using this database?”
  3. If the database is for 3rd party software, all bets are off. They might have an upgrade that deletes unknown tables, or might claim that it violates their support contract and they may refuse support. Some vendors are protective of their databases and don’t want companies to ‘customize’ them.

Using NULLs

If at all possible, columns should be non-nullable. When I see a NULL it opens a whole set of questions: “What does NULL mean here? Was the attribute not recorded because it does not apply in this instance, or because of a software bug? Or does NULL have some other special meaning here?”

Some developers will try to give special meaning to NULLs. For instance, they might say that the absence of an attribute means this entity is in an inactive / incomplete / pre-whatever state.
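If a state matters, give it a column of its own instead of smuggling it into a NULL. A sketch with hypothetical names, using NOT NULL plus defaults:

create table Account
	( AccountId	int		not null primary key,
	  Status	varchar(10)	not null default 'Pending',	-- the state is explicit, never NULL
	  CreatedAt	datetime2	not null default sysutcdatetime() );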

Bitflag fields

Back in the programming days of yore, space had to be conserved carefully. One cool trick was to use bitflags. If you had several boolean variables, instead of storing each one as a boolean type (which took 1 byte) you could store them each as bits instead, so you could pack 8 boolean variables in a byte.

For example here is a byte of data where each bit is a boolean attribute:

bit value   attribute            stored bit
128         IsRankedMatch        0
64          IsTournament         0
32          AllPlayersHaveDLC    0
16          OptimalConnections   1
8           CrossPlayEnabled     0
4           AllowSpectators      0
2           IsInOvertime         1
1           AllowJoiningInMatch  1

Stored as an integer this is ‘19’.

I’m sure there are still practical applications for this technique, such as devices with little memory to work with or maybe high-performance graphics processing. But in a database, bitflags should be parsed into separate columns. A single column that can mean many different things is problematic, and every query that needs to check one of the attributes in the bitflag has to parse it out. That isn’t especially difficult, but you have to keep a reference list defining what each bit means instead of letting the columns themselves define each attribute.

Here’s an example of handling a bitflag in SQL Server:

declare @bitFlag int = 19;

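-- decode each flag with bitwise AND (&); a nonzero result means that bit is set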
select	case when @bitFlag & 128 != 0 then 1 else 0 end as IsRankedMatch,
	case when @bitFlag & 64 != 0 then 1 else 0 end as IsTournament,
	case when @bitFlag & 32 != 0 then 1 else 0 end as AllPlayersHaveDLC,
	case when @bitFlag & 16 != 0 then 1 else 0 end as OptimalConnections,
	case when @bitFlag & 8 != 0 then 1 else 0 end as CrossPlayEnabled,
	case when @bitFlag & 4 != 0 then 1 else 0 end as AllowSpectators,
	case when @bitFlag & 2 != 0 then 1 else 0 end as IsInOvertime,
	case when @bitFlag & 1 != 0 then 1 else 0 end as AllowJoiningInMatch;

Overusing Triggers

Triggers are a special tool that should be used sparingly, usually for audit purposes. But I’m sure we’ve all seen someone (or been someone) who learns about triggers and becomes enamored with using them for all sorts of processing flow.

I have seen triggers that sent emails. I have seen triggers that fired off batch processes. I have seen recursive triggers that caused the database to crash. I have even seen a trigger that set transaction isolation level read uncommitted and then ran a cursor over millions of rows.
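For contrast, here’s roughly what a trigger that earns its keep looks like, a minimal audit sketch (all names hypothetical):

create trigger trSupplier_Audit
on dbo.Supplier
after update
as
begin
	set nocount on;

	-- capture the prior values; no emails, no batch jobs, no cursors
	insert dbo.SupplierAudit (SupplierId, SupplierName, ChangedAt)
	select d.SupplierId, d.SupplierName, sysutcdatetime()
	from deleted as d;
end;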

BONUS ENTRY!

Clustered GUID Primary Keys

GUIDs have their place; I’m not saying don’t use them. And they can be a primary key; that’s fine too. But in SQL Server, if you create a table through the GUI, or don’t specify a clustering preference in your DDL script, the primary key will be created as clustered by default.

With a clustered key, SQL Server keeps the physical rows in the same order as the key values. GUIDs, however, are not sequential, so as you insert rows SQL Server has to constantly rearrange physical rows and split pages in order to maintain your clustered key order.

In your DDL, make sure to specify nonclustered:

create table NonClusteredPKExample 
	( PKColumn uniqueidentifier not null primary key nonclustered );

If you’re using the GUI you have to right-click the PK column and select ‘Indexes/Keys’ and you will see the option ‘Create As Clustered’. Make sure it is set to ‘No’.

[Screenshot: the Indexes/Keys dialog, with ‘Create As Clustered’ set to ‘No’]

2 comments

  1. Using ‘with nolock’ on SQL Server

    Sorry – I’m going to have to disagree with this one. I have customers all over the map, big, little, and in-between, that do this because of the shared locks that SQL Server acquires. I have seen very real issues due to the locking and lock escalation.

    Conversely, the page split issue you mention is borderline theoretical. Yes, it is _possible_ to have a dirty read, but I can’t think of a time over the past decade when that’s actually been an issue. More than anything, it comes down to knowing your data.
