One of the first things a newcomer to Netezza learns is that it does not enforce primary key or unique constraints. (It also doesn’t enforce foreign keys, but that’s a different can of worms.)
This makes massive data movements easier. Inserting a few billion rows? No problem! A typical database might spend a lot of time checking all constraints before committing the transaction, but Netezza allows you to simply throw it in there. IBM touts this in sales presentations, showing super-fast load times.
This speed is great, but it’s also a double-edged sword. Because the constraints are not enforced, duplicate data can be present in a table. And while moving billions of records quickly is wonderful, at some point we have to acknowledge that we do not want duplicate rows in our data warehouse. A transaction, invoice, survey, or anything else that happened only once should not be recorded twice or more, because reports will double-count the event or entity.
This means that the burden of ensuring data integrity falls on us.
Either the transform process must ensure that duplicates do not end up in the warehouse, or data quality checks must be run afterwards. I prefer the former: if you only look for issues after the data is in the warehouse, it may already have been used in reports, and important business decisions may have been made on inaccurate data. Depending on the circumstance, finding and cleaning bad data after it’s in the warehouse may be unacceptable.
To prevent duplicates from being inserted, we need to perform two checks as part of the transform process:
- The data set about to be inserted must contain no duplicates within itself.
- Inserting the data set must not create duplicates in the target table.
If you put your transformed data set into a temp table before inserting, this query can be used to check for duplicates. It can be modified and used for any unique constraints that need to be checked.
```sql
select cast(case when count(*) > 0
                 then 'T_MY_TEMP_TABLE DUPLICATE FOUND'
                 else '0'
            end as int)
from (
    select 1
    from T_MY_TEMP_TABLE
    group by KEY_FIELD /* , KEY_FIELD_2, etc... */
    having count(*) > 1
) U;
```
If the temp table contains a duplicate, the subquery returns at least one row, so the outer query attempts to cast the string ‘T_MY_TEMP_TABLE DUPLICATE FOUND’ to an integer and throws an error, causing the transform to fail. If not, then we can move on to the insert step.
This insert statement joins to the target table to ensure that only rows that don’t already exist are inserted:
```sql
insert into MY_FACT_TABLE ( KEY_FIELD /* , VALUE_A, etc... */ )
select T.KEY_FIELD /* , T.VALUE_A, etc... */
from T_MY_TEMP_TABLE T
left join MY_FACT_TABLE F
    on F.KEY_FIELD = T.KEY_FIELD
where F.KEY_FIELD is null;
```
The good news is that Netezza can do this very quickly. As a test I created a table with 16 billion rows, distributed and ordered on the key column. Checking for duplicate rows took just under 20 seconds.
If you do not want your transform to fail on duplicates, you can modify these steps to feed the duplicate records into error tables instead. This way you can make some data available for reporting and address the bad data at an appropriate time.
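As a sketch of that quarantine variation, the pattern below routes duplicate keys into an error table before running the anti-join insert. It uses an in-memory SQLite database purely as a stand-in for Netezza, and the error table name `T_MY_TEMP_TABLE_ERR` and the column layout are assumptions for illustration:

```python
import sqlite3

# In-memory SQLite stands in for Netezza here; the SQL pattern is the point.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("create table T_MY_TEMP_TABLE (KEY_FIELD int, VALUE_A text)")
cur.execute("create table T_MY_TEMP_TABLE_ERR (KEY_FIELD int, VALUE_A text)")  # hypothetical error table
cur.execute("create table MY_FACT_TABLE (KEY_FIELD int, VALUE_A text)")

# Incoming batch: key 2 is duplicated.
cur.executemany("insert into T_MY_TEMP_TABLE values (?, ?)",
                [(1, "a"), (2, "b"), (2, "b2"), (3, "c")])

# Quarantine every row whose key appears more than once...
cur.execute("""
    insert into T_MY_TEMP_TABLE_ERR (KEY_FIELD, VALUE_A)
    select T.KEY_FIELD, T.VALUE_A
    from T_MY_TEMP_TABLE T
    join (select KEY_FIELD from T_MY_TEMP_TABLE
          group by KEY_FIELD having count(*) > 1) D
      on D.KEY_FIELD = T.KEY_FIELD
""")

# ...remove those rows from the temp table...
cur.execute("""
    delete from T_MY_TEMP_TABLE
    where KEY_FIELD in (select KEY_FIELD from T_MY_TEMP_TABLE_ERR)
""")

# ...then run the usual anti-join insert on the clean remainder.
cur.execute("""
    insert into MY_FACT_TABLE (KEY_FIELD, VALUE_A)
    select T.KEY_FIELD, T.VALUE_A
    from T_MY_TEMP_TABLE T
    left join MY_FACT_TABLE F on F.KEY_FIELD = T.KEY_FIELD
    where F.KEY_FIELD is null
""")

loaded = sorted(r[0] for r in cur.execute("select KEY_FIELD from MY_FACT_TABLE"))
quarantined = sorted(r[0] for r in cur.execute("select KEY_FIELD from T_MY_TEMP_TABLE_ERR"))
print(loaded, quarantined)  # keys 1 and 3 load; both copies of key 2 are quarantined
```

The transform completes, the clean rows reach the fact table, and the duplicates sit in the error table waiting to be investigated.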
Whatever strategy you employ, the goal should be to keep bad data out of the warehouse, whether you fail on error, quarantine the data in error tables, or simply exclude it.