Inside Netezza are specialized processing units called SPUs, which have their own sets of disks, each containing a portion of your data. The SPUs work in parallel to return the answer you need. But if one of them has more data than the others, then we will be waiting on just that one to finish while the others are already done and sitting idle. The workload is unbalanced because the data is skewed. To conquer big data, we need to divide work evenly.
What is a Distribution?
Distributions are NOT like clustered indexes. A clustered index makes it easier to look up a few specific rows. A distribution enables us to process billions of rows efficiently in parallel.
Your Netezza machine has dozens or hundreds of hard disks in it. The distribution tells it how to allocate the data for a table across all these disks. There are two ways to go about it:
A random distribution will spread records evenly across all the disks. If your team does not yet understand distributions, use random until it does. Everywhere, including any temp tables. I cannot overstate this:
If you understand nothing about distributions, then as a best practice always specify random, including on any temp tables that you create. Netezza does not default to random; if you omit the distribute clause, it distributes on the first column of the table. I have reduced the run time of many scripts simply by adding ‘distribute on random’ to the temp tables. One transform went from 34 minutes to only 80 seconds, merely by specifying a random distribution on the temp tables it created.
You set the distribution in the DDL:
create table YOUR_TABLE_NAME_HERE (
    ID int not null
) distribute on random;
Tables distributed on random will always have a nearly perfect distribution.
When you use a column distribution, Netezza takes the values of that column and hashes them into buckets. These buckets are your dataslices, or disks. Each row is sent to a disk based on the hashed value of its distribution column. If you specify multiple columns, Netezza combines their values and hashes the combined value into buckets. This is important! If you use multiple columns in your distribution, that’s one distribution, not multiple distributions.
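Netezza’s actual hash function is internal, but the bucketing idea can be sketched in Python. This is a conceptual toy: the slice count and hash function are assumptions, not Netezza’s implementation.

```python
import hashlib

NUM_SLICES = 8  # hypothetical dataslice count; a real appliance has many more

def dataslice_for(*column_values):
    """Map distribution column value(s) to a dataslice.

    Multiple columns are combined into one value before hashing, so
    distributing on (TRAN_ID) and on (TRAN_ID, LINE_ID) are two
    different distributions, not two independent ones.
    """
    combined = "|".join(str(v) for v in column_values)
    digest = hashlib.sha1(combined.encode()).hexdigest()
    return int(digest, 16) % NUM_SLICES

# The same value always lands on the same slice:
assert dataslice_for(1001) == dataslice_for(1001)
# Adding a column changes the hash input, and usually the slice:
print(dataslice_for(1001), dataslice_for(1001, 1))
```

The key property is that placement depends only on the hashed value, so equal values always co-reside, and a multi-column distribution hashes one combined value rather than each column separately.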
Because our data is spread across the SPUs, they sometimes have to share data to complete a query. For example, if we are joining tables, each SPU may need to see every row of at least one of the tables. Netezza is usually smart about broadcasting the smaller table instead of the larger, but it can also do something even smarter.
If two tables have the same column distributions defined, then the rows with the same value will get distributed to the same SPU. This means that if the distribution columns are included in a join, Netezza knows that all the matching rows are on the same SPU, and they do not need to share information with each other.
This is called a co-located join, and it is a huge boost for large table joins. Tons of work is done in parallel. Joining billions of records takes seconds instead of hours.
We can also achieve co-located updates, inserts, and deletes. If our temp table shares the distribution of a target table, then when we join to perform updates or inserts, the SPUs do not have to share their parts of the temp table. It is really just another co-located join, but skipping the redistribution of every row to be inserted, updated, or deleted, only for it to land back on the SPU it originated from, can be a huge performance boost.
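As a toy illustration of why co-location works (plain Python with a made-up hash and slice count, not Netezza internals): if both tables are partitioned by the same hash of TRAN_ID, each slice can join its local rows alone and together they still produce the complete join.

```python
from collections import defaultdict

NUM_SLICES = 4  # hypothetical dataslice count

def slice_of(key):
    # toy stand-in for the distribution hash
    return sum(ord(c) for c in str(key)) % NUM_SLICES

header = [(t,) for t in range(1, 21)]                    # SALES_HEADER(TRAN_ID)
detail = [(t, l) for t in range(1, 21) for l in (1, 2)]  # SALES_DETAIL(TRAN_ID, LINE_ID)

# Both tables distribute on TRAN_ID, so matching rows land on the same slice.
h_slices = defaultdict(list)
d_slices = defaultdict(list)
for row in header:
    h_slices[slice_of(row[0])].append(row)
for row in detail:
    d_slices[slice_of(row[0])].append(row)

# Each slice joins only its local rows -- no cross-slice data movement.
colocated = [
    (h[0], d[1])
    for s in range(NUM_SLICES)
    for h in h_slices[s]
    for d in d_slices[s]
    if h[0] == d[0]
]

# The union of the per-slice joins equals the full join of both tables.
full_join = [(h[0], d[1]) for h in header for d in detail if h[0] == d[0]]
assert sorted(colocated) == sorted(full_join)
```

Because matching keys hash identically, no slice can be missing a match that lives elsewhere, which is exactly why the SPUs never need to exchange rows.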
How to Choose a Distribution
If you are not managing distributions at all, start using random now. Check your tables and you will likely find that some of them have data skew.
The primary goal of choosing a distribution is to spread the work evenly, not to boost query performance, although distributing well will give a boost, a large boost if co-located joins occur. The distributions preserve our capacity. If the SPUs have to share large amounts of data over and over again, then one or two queries will choke up the system. When this happens you will see one query dominating the machine, and every other query simply waiting.
You want to achieve as even a distribution as possible, so you should choose a column or group of columns with high cardinality, preferably ones that are frequently included in joins.
It does not need to be perfectly even, but be aware that if even one disk holds far more data than the others, you have a problem. That disk will be the bottleneck for your queries. This is called data skew.
Whichever SPU is reading from the overloaded disks is going to limit the speed of any query on this table.
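A quick sketch of how a heavily repeated value produces data skew (toy hash and hypothetical slice count again; the dummy-customer scenario is an invented example):

```python
from collections import Counter

NUM_SLICES = 8  # hypothetical

def slice_of(key):
    # toy stand-in for the distribution hash
    return sum(ord(c) for c in str(key)) % NUM_SLICES

# 100,000 rows where 90% carry one value of the distribution column,
# e.g. a 'walk-in' dummy customer:
rows = ["WALK_IN"] * 90_000 + [f"CUST_{i}" for i in range(10_000)]
per_slice = Counter(slice_of(v) for v in rows)

# One slice holds at least the 90,000 identical rows: data skew.
print(per_slice.most_common(3))
```

All 90,000 identical values hash to the same bucket no matter how good the hash function is, which is why high cardinality matters more than clever hashing.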
Do not distribute on a date column alone. Although this may give you a perfect distribution, this prevents parallelizing the work. All the rows for a single date value, or similar date values, will end up close together. When your users query a date or small date range, one SPU will be doing all the work, and their query will be slow. This is called process skew.
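The date problem can be sketched the same way: the data may spread across the slices, but a single-date query touches only one of them (toy hash and hypothetical slice count, not Netezza internals):

```python
from collections import Counter

NUM_SLICES = 8  # hypothetical

def slice_of(key):
    # toy stand-in for the distribution hash
    return sum(ord(c) for c in str(key)) % NUM_SLICES

# A year of daily data, 100 rows per day, distributed on the date column:
dates = [f"2023-{m:02d}-{d:02d}" for m in range(1, 13) for d in range(1, 29)]
rows = [(dt, i) for dt in dates for i in range(100)]

per_slice = Counter(slice_of(dt) for dt, _ in rows)
print(per_slice)  # spread across many slices: no data skew

# But every row for one date hashes to the same slice, so a query
# filtered to that date is served by a single SPU: process skew.
active_slices = {slice_of(dt) for dt, _ in rows if dt == "2023-06-15"}
assert len(active_slices) == 1
```

The table can look perfectly balanced in a storage report and still serialize every dated query onto one SPU, which is why data skew and process skew must be checked separately.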
You should try to use fewer columns for distribution, not more. This will allow more opportunities for co-located joins. If two tables share a distribution and you use that field in the join, they will co-locate, even if you also join on additional fields. The tables depicted in this join would co-locate:
create table SALES_HEADER (
    TRAN_ID int not null
) distribute on (TRAN_ID);

create table SALES_DETAIL (
    TRAN_ID int not null,
    LINE_ID int not null
) distribute on (TRAN_ID);

select *
from SALES_HEADER SH
inner join SALES_DETAIL SD
    on SD.TRAN_ID = SH.TRAN_ID;
These two tables are NOT on the same distribution and will NOT co-locate. The hash of TRAN_ID and the hash of TRAN_ID combined with LINE_ID are not the same:
create table SALES_HEADER (
    TRAN_ID int not null
) distribute on (TRAN_ID);

create table SALES_DETAIL (
    TRAN_ID int not null,
    LINE_ID int not null
) distribute on (TRAN_ID, LINE_ID);
These two tables also are NOT on the same distribution and will not co-locate. The names of the columns do not matter; only the hashed values do. Remember the values are hashed, and the hash of an integer TRAN_ID and the hash of a varchar TRAN_ID are not the same:
create table SALES_HEADER (
    TRAN_ID int not null
) distribute on (TRAN_ID);

create table SALES_DETAIL (
    TRAN_ID varchar(16) not null,
    LINE_ID int not null
) distribute on (TRAN_ID);
These tables DO share a distribution, but in the query below they will NOT co-locate because the distribution columns are not included in the query:
create table SALES_HEADER (
    TRAN_ID int not null,
    STORE_ID int not null,
    DATE_ID int not null
) distribute on (STORE_ID, DATE_ID);

create table SALES_DETAIL (
    TRAN_ID int not null,
    LINE_ID int not null,
    STORE_ID int not null,
    DATE_ID int not null
) distribute on (STORE_ID, DATE_ID);

select *
from SALES_HEADER SH
inner join SALES_DETAIL SD
    on SD.TRAN_ID = SH.TRAN_ID;
In this last case, even though store and date may be redundant fields, it would be valuable to include them in the join so that they will co-locate. If the tables are very large and STORE_ID and DATE_ID do not already exist on SALES_DETAIL, it may be worth adding them and including them in joins so that co-location will occur.
This is something you will need to consider in your models. Remember that Netezza is not truly a database; it is hardware designed to handle huge data sets, with a database front end that makes it easy to interact with. Thus, the rules and best practices that usually apply to our data warehouses should be treated more as guidelines than rules. What is best practice inside an RDBMS is not always best practice in Netezza, because Netezza is not an RDBMS.