Zone Maps: Netezza’s Answer to Indexes in Big Data

There are no indexes in Netezza. This is an intentional design choice. Indexes are effective in an OLTP system, where transactions affect a row or two at a time, but not in an OLAP system, where queries touch large sets of data. Netezza is designed to be a fully OLAP-focused system, in both hardware and software, and indexes did not fit that goal.

The Cost of Indexes in Big Data

Index Maintenance

In an OLTP system data is loaded in small batches, often single rows, and the system may handle thousands of transactions at a time. Each of these transactions requires any related indexes to be updated. In many cases the cost is imperceptible, but in more extreme situations maintenance windows and even system downtime might be required.

In a data warehouse we load large batches of data, often millions of rows at a time. On this scale it can take more time to update the indexes than it takes to write the rows to disk (I recently saw a table with 18.5GB of indexes, only about 5GB less than the table itself). As the table grows so does index maintenance, and more so with every new index. At some point we will cross the threshold of diminishing returns.

Limited Use

Indexes perform great for small, precise selections. We can quickly find a specific record in a table even if it has millions of rows. The index is a fraction of the size of the table, and can be searched or scanned more quickly than the table itself. But when a table becomes very large (tens of millions of rows, hundreds of millions, billions…) the indexes also become large. On a large index a seek can be slow, and a scan might take minutes.

Analytic queries rarely seek a single row; instead they consume large swaths of data. When a query is aggregating a hundred million rows over years of data, an index may not be useful. The database engine might rightly decide that skipping the index and doing a full table scan is quicker.

Verdict: Indexes are Insufficient for Large-Scale OLAP

A traditional database is great at high-speed transactional processing, but not at large-scale data acquisition and reporting. Although these systems have served well in the past, the amount of data needed for analytics is growing at a rate that outpaces the technology. Indexes are meant to reduce disk IO, but they are now becoming so large that they present disk IO problems of their own.

The Netezza engineers knew that their OLAP-specific database needed a new solution for large-scale data filtering, one that could work for finding hundreds of rows out of thousands just as easily as millions out of billions.

Zone Maps

Tables are stored on 128KB pages, which are grouped into 3MB extents. A zone map is the high and low values of the columns in each page or extent (Netezza decides when it is more effective to keep maps at the page level or extent level).
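As a rough mental model (a Python sketch, not Netezza's internal format), a page-level zone map is just the minimum and maximum of a column recorded per fixed-size chunk of rows. The rows-per-page count below is invented purely for illustration; real pages are 128KB of storage, not a row count:

```python
# Conceptual sketch: build a page-level zone map by recording the min
# and max of a column for each fixed-size group of rows.
PAGE_ROWS = 4  # hypothetical rows-per-page, chosen for illustration

def build_zone_map(values, page_rows=PAGE_ROWS):
    """Return a list of (page_number, low, high) entries."""
    zone_map = []
    for start in range(0, len(values), page_rows):
        page = values[start:start + page_rows]
        zone_map.append((start // page_rows, min(page), max(page)))
    return zone_map

ids = [3, 7, 1, 5, 12, 9, 15, 11, 20, 18, 25, 22]
print(build_zone_map(ids))
# [(0, 1, 7), (1, 9, 15), (2, 18, 25)]
```

Note that each entry is tiny compared to the page it describes, which is why a SPU can afford to scan its whole zone map table for every query.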

Each SPU has a table of zone maps for the disks that it controls (if you’re unfamiliar with the Netezza hardware, please see my Intro to Netezza). When a query executes, the SPUs each scan their zone map tables; if a needed value falls within a range on the zone map, the FPGA will read those pages.

Netezza Zone Map Basic Diagram

Many sales materials or guides that I have seen describe zone maps as an “anti-index”, or as something that tells Netezza “where the data isn’t”, but this is not an accurate description. The patent for zone maps states:

Whereas an index specifies where a particular piece of information is, the nearly ordered map [zone map] … specifies a range or set of ranges where a particular piece of information may be.

Limiting scans of loosely ordered and/or grouped relations using nearly ordered maps

A page is read whenever it might contain needed data. In the example above, there may not actually be any rows with an ID value of 12; all the zone map does is tell the SPU that page 3 might have rows that it needs.

This is an efficient design in Netezza’s hardware. Zone map usage is independent and parallelized across SPUs.
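The pruning step itself is simple range arithmetic. Here is a hedged sketch (the zone map entries are made up, matching the ID-value-of-12 example above): a page is selected whenever the target falls inside its [low, high] range, even when no row in the page actually holds that value, which is the "false positive" the text describes.

```python
# Sketch of zone map pruning for "WHERE id = 12". A page must be read
# whenever the target falls within its [low, high] range; the page may
# still contain no matching row (a false positive the FPGA filters out).
def pages_to_read(zone_map, target):
    return [page for page, low, high in zone_map if low <= target <= high]

# Hypothetical zone map: (page_number, low, high). Ranges may overlap.
zone_map = [(1, 1, 7), (2, 8, 11), (3, 10, 16), (4, 18, 25)]

print(pages_to_read(zone_map, 12))  # [3]
print(pages_to_read(zone_map, 10))  # [2, 3] -- overlapping ranges both qualify
```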

Zone Maps do not Require Ordered Data

The diagram above is a bit misleading because all of the ranges are in ascending order. If you order your data as you insert it, the zone map will look like this, and every row in those pages will be in order. However, that ordering doesn’t matter; remember, zone maps are not indexes.

The SPUs consult their zone map tables to decide which extents or pages to read, and then the FPGAs read those pages in their entirety; they never extract a single row from disk. So the rows within a page do not need to be in order, and in practice they generally won’t be. Row order within a page has no performance effect.

Netezza Zone Map Basic Diagram 2

Here the SPU that holds these pages will figure out that it needs to read page 3. The FPGA will decompress the rows and check whether any of them actually have a DATE value within the query range and a matching SITE_NUMBER (this happens so fast as to be indistinguishable from a simple disk read), and only relevant rows will be passed up to the SPU for further processing.

Additionally, if we re-ordered the pages above so that page 4 came first, page 2 came third, and so on, the same page selected above (wherever it ended up) would still be the only page read.
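This reorder-invariance falls out of the model directly: each zone map entry carries its own page id and range, so shuffling the physical page order cannot change which pages qualify. A quick sketch (with the same invented zone map entries as before):

```python
# Sketch: physically reordering pages does not change which pages a
# zone map selects, because each entry carries its own page id and
# [low, high] range. The selected set is identical either way.
def pages_to_read(zmap, target):
    return {page for page, low, high in zmap if low <= target <= high}

zone_map = [(1, 1, 7), (2, 8, 11), (3, 10, 16), (4, 18, 25)]
reordered = [zone_map[3], zone_map[1], zone_map[0], zone_map[2]]

print(pages_to_read(zone_map, 12) == pages_to_read(reordered, 12))  # True
```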

Zone Maps can Cluster Similar Data

In the example above you probably noticed that the DATE values overlap between multiple zone maps. This is typical. The more perfectly contiguous the zone map values are, the more ‘selectable’ the field is rated (this is called selectivity).

Let’s use a table (SALE_TRANSACTION) with columns DATE_ID, STORE_ID, ITEM_ID, and SALES_AMOUNT, and fill it with some test data (just shy of 3 billion rows), with a wide spread of values for the columns. We’ll distribute on random with no ordering as we load it, which should get us zone maps with lots of repetitive ranges. Then we’ll take a look at some of the extents:

select	_extentid,
	min(DATE_ID) as LOW_DATE_ID,
	max(DATE_ID) as HIGH_DATE_ID,
	min(STORE_ID) as LOW_STORE_ID,
	max(STORE_ID) as HIGH_STORE_ID,
	min(ITEM_ID) as LOW_ITEM_ID,
	max(ITEM_ID) as HIGH_ITEM_ID,
	count(*) as RECORDS_IN_EXTENT
from	SALE_TRANSACTION
where	datasliceid = 1 -- limit to one disk, just for ease of example
group by
	_extentid;

Actual Netezza Zone Map Figure 1

As you can see, the same values are repeated across many extents. The selectivity of DATE_ID is 0.14% according to Aginity.  As an experiment, let’s find out how many pages would need to be read if we were searching for a certain date range:

create temp table T_PAGE_DATA as 
	select	datasliceid || '_' 
			|| _extentid || '_' 
			|| _pageid as UNIQUE_PAGE_ID,
		min(DATE_ID) as MIN_DATE_ID,
		max(DATE_ID) as MAX_DATE_ID,
		min(STORE_ID) as MIN_STORE_ID,
		max(STORE_ID) as MAX_STORE_ID
	from	SALE_TRANSACTION
	group by
		UNIQUE_PAGE_ID;

-- How many pages will need to be scanned for a given date range?
select	count(distinct UNIQUE_PAGE_ID) as PAGE_COUNT
from	T_PAGE_DATA
where	MIN_DATE_ID <= 20150331
and	MAX_DATE_ID >= 20150101;

This tells us that to find all values in the date range 1/1/2015 to 3/31/2015 would require 321,818 pages to be read (39GB of disk IO). That is how many pages Netezza would need to check based on a page-level zone map.

Let’s try re-ordering data in the table by DATE_ID and see what happens:

Actual Netezza Zone Map Figure 2

The date column is almost perfectly contiguous across extents. The selectivity of DATE_ID is 99.62%.

Now the test query shows that only 29,598 pages need to be read (about 3.5GB). The zone maps have reduced the disk IO necessary by about 90%!
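The effect can be reproduced in miniature. Below is a Python simulation with invented numbers (100,000 rows, 1,000 rows per page, integer date-keys over 2014–2015); none of these figures are Netezza's, but the overlap test mirrors the SQL above, and sorting makes each page's [min, max] range narrow enough that a date-range predicate touches only a handful of pages:

```python
# Sketch comparing page pruning on unordered vs. date-ordered data.
# Row counts, page size, and the date-key domain are invented; gaps in
# the integer domain (invalid dates) don't affect the comparison.
import random

def overlapping_pages(values, page_rows, lo, hi):
    """Count pages whose [min, max] range overlaps [lo, hi]."""
    hits = 0
    for i in range(0, len(values), page_rows):
        page = values[i:i + page_rows]
        if min(page) <= hi and max(page) >= lo:  # same test as the SQL
            hits += 1
    return hits

random.seed(0)
dates = [random.randint(20140101, 20151231) for _ in range(100_000)]

unordered = overlapping_pages(dates, 1000, 20150101, 20150331)
ordered = overlapping_pages(sorted(dates), 1000, 20150101, 20150331)

# Unordered data: nearly every page's range spans the whole domain, so
# almost all 100 pages must be read. Ordered data: only the few pages
# actually covering Q1 2015 qualify.
print(unordered, ordered)
```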

If we commonly query on DATE_ID or STORE_ID, then we might want to find a happy medium where both DATE_ID and STORE_ID are as contiguous as possible. This is exactly what organize on does for us. Let’s create an organized version of the table. After creation you will need to groom the new table to set up the initial clustering. The initial clustering may be slow, but once it is done Netezza will automatically maintain the zone maps.

create table SALE_TRANSACTION_ORGANIZED as
select	*
from	SALE_TRANSACTION
distribute on
	random
organize on
	(DATE_ID, STORE_ID);

groom table SALE_TRANSACTION_ORGANIZED reclaim backupset none;
generate express statistics on SALE_TRANSACTION_ORGANIZED;

And the results:

Actual Netezza Zone Map Figure 3

Netezza has clustered the values together so that as few extents as possible contain each unique combination of STORE_ID + DATE_ID. The selectivity of DATE_ID has dropped to 90.90%, but STORE_ID has come up to 91.25%.

If we run our query again it requires 47,879 page reads (selectivity on DATE_ID went down slightly). But if we include a store in the query:

select	count(distinct UNIQUE_PAGE_ID) as PAGE_COUNT
from	T_PAGE_DATA
where	MIN_DATE_ID <= 20150331
and	MAX_DATE_ID >= 20150101
and	219 between MIN_STORE_ID and MAX_STORE_ID;

…now only 1,639 pages must be read.

So there is some ‘give and take’ to zone maps, but the performance boosts can be enormous. Some fields may be complementary, but others may not work well together. Consider if I had organized on SALES_AMOUNT: a measure like that is almost never used as a filter, so the clustering effort would buy us nothing. Experiment with your data and see how it behaves.

Zone Mappable Columns

Not every data type can be used with zone maps. Zone maps will always be created for these:

  • date
  • timestamp
  • byteint
  • smallint
  • integer
  • bigint

Zone maps are also created for:

  • rowid
  • createxid

You can create zone maps for these additional types if they are used in an organize on clause. Zone map range values are stored as bigints, so for larger data types Netezza will only use the first 8 bytes:

  • char / varchar / nchar / nvarchar (first 8 bytes only)
  • numeric (first 8 bytes only)
  • float
  • double
  • boolean
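To make the "first 8 bytes" idea concrete, here is a hedged Python sketch (Netezza's actual on-disk encoding may differ): packing a string's first 8 bytes into a big-endian 64-bit integer gives a key whose ordering matches the byte ordering of the prefixes, so min/max keys can still bound a zone map range. The consequence is that values differing only after the eighth byte are indistinguishable to the zone map:

```python
# Sketch: derive a bigint zone map key from the first 8 bytes of a
# string. Big-endian packing preserves byte-order comparisons.
def prefix_key(s: str) -> int:
    b = s.encode("utf-8")[:8].ljust(8, b"\x00")  # truncate/pad to 8 bytes
    return int.from_bytes(b, byteorder="big")

print(prefix_key("apple") < prefix_key("banana"))           # True
print(prefix_key("warehouse_a") == prefix_key("warehouse_b"))  # True: same first 8 bytes
```

This is why organizing on a long varchar whose values share a common prefix (URLs, code-prefixed keys) can leave the zone map unable to discriminate between them.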
