In any database, whether it’s SQL Server, Oracle, Sybase, DB2, Access (does Access count?), or anything else, the primary bottleneck is disk IO. Even if you have an SSD, the disk is the slowest part. In those databases we use indexes to alleviate this. Their query engine uses the index to find exactly where on the disk to start and stop reading, which keeps it from scanning entire tables to find every answer.
Netezza has an entirely different solution to the IO problem (in fact, it has no indexes). It cuts the cost of disk reads by doing the work in parallel. Simply put, it’s a brute force solution to the problem using specialized hardware, but that’s just the start.
The number of disks and S-Blades vary depending on the Netezza model. The N3001 for example has 240 disk drives for user data, and 7 internal S-Blades. Let’s walk through how all this ties together and makes it fast:
First your data is compressed and spread across many, many hard drives. This is how the disk IO is parallelized. Netezza reads portions of the data from each drive simultaneously.
The data is read from the disks by FPGAs. These decompress the data quickly as it makes its way through, so it in the time it takes to read x KB of data from the disks, we actually get 4x to 8x KB. We have now parallelized disk IO and boosted disk IO!
The FPGAs also apply filter logic from your query. If a date range is specified in the where clause, the FPGAs can filter for it (depending on the complexity of your query). This is a fast way to cut down the total data that will need to be handled by higher-level functions.
Disk IO is now parallelized, boosted, and reduced.
Joins, aggregations, and user-defined functions are handled in the next piece of hardware, the S-Blades. Each S-Blade has its own CPUs and RAM, is directly connected to two FPGAs, and are ‘owners’ of some of the total disks. Each gets its own piece of the data and does work at the same time as the others.
The S-Blades are all connected by a high-speed internal network that IBM calls the “fabric”, which is used if they need to share information. For example, if you are joining tables each S-Blade may need to pass its part of the table around to the others.
The last piece is the host. This is the part that you interact with directly. It takes your query and parses it into instruction sets (called “snippets” by IBM). Each S-Blade gets a copy of the snippets, processes them (the S-Blades are also called ‘SPUs’, for snippet processing unit), and returns their data to the host. The host puts the returned sets together and sends them on to you.
All this hardware works together so you can handle data on a massive scale. You can load gigs of data in minutes. You can scan billions of rows in seconds.
I’ve been working with the machine for almost 2 years now. I sometimes hear of developers or DBAs wrangling tables with tens of millions of rows and I think, “What’s the issue?” Netezza has ‘skewed’ how I think of large data (pun intended, Netezza users will get it).
The largest table in my warehouse right now is over 8.7 billion rows, and it’s a non-issue. It’s just as easy to handle as the tables with tens of thousands of rows. Earlier today I selected distinct on two columns from that table, and it took 20 seconds (result set 22 million rows). If I needed to do this on a SQL Server it might be impossible. I would have to devise code to do it in stages, or simply let the machine crunch on it for however long it takes and hope it doesn’t run out of memory. (And to be clear, I really like SQL Server. It is my DB of choice for application back-ends, a role which Netezza would not fulfill well).
My point is, I think Netezza has achieved the goal of handling ‘big data’. It’s very different from a traditional database (no indexes, no primary key enforcement, no foreign key enforcement). In fact, some keen people will point out that it is not a database, it’s specialized hardware with a database façade. This means that you will need to think differently when using Netezza. It’s not hard to master, but you will need to open your mind, because Netezza will defy many lessons we all learned the hard way on traditional databases.