An Intro to Netezza

In any database, whether it’s SQL Server, Oracle, Sybase, DB2, Access (does Access count?), or anything else, the primary bottleneck is disk IO. Even if you have an SSD, the disk is the slowest part. In those databases we use indexes to alleviate this. Their query engine uses the index to find exactly where on the disk to start and stop reading, which keeps it from scanning entire tables to find every answer.
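For contrast, here is roughly what that looks like on a traditional engine. This is only a sketch in SQL Server flavor with a made-up Orders table; the point is that the index lets the engine seek straight to the matching rows instead of reading the whole table.

    -- Hypothetical table and index, SQL Server style.
    -- With the index in place, the engine can seek to the January rows
    -- on disk instead of scanning every row in Orders.
    CREATE INDEX ix_orders_order_date ON dbo.Orders (OrderDate);

    SELECT OrderID, CustomerID, TotalAmount
    FROM dbo.Orders
    WHERE OrderDate >= '2016-01-01' AND OrderDate < '2016-02-01';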

Netezza has an entirely different solution to the IO problem (in fact, it has no indexes). It cuts the cost of disk reads by doing the work in parallel. Simply put, it’s a brute force solution to the problem using specialized hardware, but that’s just the start.

Netezza Hardware Profile
The number of disks and S-Blades varies depending on the Netezza model. The N3001, for example, has 240 disk drives for user data and 7 internal S-Blades. Let’s walk through how all of this ties together to make it fast:

First your data is compressed and spread across many, many hard drives. This is how the disk IO is parallelized. Netezza reads portions of the data from each drive simultaneously.
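The spreading happens when the table is created. Here is a minimal sketch (the table and columns are made up, but DISTRIBUTE ON is the actual Netezza clause that controls which data slices, and therefore which disks, get which rows):

    -- Hypothetical orders table. DISTRIBUTE ON picks the column whose hash
    -- decides where each row lands; a high-cardinality column keeps the
    -- spread even across all the drives.
    CREATE TABLE orders
    (
        order_id     BIGINT,
        customer_id  BIGINT,
        order_date   DATE,
        total_amount NUMERIC(12,2)
    )
    DISTRIBUTE ON (order_id);

If no good key exists, DISTRIBUTE ON RANDOM spreads the rows round-robin instead.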

The data is read from the disks by FPGAs. These decompress the data quickly as it makes its way through, so in the time it takes to read x KB of data from the disks, we actually get 4x to 8x KB. We have now parallelized and boosted disk IO!

The FPGAs also apply filter logic from your query. If a date range is specified in the where clause, the FPGAs can filter for it (depending on the complexity of your query). This is a fast way to cut down the total data that will need to be handled by higher-level functions.
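For example, a query like this sketch (same made-up orders table as above) gives the FPGAs a simple date restriction they can apply as the blocks stream off the disks, so most rows never reach the CPUs at all:

    -- The WHERE clause below is the kind of simple restriction that can be
    -- pushed down toward the FPGAs; only the surviving rows move on.
    SELECT order_id, customer_id, total_amount
    FROM orders
    WHERE order_date BETWEEN '2016-01-01' AND '2016-03-31';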

Disk IO is now parallelized, boosted, and reduced.

Joins, aggregations, and user-defined functions are handled in the next piece of hardware, the S-Blades. Each S-Blade has its own CPUs and RAM, is directly connected to two FPGAs, and ‘owns’ some of the total disks. Each gets its own piece of the data and does its work at the same time as the others.

The S-Blades are all connected by a high-speed internal network that IBM calls the “fabric”, which is used when they need to share information. For example, if you are joining tables, each S-Blade may need to pass its part of the table around to the others.
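As a sketch (hypothetical tables again), a join like the one below can be done entirely locally if both tables are distributed on the join key; if they aren’t, rows get redistributed over the fabric to meet their join partners:

    -- If orders and order_lines are both distributed on order_id, each
    -- S-Blade joins its own slice of the data with no network traffic.
    -- Otherwise one or both tables are reshuffled across the fabric first.
    SELECT o.customer_id, SUM(l.quantity * l.unit_price) AS revenue
    FROM orders o
    JOIN order_lines l ON l.order_id = o.order_id
    GROUP BY o.customer_id;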

The last piece is the host. This is the part that you interact with directly. It takes your query and parses it into instruction sets (called “snippets” by IBM). Each S-Blade gets a copy of the snippets, processes them (the S-Blades are also called ‘SPUs’, for Snippet Processing Unit), and returns its data to the host. The host puts the returned sets together and sends them on to you.
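If you are curious what the host decides to do with a query, Netezza will show you its plan. A sketch (exact syntax and output vary by version):

    -- EXPLAIN VERBOSE prints the plan the host generates, including the
    -- snippets that will be handed to the S-Blades/SPUs.
    EXPLAIN VERBOSE
    SELECT customer_id, COUNT(*) AS order_count
    FROM orders
    GROUP BY customer_id;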

All this hardware works together so you can handle data on a massive scale. You can load gigs of data in minutes. You can scan billions of rows in seconds.
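Loading is usually done through external tables or the nzload utility rather than row-by-row inserts. A minimal sketch of the external-table pattern (the file path and options are illustrative):

    -- Transient external table: Netezza streams the flat file straight
    -- into the target table, with the work spread across the S-Blades.
    INSERT INTO orders
    SELECT *
    FROM EXTERNAL '/tmp/orders.csv'
    USING (DELIMITER ',' SKIPROWS 1);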

I’ve been working with the machine for almost 2 years now. I sometimes hear of developers or DBAs wrangling tables with tens of millions of rows and I think, “What’s the issue?” Netezza has ‘skewed’ how I think of large data (pun intended, Netezza users will get it).

The largest table in my warehouse right now is over 8.7 billion rows, and it’s a non-issue. It’s just as easy to handle as the tables with tens of thousands of rows. Earlier today I ran a SELECT DISTINCT on two columns of that table, and it took 20 seconds (a result set of 22 million rows). If I needed to do this on a SQL Server it might be impossible. I would have to devise code to do it in stages, or simply let the machine crunch on it for however long it takes and hope it doesn’t run out of memory. (And to be clear, I really like SQL Server. It is my DB of choice for application back-ends, a role which Netezza would not fulfill well.)
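Roughly the shape of that query, with made-up names standing in for the real table and columns:

    -- A distinct over two columns of a multi-billion-row fact table;
    -- each S-Blade works through its own slice before the results
    -- are combined and returned through the host.
    SELECT DISTINCT customer_id, order_date
    FROM big_fact_table;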

My point is, I think Netezza has achieved the goal of handling ‘big data’. It’s very different from a traditional database (no indexes, no primary key enforcement, no foreign key enforcement). In fact, some keen people will point out that it is not a database at all; it’s specialized hardware with a database façade. This means that you will need to think differently when using Netezza. It’s not hard to master, but you will need to open your mind, because Netezza will defy many lessons we all learned the hard way on traditional databases.

3 comments

  1. Good post. I have been working with Netezza appliances for some time and have similar observations. The system is extremely fast when it comes to read-related operations, but it lacks SQL syntax flexibility and RI, which by design is non-existent.

    Since we both share a passion for Microsoft SQL Server, I would only challenge your comment “If I needed to do this on a SQL Server it might be impossible”. Are you sure there would be much difference with a similarly spec’d SQL Server appliance or PDW running against a properly designed data warehouse system?

    I would love to see the comparison.

    NB: A couple of months ago I was giving a presentation on Power BI; the engine was able to handle 1bn fact records in real time on a 4-core laptop (slice and dice). Solutions can be really fast if they are designed properly.

    1. Thanks for commenting!

      I think the problem you will run into is when you try to use SQL Server with big data sets enterprise-wide. For example, here are some stats from our current ELT process:

      We extract and transform an average of 20 million rows from source systems each hour.
      Our data warehouse grows by about 50 million rows each day, currently totaling over 90 billion rows.

      SQL Server might get close with some REALLY super hardware and lots of careful engineering, but the great thing about Netezza is that it just works. All this while serving ad-hoc queries from reporting tools and data analysts.

      I have no experience with MS’s PDW, but I am curious about it myself. It seems like MS isn’t really marketing it, though; in fact, I think they have moved on to Azure Data Lake.

      I definitely agree that Netezza SQL isn’t quite as robust as T-SQL, but I have yet to find anything I can’t do in it.

  2. I heard that IBM is no longer adding any new functionality to Netezza and is trying to move their customers to dashDB. We have an old TwinFin 12 that we’ll need to replace with something else. I had a friend who did a POC on dashDB and it failed miserably.
