‘Master Data’ is not as easy to define as other data sciences. Relational theory is mathematical in nature. Dimensional design is based on ease-of-use and taking advantage of system architectures. Master Data doesn’t have the same foundation. (more…)
I have a series of joins that insert a large number of rows into a table. As you can imagine, I want this process to perform as well as it can, and in Netezza that means finding the best distribution and utilizing zone maps where possible.
To optimize these joins I would tweak the tables involved, run the process, and kill it if it took too long. I had an idea of how long it should take in an ideal scenario, so anything much longer than that I halted and adjusted.
But the more I tweaked, the slower it ran. It just became worse and worse. So finally I reverted back to the original structure… and it was slower than ever! What happened? (more…)
Should a table always have a surrogate key? Is it really best practice, or should we stick to natural keys? Let’s dig in to see if there’s one clear answer. (more…)
There are no indexes in Netezza. This is an intentional design choice. Indexes are effective in an OLTP system, where transactions affect a single row or two, but not in an OLAP system, where transactions affect large sets of data. Netezza is designed to be a fully OLAP-focused system, in both hardware and software, and indexes did not meet that goal. (more…)
Want to view deleted data? See which rows were added in a transaction? These extra fields will allow you to do so. (more…)
People are in your data. They’re looking at it, even changing it. Should they be stopped? Did they break something? If so, who broke it? When? These are common questions that come up (especially after someone breaks something), and DBAs are often called upon to solve the problem. Here are three great ways to start identifying your data gremlins: (more…)
Inside Netezza are specialized processing units called SPUs, which have their own sets of disks, each containing a portion of your data. The SPUs work in parallel to return the answer you need. But if one of them has more data than the others, then we will be waiting on just that one to finish while the others are already done and sitting idle. The workload is unbalanced because the data is skewed. To conquer big data, we need to divide work evenly. (more…)
In any database, whether it’s SQL Server, Oracle, Sybase, DB2, Access (does Access count?), or anything else, the primary bottleneck is disk IO. Even if you have an SSD, the disk is the slowest part. In those databases we use indexes to alleviate this. Their query engine uses the index to find exactly where on the disk to start and stop reading, which keeps it from scanning entire tables to find every answer.
Netezza has an entirely different solution to the IO problem (in fact, it has no indexes). It cuts the cost of disk reads by doing the work in parallel. Simply put, it’s a brute force solution to the problem using specialized hardware, but that’s just the start. (more…)
ERROR: 256 : Code generation failure
In my experience this error means “I can’t figure out how to execute your query” or “The plan I created to execute your query failed.” (more…)