Netezza Table Design Best Practices


Here are a few tips that can make a significant difference in the efficiency and effectiveness of developers and users by making information available to them when developing and creating analytic objects. This information can also be very helpful to data modelers. While Netezza/PureData does not enforce some of these recommendations, that makes them no less helpful to your community.

Here are several best practices and guidelines for designing tables in Netezza. Data distribution across the disks is one of the most important factors affecting performance. When choosing a distribution key, prefer columns with high cardinality that will frequently be used in join conditions. Avoid low-cardinality columns such as booleans, which cause data skew, and avoid distributing tables on columns often used in the WHERE clause, as that causes processing skew.
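As a sketch of the distribution-key guidance above (the table and column names are hypothetical):

```sql
-- A hypothetical fact table distributed on a high-cardinality join key.
-- NOT NULL is applied to columns used in joins and restrictions.
CREATE TABLE sales_fact (
    customer_id   INTEGER       NOT NULL,  -- high cardinality, joins to the dimension
    sale_date     DATE          NOT NULL,
    store_code    CHAR(4)       NOT NULL,  -- fixed-length code: CHAR, not VARCHAR
    sale_amount   NUMERIC(12,2)
)
DISTRIBUTE ON (customer_id);

-- Distributing the dimension on the same key makes the join co-located,
-- so the larger fact table is not redistributed or broadcast.
CREATE TABLE customer_dim (
    customer_id   INTEGER       NOT NULL,
    customer_name VARCHAR(100)            -- truly variable-length data: VARCHAR is fine
)
DISTRIBUTE ON (customer_id);
```

Reserve `DISTRIBUTE ON RANDOM` for small tables where redistribution or broadcast is cheap.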

  • If possible, select the common key between the Dim and Fact tables; if not, select a key that ensures the larger (Fact) table is not redistributed. Choose random distribution only as a last resort, as it more often leads to a table being redistributed or broadcast. This is acceptable for a small table but will hurt performance if done for large tables.
  • Use Char(x) instead of Varchar(x) when you expect the data to be a fixed length, as this not only helps to save disk space but also helps performance due to reduced I/O.
  • Where possible, use the NOT NULL constraint for columns in the tables, especially for columns that are used in where clauses (Restriction) or join conditions.
  • Use the same data type and length for columns that are often used for joining tables.
  • The largest table should be joined last in the query.
  • Use joins over subqueries.
  • Create materialized views to vertically partition small sets of columns that are often used in queries. The optimizer automatically decides whether to use the MV or the underlying table. Sorting the MV with an ORDER BY clause on the most restrictive column provides an index-like benefit.
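A minimal materialized-view sketch of the last bullet (table and view names are hypothetical):

```sql
-- Thin, sorted projection of only the columns a common query reads.
-- ORDER BY on the most restrictive column sharpens the zone maps,
-- giving index-like pruning when the MV is scanned.
CREATE MATERIALIZED VIEW sales_mv AS
    SELECT customer_id, sale_date, sale_amount
    FROM sales_fact
    ORDER BY sale_date;
```

Queries continue to reference `sales_fact`; the optimizer may substitute `sales_mv` when the MV covers the columns requested.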

Alter the table to Identify Primary Keys (PK)

  • Visually helps developers and users know what the primary keys of the table are
  • Other tools can also import primary key information as metadata (e.g. InfoSphere, Datastage, Data Architect, Governance Catalog, DBeaver)
  • The query optimizer will use these definitions to define efficient query execution plans
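Since Netezza does not enforce primary keys, the constraint below is declarative metadata only (table and constraint names are hypothetical):

```sql
-- Declare the primary key so tools and the optimizer can see it.
-- Netezza records the constraint but does not enforce uniqueness;
-- the ETL process must guarantee it.
ALTER TABLE customer_dim
    ADD CONSTRAINT customer_dim_pk PRIMARY KEY (customer_id);
```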

Alter the table to Identify Foreign Keys (FK)

  • Illustrate table relationships for developers and users
  • Foreign key information can also be imported as metadata by other tools (e.g. InfoSphere, Datastage, Data Architect, Governance Catalog, DBeaver)
  • The query optimizer will use these definitions to define efficient query execution plans.
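A hypothetical foreign-key declaration, again declarative rather than enforced:

```sql
-- Declare the relationship from the fact table to its dimension.
-- As with primary keys, Netezza records but does not enforce this;
-- it documents the join path and informs the optimizer.
ALTER TABLE sales_fact
    ADD CONSTRAINT sales_fact_customer_fk
    FOREIGN KEY (customer_id) REFERENCES customer_dim (customer_id);
```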

Limit Distribution Key to Non-Updatable Fields

  • This one seems obvious, but the problem occurs regularly when tables and optimizations are not properly planned: Netezza generates an error if an update is attempted against a column that is part of the table's distribution key.
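A hypothetical illustration of the restriction (the first statement is expected to fail):

```sql
-- customer_id is the distribution key of sales_fact, so this is rejected:
UPDATE sales_fact
   SET customer_id = 42      -- error: distribution-key columns cannot be updated
 WHERE customer_id = 7;

-- Updating non-key columns is fine:
UPDATE sales_fact
   SET sale_amount = 19.99
 WHERE customer_id = 7;
```

This is why mutable business attributes should never be chosen as distribution keys.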

Use Not Null on Fields

  • Using ‘Not Null’ wherever the field data and ETL transformation rules can enforce it helps improve performance by reducing the number of null-condition checks performed and reducing storage.

Use Consistent Field Properties

  • Using the same data type and field length in all tables with the same field name reduces the amount of interpretation/conversion required by the system, developers, and report SQL.

Schedule Table Optimizations

  • Work with your DBAs to determine the best schedule, system user, and priority for groom and generate-statistics operations. Keep in mind when these optimizations run relative to when users need to consume the data. Too often they are not performed before users need the performance, or are scheduled at the DBA’s convenience without proper consideration of other processing needs. This has proven especially true in data warehousing when the DBA lacks data warehousing experience or does not understand the load patterns of the ETL/ELT process.
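The maintenance commands themselves are simple; scheduling them is the hard part. A sketch, using the hypothetical table name from earlier examples:

```sql
-- Reclaim space from logically deleted rows after heavy UPDATE/DELETE activity.
GROOM TABLE sales_fact RECORDS ALL;

-- Refresh optimizer statistics after large loads, before users query the data.
GENERATE STATISTICS ON sales_fact;
```

Running these after the nightly ETL completes, but before the reporting window opens, keeps query plans current without competing with user workloads.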