DataStage Large VARCHAR Performance

To the IBM Infosphere Information Server Parallel Environment Variables, I would add APT_COMPRESS_BOUNDED_FIELDS variable under “Reading and Writing Files” or, perhaps, create a new category called “Performance” for VARCHAR handling.  Using this parameter can help with the overall performance of jobs having a significant number of Varchar fields, especially, large Varchar fields.  This parameter can also reduce sort and dataset storage space consumption within parallel jobs. This parameter is applicable to IBM Infosphere Information Server 8.0.1 fp3 and more recent versions.

To enable the parameter to ensure the APT_COMPRESS_BOUNDED_FIELDS is set to 1 in your project environment variables, you may need to add the variable. I used these properties:

User definedAPT_COMPRESS_BOUNDED_FIELDSStringVARCHAR Compression for Performance1

To further improve performance:

  • Set VARCHAR field length to zero. This will avoid column padding, thereby, reducing I/O and space consumed by Datasets and Sorts
  • Avoid writing to files, if possible.
  • Avoid the use of Datasets, if possible; especially, persistent datasets.
  • Avoid the use of sorts within the Datastage job, if possible.
  • Pay attention to your partitioning strategy within the Datastage job to avoid unintentional sort operations.