Data Warehouse – Effective Practices


Effective Practices

Effective practices are enablers, which can improve performance, data availability, environment stability, resource consumption, and data accuracy.

Use of an Enterprise Scheduler

The scheduling service in InfoSphere Information Server (IIS) leverages the operating system (OS) scheduler, the common enterprise scheduler can provide these capabilities beyond those of a common OS scheduler:

  • Centralized control, monitoring, and maintenance of job stream processes
  • Improved insight into and control of cycle processes
  • Improved intervention capabilities, including alerts, job stream suspension, auto-restarts, and upstream/downstream dependency monitoring
  • Reduced time-to-recovery and increased flexibility in recovery options
  • Improved ability to monitor and alert for a mission-critical process that may be delayed or failing
  • Improved ability to automate disparate process requirements within and across systems
  • Improved load balancing to optimize the use of resources or to compensate for the loss of a given resource
  • Improved scalability and adaptability to infrastructure or application environment changes

Use of data Source Timestamps

When they exist or can be added to data, ‘created’ and ‘last updated’ timestamps can greatly reduce the impact of Change Data Capture (CDC) operations.  Especially, if the data warehouse, data model and load process store that last successful run time of CDC jobs. This reduces the number of rows required to be processed and reduces the load on the RDBMS and/or ETL application server.  Leveraging ‘created’ and ‘last updated’ can, also, greatly reduce the processing time required to perform the same CDC processes.

Event-Based Scheduling

Event-based scheduling, when coupled with an Enterprise scheduler, can increase data availability, distribute work opportunistically. Event-based scheduling can allow all or part of a process stream to begin as soon as predecessor data sources have completed the requisite processes.  This can allow processes to begin as soon as possible, which can reduce resource bottlenecks and contention. This, potentially, allows data to be made available earlier than a static time-based schedule.  Event-based scheduling can also delay processing, should the source system requisite processing completion be delayed; thereby, improving data accuracy in the receiving system.

Integrated RDBMS Maintenance

Integrating RDBMS Maintenance into the process job stream can perform on-demand optimization as the processes move through their flow, improving performance.  Items such as indexing, distribution, and grooming, maintenance at key points ensure that the data structures are optimized for follow on processes to consume.

Application Server and Storage  Space Monitoring and Maintenance

Monitoring and actively clearing disk space can not only improve overall performance, and reduce costs, but it also improves application stability.

Data Retention Strategies

Data Retention strategies, an often overlooked form of data maintenance, which deals with establishing policies ensure only truly necessary data is kept and that information by essential category, which is no longer necessary is purged to limit legal liability, limit data growth, storage costs, and improve RDBMS performance.

Use Standard Practices

Use of standard practices both, application and industry, allows experienced resources to more readily understand the major application activities, their relationships, dependency, design, and code.  This facilitates resourcing and support over the life cycle of the application.

How to Schedule an Infosphere Datastage Job in crontab


This is a quick, easy, shortcut way to schedule an InfoSphere DataStage job in Linux crontab to take advantage of capabilities within crontab not available in the InfoSphere graphical user interface (GUI).

For this example, the job has been adjusted from the stand InfoSphere scheduler graphical user interface (GUI) setting to run every 15 minutes, which is not available in the GUI.

The Basic crontab Scheduling Process

  • Schedule the job in DataStage Director
  • Login into Linux as the user, who created the schedule
  • Run ‘crontab -e’ command
  • Edit crontab command line using VI commands
  • Saves changes

Note: The revised schedule if different from the InfoSphere scheduler GUI standard setting will not display as changed in the GUI.  However, the jobs will run as scheduled, if edited correctly, and can be verified in the Director Client.

Related links:

What is Runbook Automation?


What is Run Book Automation?

Run Book Automation (RBA) is defining, building, orchestrating, managing and reporting automation on workflows which support system and network operational processes.

Related Topics

What is a Runbook?

What is an Enterprise Process Scheduler?

What is an Enterprise Process Scheduler?


What is an Enterprise Scheduler?

The enterprise scheduler can be thought of as a capstone or top of stack process orchestration application, which automates, monitors and controls workflow throughout the enterprise IT infrastructure.

Enterprise Process Scheduler Common Capabilities

Common Enterprise Scheduler Capabilities (not exhaustive) when compared to common operating system Schedulers (e.g. CRON, Windows Scheduler):

  • Built-in Reporting Capability.
  • Flexibility in recovery operations
  • Integration – Web Services, COM Interfaces
  • Intervention capabilities, including retries, alerts, job stream suspension, auto-restarts, and upstream/downstream dependency monitoring
  • Management Console capability
  • Multi-Platform Support (Windows, Unix /Linux, IBM, etc.)
  • Multiple Calendar Support (Fiscal, Gregorian, etc.)
  • Out of the box integration with other COTS systems (ERP, ETL, Database, etc.)
  • Queuing Concept for Load Management
  • Runbook Automation
  • Scripting Capability
  • Security – LDAP and Roles based security capability
  • Self Service Automation
  • single point of control, you can view and manage composite and/or cross-application, workloads
  • Workload Automation

Related Topics