Performance tuning on IBM i: finding bottlenecks and reading query plans

Performance problems on IBM i tend to fall into a small number of categories. A query that reads every record in a million-row table when it only needs ten. A batch job that holds a lock for twenty minutes while processing records one at a time. A program that makes the same database call in a loop a thousand times. An output queue backed up because a job is printing ten thousand pages that nobody asked for.

The good news is that IBM i gives you more visibility into what is happening than most platforms do. The tools to find a bottleneck are built in, the query optimiser records its decisions, and the system tracks performance statistics at a level of detail that makes root cause analysis genuinely possible.

This post covers the practical approach to finding and fixing performance problems on IBM i — where to look first, how to read what the system tells you, and the changes that make the most difference.

Start with the right question

Before reaching for any tool, get specific about the problem. “The system is slow” is not actionable. These questions are:

Which specific job, program, or query is slow?
How long does it take now, and how long should it take?
Is it always slow, or only at certain times of day?
Did it used to be faster? What changed?

A problem that only appears at 9am on Monday is a concurrency or resource contention problem, not a code problem. A query that runs in two seconds interactively but thirty seconds in batch is probably a job priority or isolation level issue. Getting specific before you start tuning saves hours.

WRKACTJOB — find the hot job

When something is actively slow right now, start here. Sort by CPU to see what is consuming resources:

WRKACTJOB SRTFLD(*CPUPCT)

Look for:

A job consuming disproportionate CPU — likely a full table scan or a tight loop
Multiple jobs in LCKW — lock contention, one job blocking many others
A job stuck in MSGW — waiting for operator reply, holding resources while it waits
Unusually high database I/O in the function column

Press F11 from WRKACTJOB to cycle through different views — one shows database reads, another shows total I/O operations. A job with millions of database reads in a short time is almost certainly doing a table scan where it should be using an index.

WRKSYSSTS — the system pressure gauges

WRKSYSSTS gives you the system-level picture:

WRKSYSSTS

The numbers to watch:

% CPU used — if this is consistently above 80%, CPU is the constraint. Find the heavy jobs with WRKACTJOB.

Database faults / non-database faults — page faults: the system had to read data from disk because it was not in memory. High database faults mean the working set is too large for available memory, or a full table scan is pulling in enormous amounts of data.

Pool fault rates — WRKSYSSTS shows multiple memory pools. If the *BASE or *INTERACT pool has a high fault rate, that pool is undersized for its workload. Increasing pool size reduces faults and speeds up those jobs.

Visual Explain — read the query plan

When a specific SQL query or embedded SQL statement is slow, you need to see what the optimiser decided to do. The tool for this is Visual Explain, available in ACS (IBM Access Client Solutions).

In ACS: open the SQL console, write your query, and instead of running it choose Run → Explain SQL. Visual Explain draws a graphical tree of how DB2 will execute the query.

What to look for:

Table scan vs index scan — the single most important thing. A table scan reads every row in the table and filters afterwards. An index scan jumps directly to the relevant rows. On a million-row table, the difference is between a two-second query and a two-minute one. Visual Explain marks table scans clearly — if you see one on a large table, you need an index on the columns in your WHERE clause.

Estimated rows vs actual rows — if the optimiser estimated 10 rows but actually processed 500,000, its statistics are stale. Refresh them:

ANZOBJPFR OBJ(MYLIB/ORDERPF) OBJTYPE(*FILE)

Join order and method — for multi-table queries, check whether the optimiser is joining large tables to small ones in an efficient order. If the plan looks wrong, stale statistics are the most common cause.

PRTSQLINF — the green-screen alternative

If you do not have ACS, use PRTSQLINF to print the access plan for every SQL statement in a compiled program:

PRTSQLINF PGM(MYLIB/MYPGM)

This produces a spooled report showing which index the optimiser used for each statement (or that it did a table scan), the estimated row count, and the optimisation method. Less visual than ACS but always available, including over a 5250 session.

Creating the right indexes

Most query performance problems are missing index problems. Once you know which columns appear in the WHERE clause and ORDER BY of a slow query, the fix is usually a new index.

-- A query that filters by status and date range
SELECT OrderID, CustID, OrderTotal
  FROM MYLIB.ORDERPF
  WHERE Status = 'OPEN'
    AND OrderDate >= :StartDate
  ORDER BY OrderDate DESC

This benefits from a composite index on (Status, OrderDate):

CREATE INDEX MYLIB.ORDERPF_STATUS_DATE
  ON MYLIB.ORDERPF (Status, OrderDate DESC)

Index column order matters. Put the equality filter column first (Status = 'OPEN'), then the range or sort column (OrderDate). The optimiser uses an index efficiently when the leading columns match the filter pattern.

After creating the index, run Visual Explain again to confirm the optimiser picked it up. If it did not, check that table statistics are current.

Index Advisor — let the system tell you what to create

DB2 for i tracks every query that could have used an index but did not have one. The Index Advisor accumulates these recommendations over time and exposes them through SQL:

SELECT TIMES_ADVISED,
       ESTIMATED_PROCESSING_TIME_SAVED,
       SYSTEM_TABLE_SCHEMA,
       SYSTEM_TABLE_NAME,
       INDEX_COLUMNS_ADVISED,
       LAST_ADVISED
  FROM QSYS2.SYSIXADV
  ORDER BY TIMES_ADVISED DESC
  FETCH FIRST 20 ROWS ONLY

TIMES_ADVISED shows how many times the system wanted this index. ESTIMATED_PROCESSING_TIME_SAVED shows cumulative time that could have been saved. Sort by TIMES_ADVISED descending and work through the top entries — these are the highest-value indexes you are missing on your busiest tables.

The N+1 problem — database calls in a loop

The single most common RPG performance problem. The program reads a list of orders, then for each order does a separate CHAIN to look up the customer. A thousand orders means a thousand CHAINs — a thousand separate database round-trips.

The fix is always the same: replace the loop with a join.

-- Instead of: read orders in a loop, CHAIN customer for each one
-- Do this: one query that joins both tables upfront

exec sql
  declare OrderCursor cursor for
    select o.OrderID, o.OrderTotal, o.OrderDate,
           c.CustName, c.CustEmail, c.CreditLimit
      from MYLIB.ORDERPF  o
        join MYLIB.CUSTMAST c on c.CustID = o.CustID
      where o.Status = 'OPEN'
      order by o.OrderDate;

One database call instead of one thousand. On large datasets this is routinely a 50x to 100x improvement.

Commitment control on high-volume updates

If you are inserting or updating thousands of records with commitment control active, every change generates a journal entry. Committing too frequently adds overhead; committing too infrequently holds locks for too long and creates a large rollback exposure.

The right pattern is batch commits:

dow SQLCODE = 0;
  exec sql UPDATE MYLIB.ORDERPF
    SET ProcessedFlag = '1'
    WHERE OrderID = :CurrentOrderID;

  Counter += 1;
  if Counter >= 500;
    exec sql COMMIT;
    Counter = 0;
  endif;

  exec sql FETCH OrderCursor INTO :CurrentOrderID;
enddo;

exec sql COMMIT; // Final commit for remaining records

Commit every 500 to 1000 records for bulk operations. This balances journal overhead against lock duration and rollback exposure.

Avoiding repeated file opens

Every time an RPG program opens a file there is overhead — the system resolves the object, checks authorities, and allocates internal structures. Opening a file inside a loop multiplies that overhead by the iteration count.

Open files once at program start, use them throughout, close at the end. For service programs called in a loop by a driver program, use actgrp(*caller) so the service program shares the caller’s activation group and files stay open between calls.

Optimiser messages — why did it choose that plan?

When the optimiser makes a surprising decision, it sometimes leaves messages in the job log explaining why. Enable optimiser messages for a job:

CHGQRYA QRYOPTLIB(*SAME) OPTIMIZE(*YES)

Then query the job log:

SELECT MESSAGE_ID, MESSAGE_TEXT, MESSAGE_TIMESTAMP
  FROM TABLE(QSYS2.JOBLOG_INFO('*')) AS X
  WHERE MESSAGE_ID LIKE 'CPI4%'
  ORDER BY MESSAGE_TIMESTAMP DESC

CPI4xxx messages are optimiser notifications — they tell you when it chose a table scan, when it built a temporary index, when statistics were too stale to use, and what alternative plans it considered. These messages often explain exactly why a plan is suboptimal.

Collection Services — tracking performance over time

For understanding workload patterns across hours or days rather than diagnosing a single slow query, Collection Services continuously gathers system-wide performance data at configurable intervals.

Start a collection:

STRPFRMON INTV(5) TYPE(*MGTCOL)

This collects data every 5 minutes. Data is stored in a performance database under QMPGDATA and can be queried with SQL or analysed with IBM’s Performance Tools licensed program. Collection Services shows CPU trends, memory pressure over time, peak batch windows, and which subsystems consume the most resources across a full business day.

Use Collection Services when the problem is not a single slow job but a general sense that the system gets congested at predictable times.

The performance tuning checklist

Work through this in order when a program or query is slow:

Table scan? Check Visual Explain or PRTSQLINF. If yes — create the missing index.
Index Advisor recommendations? Query QSYS2.SYSIXADV and create the top-advised indexes.
Stale statistics? Run ANZOBJPFR on the affected table.
Database call in a loop? Consolidate into a join or a single query with a cursor.
Lock contention? Use WRKOBJLCK to find the blocking job.
Commitment control overhead? Switch to batch commits every 500-1000 records.
Memory pool pressure? Check fault rates in WRKSYSSTS and resize pools if needed.
Time-of-day problem? Check what else runs concurrently with WRKJOBSCDE.

Most performance problems yield to steps 1 through 4. The system has the answers — the tools above are how you read them.

Next post: Security on IBM i — object authority, user profiles, and keeping your data safe.