When HyperPAV was announced, the extinction of IOSQ was expected to follow shortly. And indeed for most customers IOSQ time is more or less an endangered species. Yet in some cases a bit of IOSQ remains, and even queuing on HyperPAV aliases may be observed. The reflexive reaction from a good performance analyst is to interpret the queuing as a shortage that can be addressed by adding more Hypers. But is this really a good idea? Adding aliases will only increase overhead and will decrease WLM’s ability to handle important I/Os with priority. Let me explain why.
HyperPAV, like many I/O related things in z/OS, works on an LCU basis. LCUs are a management concept in z/OS: each LCU can support up to 8 channels for data transfer, and up to 256 device addresses. With HyperPAV, some of the 256 addresses are used for regular volumes (“base addresses”), and some are available as “aliases”. You do not need to use all 256 addresses; it is perfectly valid to have no more than 64 base addresses and 32 aliases in an LCU.
Unlike SCSI devices that use tagged command queuing, z/OS I/O devices were historically designed to handle only one I/O operation per device address at a time. HyperPAV circumvents this limitation by assigning alias addresses “on demand” when additional I/O operations arrive for a particular device address. This way, a single logical volume (“volser”) can handle many I/O operations at any point in time, up to the number of available aliases. Thus it seems logical that any signs of IOSQ could be resolved by simply adding aliases. However, this ignores other constraints.
The total number of I/O operations that can be handled concurrently by an LCU is constrained by the number of devices and the number of FICON channels. Devices are logical resources for which the work is spread over very many physical back-end drives, but FICON channels are real resources, using real host interface boards in the storage systems. Both the channels and the interface boards have limited processing and data transfer capability.
When 32 aliases and (say) 16 base addresses are used, this means that 48 concurrent I/O operations are supported for a particular LCU, or 6 operations per FICON channel (48 concurrent I/Os divided by 8 FICON channels). As typically most operations are handled at cache speed (read hits, sequential reads, writes) there is really no point in starting even more operations at the same time by increasing the number of aliases. You will only create internal queuing inside the storage system, resulting in less efficient operation and higher pending, connect and disconnect times. You may reduce the IOSQ time, but you pay for it elsewhere.
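The arithmetic above can be written down as a small helper. This is only a sketch of the reasoning in this article: the function name `concurrent_ios_per_channel` and its parameters are made up for illustration and do not correspond to any z/OS or RMF interface.

```python
MAX_LCU_ADDRESSES = 256  # device addresses available in one LCU


def concurrent_ios_per_channel(base_addresses: int, aliases: int, channels: int) -> float:
    """Upper bound on concurrently started I/Os per FICON channel for one LCU.

    Hypothetical helper: every base address and every alias is assumed to
    carry at most one I/O at a time, spread evenly over the channels.
    """
    if channels < 1:
        raise ValueError("an LCU needs at least one channel")
    if base_addresses + aliases > MAX_LCU_ADDRESSES:
        raise ValueError("base addresses plus aliases cannot exceed 256 per LCU")
    return (base_addresses + aliases) / channels


# The example from the text: 16 base addresses, 32 aliases, 8 channels.
print(concurrent_ios_per_channel(16, 32, 8))   # 6.0

# Same channels with four times the aliases: the channels now have to
# absorb three times as many simultaneous operations.
print(concurrent_ios_per_channel(16, 128, 8))  # 18.0
```

The second call shows what adding aliases really does: the channels and host interface boards stay the same, only the number of operations thrown at them at once goes up.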
Another commonly used argument is that more aliases (than 32, or any number) are needed, because RMF reports that sometimes all aliases are used, regardless of whether there is visible IOSQ time. In many cases that we have investigated, these ‘all hypers used’ conditions are caused by DB2 buffer flushes, where DB2 flushes very many buffers at the expiration of an interval. DB2 queues up so many I/O requests that all aliases are exhausted, connect time explodes, and it takes the storage system quite a while to recover. ‘A while’ may be only 0.1 second, but during that 0.1 second no other work can be started since all aliases are claimed by this essentially asynchronous DB2 work. When 32 aliases are defined, DB2 will not be able to claim more than 32, and any excess I/Os are queued. That may seem bad, but it means that when more important work comes in, the Workload Manager can still give these I/Os priority! And the DB2 buffer flush I/Os are asynchronous to begin with, so they should not be allowed to monopolize a storage system.
Having fewer aliases means that, although your storage system may still be driven into saturation by short spikes, z/OS WLM will be able to give priority to new I/Os. In general, this makes it harder for a single application to dominate an LCU.
Our recommendation is that the number of aliases be based on your assessment of how many I/Os you will be able to process concurrently on a channel set. When making this estimate, please consider that multiple LCUs tend to share the same sets of ports, further reducing the maximum throughput that is possible. It is very unlikely that this analysis will give a higher number than 32.
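One way to turn this recommendation into a number is sketched below. It is a rough illustration under stated assumptions, not a sizing rule: the function `alias_budget` and all of its parameters are hypothetical, the per-channel concurrency is your own estimate, and LCUs that share the same channels are simply assumed to get an equal share.

```python
def alias_budget(ios_per_channel: int, channels: int, base_addresses: int,
                 lcus_sharing_channels: int = 1) -> int:
    """Rough alias count for one LCU, given an estimated channel concurrency.

    Assumptions (all hypothetical simplifications):
      - ios_per_channel is your own estimate of useful concurrent I/Os per channel
      - LCUs sharing the same channels each get an equal share of that concurrency
      - base addresses and aliases each carry at most one I/O at a time
    """
    if channels < 1 or lcus_sharing_channels < 1:
        raise ValueError("channels and lcus_sharing_channels must be at least 1")
    total_concurrency = ios_per_channel * channels
    share_for_this_lcu = total_concurrency // lcus_sharing_channels
    # Base addresses already account for part of the concurrency.
    return max(0, share_for_this_lcu - base_addresses)


# 6 concurrent I/Os per channel, 8 channels, 16 base addresses, no sharing.
print(alias_budget(6, 8, 16))                           # 32

# The same channels shared by two LCUs leave far less room for aliases.
print(alias_budget(6, 8, 16, lcus_sharing_channels=2))  # 8
```

In practice the share per LCU depends on how the workload is distributed, so treat the result as an upper bound to reason about rather than a value to configure blindly.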
There is one exception to this rule. When connect time is only a small portion of the total service time, it may make sense to increase the number of aliases. This may happen, for example, because of high disconnect time on read misses, or because of I/O to remote sites where pending time is elevated due to distance. In these rare instances it does make sense to try to get more concurrent I/Os started on the FICON links by using more aliases.
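A quick check for this exception could look like the sketch below. It assumes that service time is the sum of pending, connect and disconnect time, and it uses a threshold of 25% purely as a placeholder; this article does not define what a “small portion” is, so pick a value that fits your own environment. The function names are hypothetical.

```python
def connect_share(pend_ms: float, connect_ms: float, disconnect_ms: float) -> float:
    """Fraction of the service time that is spent in connect time.

    Assumption: service time = pending + connect + disconnect (IOSQ excluded).
    """
    service_ms = pend_ms + connect_ms + disconnect_ms
    if service_ms <= 0:
        raise ValueError("service time must be positive")
    return connect_ms / service_ms


def more_aliases_may_help(pend_ms: float, connect_ms: float, disconnect_ms: float,
                          threshold: float = 0.25) -> bool:
    """True when connect time is a small portion of service time.

    The default threshold is an arbitrary placeholder, not a recommendation.
    """
    return connect_share(pend_ms, connect_ms, disconnect_ms) < threshold


# Mostly cache hits: connect time dominates, so extra aliases add little.
print(more_aliases_may_help(pend_ms=0.2, connect_ms=0.5, disconnect_ms=0.1))

# Remote site with many read misses: the links are mostly waiting.
print(more_aliases_may_help(pend_ms=1.5, connect_ms=0.3, disconnect_ms=2.2))
```

The first call returns False and the second True: only in the second case are the FICON links idle for most of each operation, which is when starting more I/Os in parallel can pay off.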