A prominent theme among IT organizations today is an intense focus on expense reduction. For mainframe departments, this routinely involves seeking to reduce IBM Monthly License Charge (MLC) software expense, which commonly represents the single largest line item in their budget.
This is the second article in a four-part series focusing largely on a topic that has the potential to generate significant cost savings but which has not received the attention it deserves, namely processor cache optimization. (Read part one here). Without an understanding of the vital role processor cache plays in CPU consumption and clear visibility into the key cache metrics in your environment, significant opportunities to reduce CPU consumption and MLC expense may not be realized.
This article focuses on changes to LPAR configurations that can improve cache efficiency, as reflected in lower RNI values. The two primary aspects covered are optimizing LPAR topology and increasing the amount of work executing on Vertical High (VH) CPs by optimizing LPAR weights. To restate one of the key findings of the first article, work executing on VHs optimizes processor cache effectiveness, because a VH's one-to-one relationship with a physical CP means it will consistently access the same processor cache.
PR/SM dynamically assigns LPAR CPs and memory to hardware chips, nodes, and drawers seeking to optimize cache efficiency. This topology can have a very significant impact on processor performance because remote cache accesses can take hundreds of machine cycles. Figure 1 provides a framework for the following discussion.
When a unit of work executing on a CP on a given Single Chip Module (SCM) accesses data in L3 cache (the first level that is shared and thus part of the “nest” and RNI metric), that access can be on its own SCM chip, “on node” (on a different SCM within this node), “on drawer” (on the other node in this drawer), or “off drawer”. The more remote the access, the greater the number of machine cycles required, often hundreds of cycles. Similarly, access to L4 cache and to memory can be “on node”, “on drawer”, or “off drawer.”
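The locality categories described above can be expressed as a small classification routine. The following is a minimal sketch for illustration only: the `Location` type and the `classify_access` function are hypothetical names, not part of any IBM tooling, and the sketch assumes node numbers are only meaningful within a drawer and chip numbers only within a node.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """Hypothetical coordinates of a CP or a cache line in the topology."""
    drawer: int
    node: int   # node number within its drawer
    chip: int   # SCM chip number within its node


def classify_access(cp: Location, data: Location) -> str:
    """Return the locality category of an L3 access, nearest first.

    The order of the checks mirrors the hierarchy in the text:
    a different drawer outranks a different node, which outranks
    a different chip.
    """
    if cp.drawer != data.drawer:
        return "off drawer"
    if cp.node != data.node:
        return "on drawer"
    if cp.chip != data.chip:
        return "on node"
    return "on chip"


if __name__ == "__main__":
    cp = Location(drawer=0, node=0, chip=1)
    print(classify_access(cp, Location(0, 0, 1)))  # on chip
    print(classify_access(cp, Location(0, 0, 2)))  # on node
    print(classify_access(cp, Location(0, 1, 1)))  # on drawer
    print(classify_access(cp, Location(1, 0, 1)))  # off drawer
```

The sketch deliberately assigns no cycle counts to the categories; it only captures the ordering, in which each step outward is more expensive than the one before it.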
A first example showing the impact of LPAR topology on RNI begins with Figure 2.
In this scenario, the logical CPs from several LPARs are competing for physical CPs on three chips. Green and orange shading has been added to compare two primary Production systems that execute similar data sharing workloads, each having three VHs and two VMs.
Note that the VHs and VMs for System 3 (shaded green) are collocated on the same chip. The “RNI by Logical CP” chart for this system appears in Figure 3 below. The RNIs for the two VM CPs on that system (CPs 6 and 8, the green and light blue lines at the top of the chart) were higher than for the VHs, but not dramatically so.
A second LPAR topology scenario involves an entirely different kind of opportunity. In the use case depicted in Figure 8, PR/SM configured all ten VH CPs from two Production LPARs in the same node on a single drawer. The outcome of this configuration was that the two LPARs were sharing the 480 MB L4 local cache of that single node between them.
Maximizing Work Executing on VHs
A second way to reduce RNI is to maximize work executing on VHs. The two variables determining the Vertical CP configuration are (1) LPAR weights and (2) the number of physical CPs. The remainder of this article will cover how LPAR weights can be adjusted to maximize work executing on VHs. The third article of the series will address options and considerations for increasing the number of physical CPs.
There are several ways to maximize work on VHs by setting LPAR weight values. One is to adjust LPAR weights to increase the number of VHs for high-CPU LPARs that currently have a significant workload executing on VMs, and possibly even VLs. In Figure 10, a very small weight change on a large LPAR, from 70 to 71 percent of the total weight, increased the number of VHs from seven to eight.
This resulted in a measured decrease in RNI of 2% for a given measurement interval, which correlated to a CPU reduction of 1%. A 1% CPU reduction on a large LPAR can translate into a meaningful reduction in MLC software expense, especially when weighed against the level of effort required to identify and implement this type of change. Tuning LPAR weights typically produces single-digit percentage improvements, as in this case, but there can be larger opportunities, as we will now see.
A second way to increase work executing on VHs involves tailoring LPAR weights to increase the overall number of VHs assigned by PR/SM on a processor. The LPAR weight configuration of 30/30/20/20% in Figure 11 appears routine, but unfortunately, on a z13 it results in zero VHs.
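The weight arithmetic behind these examples can be sketched in a few lines. An LPAR's guaranteed share, expressed in physical CPs, is its weight divided by the total weight, multiplied by the number of physical CPs. How PR/SM turns that share into VHs and VMs is not specified in this article, so the `approx_vertical_config` function below is a loudly labeled assumption: it applies a commonly cited rule of thumb that a VM should carry at least a 50% share, and it ignores VLs and any machine-specific or microcode-specific behavior. The 12 physical CPs are also a hypothetical value chosen for illustration, not the configuration behind Figure 10.

```python
def guaranteed_cp_share(weight, total_weight, physical_cps):
    """Guaranteed share of the shared CP pool, in units of physical CPs."""
    return weight * physical_cps / total_weight


def approx_vertical_config(share):
    """Approximate (VH count, VM count) for a given CP share.

    ASSUMPTION: simplified rule of thumb, not PR/SM's actual algorithm.
    - a whole-number share maps entirely to VHs
    - a fractional part of 0.5 or more becomes one VM
    - a smaller fraction is combined with one VH's share and
      spread across two VMs
    VLs and processor-specific special cases are not modeled.
    """
    whole = int(share)
    frac = share - whole
    if frac < 1e-9:
        return whole, 0
    if frac >= 0.5:
        return whole, 1
    if whole >= 1:
        return whole - 1, 2
    return 0, 1


if __name__ == "__main__":
    physical_cps = 12  # hypothetical machine size
    for weight in (70, 71):
        share = guaranteed_cp_share(weight, 100, physical_cps)
        vh, vm = approx_vertical_config(share)
        print(f"weight {weight}%: share {share:.2f} CPs -> {vh} VH, {vm} VM")
    # weight 70%: share 8.40 CPs -> 7 VH, 2 VM
    # weight 71%: share 8.52 CPs -> 8 VH, 1 VM
```

Under these assumptions, a one-point weight change moves the fractional part of the share across the 0.5 threshold, which is the kind of small adjustment that can add a VH. For the actual assignment rules on a given processor, consult the references cited in the footnotes rather than this sketch.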
Optimizing processor cache can have a particularly big impact on CPU consumption for z13 and z14 processors, which are more sensitive to cache effectiveness than earlier generations. In the next article in this series, we will explore options and considerations relating to the number of physical CPs that can reduce RNI, CPU consumption, and MLC expense.
Read part 3 here: Optimizing MLC Software Costs with Processor Configurations
[Havekost2017a] Todd Havekost, Impact of Processor Cache Optimization on MLC Software Costs, Enterprise Tech Journal, 2017: Issue 4.
[Sinram2015] Horst Sinram, z/OS Workload Management (WLM) Update for IBM z13, z/OS V2.2 and V2.1, SHARE Session #16818, March 2015.
[Havekost2017b] Todd Havekost, Beyond Capping: Reduce IBM z13 MLC with Processor Cache Optimization, SHARE Session #20127, March 2017.
[Snyder2016] Bradley Snyder, z13 HiperDispatch – New MCL Bundle Changes Vertical CP Assignment for Certain LPAR Configurations, IBM TechDoc 106389, June 2016.
 For background on key metrics and concepts such as Cycles Per Instruction (CPI), Relative Nest Intensity (RNI), HiperDispatch, and vertical CP configurations, see the first article in the series [Havekost2017a].
 LPAR topology data is provided by the SMF Type 99 Subtype 14 record. Unlike some SMF 99 subtypes, which can generate overwhelming volumes, subtype 14 produces a very manageable volume (one record per logical CP every five minutes), making it another data source that warrants collection and analysis.
 IBM specialists assisting at my former employer identified this opportunity and the workaround to create the desired topology, and measured the increase in effective capacity, which correlated well with estimated CPU savings derived from the RNI metric.
 The specifics of how PR/SM determines Vertical CP assignments based on LPAR weights and the number of physical CPs are beyond the scope of this article [for details see Havekost2017b].
 This is the configuration after an IBM z13 microcode change released June 2016 [see Snyder2016 for details].