Expense reduction initiatives among IT organizations typically prioritize efforts to reduce IBM Monthly License Charge (MLC) software expense, which commonly represents the single largest line item in the mainframe budget.
On current (z13 and z14) mainframe processors, at least one-third and often more than one-half of all machine cycles are spent waiting for instructions and data to be staged into level one processor cache so that they can be executed. Since such a significant portion of CPU consumption depends on processor cache efficiency, it is essential to understand both your key cache metrics and the actions you can take to improve cache efficiency.
This is the final article in a four-part series focusing on this vital but often overlooked subject area. (You can read Article 1, Article 2, and Article 3.) This article examines the changes in processor cache design for the z14 processor model. The z14 reflects evolutionary changes in processor cache from the z13, in contrast to the revolutionary changes that occurred between the zEC12 and z13. The z14 cache changes were aimed particularly at workloads that place high demands on processor cache. These “high RNI” (Relative Nest Intensity) workloads frequently experienced a negative impact when migrating from the zEC12 to the z13.
z14 Cache Design Changes – Overview
The z14 introduces four significant cache changes, listed here. Their rationale and impact are examined in the rest of this article.
- A unified L4 cache enables point-to-point access to remote drawers.
- Important changes to the PR/SM LPAR algorithms were made to reduce cache waiting cycles.
- Selected cache sizes on the z14 were increased.
- The level 1 Translation Lookaside Buffer (TLB) was merged into level 1 cache.
Unified Level 4 Cache
Figure 1 shows the z14 drawer cache hierarchy diagram, which in levels 1 through 3 is largely similar to the z13.
At level 4, however, the designs differ. On the z13, depending on the node in which the data resided, remote accesses could require multiple hops, and thus hundreds of machine cycles. This is one key factor explaining why LPAR topology plays such a major role in RNI on the z13, and why some high RNI workloads had performance challenges migrating to z13 processors. The z14 design, which unifies L4 cache in the System Controller, restores point-to-point connectivity between all the L4 caches.
PR/SM Algorithm Changes
Partnering with the unified L4 cache that reduced the cycles required by cross-drawer accesses, significant changes were made to PR/SM algorithms to reduce the frequency of those remote accesses. These two algorithm changes are particularly noteworthy.
- One change prioritizes the effort to fit an LPAR within a single drawer, seeking to reduce the frequency of expensive remote accesses. The increase from 8 to 10 cores per chip makes this more achievable by increasing the size of an LPAR that will now fit within a drawer.
- A second change prioritizes placing the Vertical High (VH) and Vertical Medium (VM) logical CPs for General Purpose CPs (GCPs) in close proximity, on the same chip or cluster. This is in contrast to the z13, which prioritized co-locating the VHs for both GCPs and zIIPs, frequently at the expense of VM GCPs, which wound up located in a separate drawer, especially on large LPARs.
Reducing the “RNI Penalty”: Case Study
When PR/SM placed VM logical CPs in a different drawer than the VHs for that LPAR (a common occurrence on z13 processors), this usually created a sizable RNI penalty for work executing on those VMs. The penalty arose because almost all of their cache accesses traveled across drawers, which is very expensive in terms of machine cycles on the z13.
As we have seen, the unified L4 cache and PR/SM algorithm changes were designed to reduce the magnitude of that RNI penalty. This case study quantifies the impact of those changes at one site.
When this LPAR was running on the z13 (Figure 5), the VMs (in orange) were in a separate drawer (Drawer 2) from all the VHs (in green, in Drawer 3). This led to frequent cross-drawer accesses by the VMs, causing a very significant RNI penalty.
The outcome of this case study provides evidence that the PR/SM algorithm changes had their intended effect, reducing the RNI penalty for VMs dramatically because of improved proximity in the LPAR topology.
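The kind of topology check behind this case study can be sketched in a few lines of Python. This is a minimal illustration, not PR/SM's algorithm or any product's implementation: the LogicalCP record, the function name, and the number of logical CPs are hypothetical, and only the drawer numbers mirror the z13 layout described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogicalCP:
    """Hypothetical record describing where one logical CP was placed."""
    cp_id: int
    polarity: str  # "VH" (Vertical High) or "VM" (Vertical Medium)
    drawer: int


def vms_outside_vh_drawers(cps):
    """Return the VM logical CPs placed in a drawer that holds none of the LPAR's VHs."""
    vh_drawers = {cp.drawer for cp in cps if cp.polarity == "VH"}
    return [cp for cp in cps if cp.polarity == "VM" and cp.drawer not in vh_drawers]


# Hypothetical z13-style layout: VHs in Drawer 3, VMs in Drawer 2.
z13_layout = [LogicalCP(i, "VH", 3) for i in range(6)] + [
    LogicalCP(6, "VM", 2),
    LogicalCP(7, "VM", 2),
]

# Hypothetical layout with the same logical CPs co-located in one drawer.
colocated_layout = [LogicalCP(cp.cp_id, cp.polarity, 3) for cp in z13_layout]

stranded_z13 = vms_outside_vh_drawers(z13_layout)
stranded_colocated = vms_outside_vh_drawers(colocated_layout)

print("Stranded VMs, z13-style layout:", [cp.cp_id for cp in stranded_z13])
print("Stranded VMs, co-located layout:", [cp.cp_id for cp in stranded_colocated])
```

A non-empty result flags logical CPs whose work is likely to pay the cross-drawer RNI penalty discussed in this section.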
Increases in Selected z14 Cache Sizes
The z14 provides several cache size increases, captured in Figure 7. Level 1 instruction cache became one third larger, level 2 data cache doubled in size, and level 3 cache also doubled.
Constraints in processor chip real estate limit the opportunities for increasing cache sizes. I expect IBM leveraged its machine-level instrumentation to deploy these increases in the tiers of the cache hierarchy where they would have the biggest overall benefit in reducing cache misses that require accesses to subsequent levels of cache or even memory.
Initial metrics from a z14 implementation show good results from these cache size increases. The Level 1 Miss Percentage (L1MP, one of the two primary variables in IBM’s LSPR chart classifying workloads as High, Medium, or Low complexity) decreased by approximately 15%, likely reflecting the increased size of the level 1 Instruction cache as well as other z14 architectural improvements.
And the estimated lifetime of data in processor cache increased by at least 50% in levels 2 and 3 at this z14 site. It also improved slightly in level 4, despite the reduction in aggregated cache at that level. These outcomes will also translate into lower cache miss percentages and thus better RNI.
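To make the link between cache miss percentages and RNI concrete, here is a minimal Python sketch. It assumes only the general form of IBM's RNI calculation: a scaled, weighted sum of the percentages of L1 misses sourced from L3, local L4, remote L4, and memory. The weights below are illustrative placeholders, not IBM's published coefficients (which differ by processor model), and the sample percentages are hypothetical rather than measurements from the site discussed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourcingPercentages:
    """Percent (0-100) of L1 misses sourced from each tier beyond L2."""
    l3: float
    l4_local: float
    l4_remote: float
    memory: float


@dataclass(frozen=True)
class RniWeights:
    """Model-specific coefficients; substitute IBM's published values."""
    scale: float
    l3: float
    l4_local: float
    l4_remote: float
    memory: float


def relative_nest_intensity(p: SourcingPercentages, w: RniWeights) -> float:
    """Scaled, weighted sum of sourcing percentages, divided by 100."""
    weighted = (
        w.l3 * p.l3
        + w.l4_local * p.l4_local
        + w.l4_remote * p.l4_remote
        + w.memory * p.memory
    )
    return w.scale * weighted / 100.0


# ASSUMPTION: placeholder weights that merely grow with distance from the core.
PLACEHOLDER_WEIGHTS = RniWeights(scale=1.0, l3=1.0, l4_local=2.0, l4_remote=4.0, memory=8.0)

# Hypothetical samples: one point of remote L4 sourcing moves to local L4.
before = SourcingPercentages(l3=30.0, l4_local=6.0, l4_remote=2.0, memory=4.0)
after = SourcingPercentages(l3=30.0, l4_local=7.0, l4_remote=1.0, memory=4.0)

rni_before = relative_nest_intensity(before, PLACEHOLDER_WEIGHTS)
rni_after = relative_nest_intensity(after, PLACEHOLDER_WEIGHTS)
print(f"RNI before: {rni_before:.2f}, after: {rni_after:.2f}")
```

Because the more distant tiers carry the larger weights, shifting even a small share of misses to a nearer tier lowers the result, which is why improved LPAR topology and longer cache lifetimes show up as better RNI.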
TLB Design Enhancements
The final design change listed earlier was that the level 1 Translation Lookaside Buffer (TLB1) was “merged” into level 1 cache. The TLB performs the critical and high-frequency function of translating virtual addresses into real addresses. On the z14, the level 1 cache contains all the data needed to perform that address translation function, eliminating the need for a separate TLB1 and the potential for any TLB1 misses.
Again, initial metrics from a z14 implementation show very good results from these TLB design enhancements. The changes reduced the total CPU consumed for TLB misses by more than half.
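As background on the translation function described above, here is a toy Python sketch of a translation cache sitting in front of a slower page-table lookup. It is a generic illustration only: the class, the 4 KB page size, the least-recently-used replacement, and the sample mappings are all assumptions, and it does not model z/Architecture dynamic address translation or the z14 design.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # ASSUMPTION: 4 KB pages, chosen for illustration


class ToyTLB:
    """Small LRU cache of virtual-page to real-frame translations."""

    def __init__(self, page_table, capacity=4):
        self.page_table = page_table  # stands in for the slow table lookup
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def translate(self, virtual_address):
        """Return the real address for a virtual address, counting hits and misses."""
        page, offset = divmod(virtual_address, PAGE_SIZE)
        if page in self.entries:
            self.hits += 1
            self.entries.move_to_end(page)  # mark as most recently used
        else:
            self.misses += 1
            self.entries[page] = self.page_table[page]  # the expensive path
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
        return self.entries[page] * PAGE_SIZE + offset


# Hypothetical mappings: virtual page number -> real frame number.
tlb = ToyTLB(page_table={0: 7, 1: 3})
first = tlb.translate(0x10)   # miss: translation fetched from the page table
second = tlb.translate(0x20)  # hit: same page, translation already cached
print(first, second, tlb.hits, tlb.misses)
```

Every miss takes the expensive path, which is why eliminating a class of misses altogether can reduce the CPU consumed by address translation.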
This series of articles has demonstrated that processor cache performance plays a more prominent role than ever before in the capacity delivered by z13 processors. And though z14 users can welcome the fact that many of the cache pain points experienced by high RNI workloads have been eased by z14 design changes, in most respects the processor cache design carries over from the z13 to the z14.
For both z13 and z14 processors, unproductive cycles spent waiting to stage data into L1 cache continue to represent one-third to one-half of overall CPU. The era of the zEC12 and earlier processors, when processor speed “covered a multitude of sins” and mainframe performance analysts could get by with ignoring processor cache metrics, is gone, at least for the foreseeable future.
Another trend that continues unabated with the arrival of the z14 is that software expenses consume a growing percentage of the overall mainframe budget, while hardware costs represent an ever-smaller percentage. Considering these factors, these articles have encouraged readers to re-examine the traditional capacity planning assumption that running mainframes at high utilizations is the most cost-effective way to operate. Given this changing reality, readers are advised to proactively renegotiate ISV contracts to be usage-based. This will place you in the enviable position of having the flexibility to select hardware configurations that achieve the lowest total cost of ownership.
Finally, this series of articles has sought to help performance analysts realize the importance of having a solid understanding of, and clear visibility into, key processor metrics. Armed with this understanding and visibility, there will often be opportunities to achieve significant CPU (and MLC) savings by leveraging those metrics to optimize processor cache. Hopefully, the methods presented to reduce RNI and CPU consumption, and the case studies confirming the effectiveness of these methods in real-life situations, have equipped readers to identify and implement optimizations that will improve the efficiency, competitiveness, and long-term viability of the mainframe platform in their environments.
Finding Hidden MLC Reduction Opportunities
For more information on finding opportunities to reduce MLC expenses in your environment even after capping, feel free to view my recorded technical webinar, Achieving CPU (& MLC) Savings on z13 and z14 Processors by Optimizing Processor Cache. Previous iterations of this material won a 2016 SHARE Best Presentation award and a 2017 CMG Best Paper award.
And if you are interested in a no-obligation MLC Reduction Assessment utilizing data from your own environment, you can request additional information about that here.