Todd Havekost - 2 April 2018

Expense reduction initiatives among IT organizations typically prioritize efforts to reduce IBM Monthly License Charge (MLC) software expense, which commonly represents the single largest line item in the mainframe budget.

On current (z13 and z14) mainframe processors, at least one-third and often more than one-half of all machine cycles are spent waiting for instructions and data to be staged into level one processor cache so that they can be executed. Since such a significant portion of CPU consumption is dependent on processor cache efficiency, awareness of your key cache metrics and the actions you can take to improve cache efficiency are both essential.
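Since so much of what follows turns on these waiting ("penalty") cycles, it helps to know they can be quantified directly from the CPU MF basic counter set carried in SMF 113 records. Here is a minimal Python sketch; the field names are illustrative placeholders for your own SMF 113 extraction, not the actual record mapping.

```python
# Share of machine cycles spent waiting on level 1 cache misses,
# computed from the CPU MF basic counter set (SMF 113): cycle and
# instruction counts plus L1 directory-write (miss) and penalty-cycle
# counts for the instruction and data caches.  Field names are
# illustrative; map them to your own SMF 113 extraction.

def l1_metrics(cycles, instructions,
               l1i_misses, l1i_penalty_cycles,
               l1d_misses, l1d_penalty_cycles):
    waiting_pct = 100 * (l1i_penalty_cycles + l1d_penalty_cycles) / cycles
    l1mp = 100 * (l1i_misses + l1d_misses) / instructions
    cpi = cycles / instructions
    return waiting_pct, l1mp, cpi

# Hypothetical interval totals:
waiting, l1mp, cpi = l1_metrics(
    cycles=5.0e12, instructions=1.0e12,
    l1i_misses=2.0e10, l1i_penalty_cycles=1.0e12,
    l1d_misses=4.0e10, l1d_penalty_cycles=1.1e12)
print(f"cycles waiting on L1 misses: {waiting:.0f}%")  # ~42%
print(f"L1 miss percentage (L1MP):   {l1mp:.1f}%")     # 6.0%
print(f"cycles per instruction:      {cpi:.1f}")       # 5.0
```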

This is the final article in a four-part series focusing on this vital but often overlooked subject area. (You can read Article 1, Article 2, and Article 3.) This article examines the changes in processor cache design introduced with the z14 processor model. The z14 reflects evolutionary changes from the z13, in contrast to the revolutionary changes that occurred between the zEC12 and z13. The z14 cache design changes were aimed particularly at workloads that place high demands on processor cache, the "high RNI" workloads that frequently experienced a negative impact when migrating from the zEC12 to the z13.

z14 Cache Design Changes – Overview

The z14 introduces four significant changes to the processor cache design; the rationale and impact of each will be examined in the rest of this article:

  1. A unified L4 cache enables point-to-point access to remote drawers.
  2. Important changes to the PR/SM LPAR algorithms were made to reduce cache waiting cycles.
  3. Selected cache sizes on the z14 were increased.
  4. The level 1 Translation Lookaside Buffer (TLB) was merged into level 1 cache.

Unified Level 4 Cache

Figure 1 shows the z14 drawer cache hierarchy diagram, which in levels 1 through 3 is largely similar to the z13.


Figure 1. z14 Drawer cache hierarchy diagram (IBM)

But one very significant difference from the z13 is the unified L4 cache, which plays a key role in the enhancements of the z14. The rationale for this change is to reduce the cost of “remote” cross-drawer accesses.


Figure 2. Drawer Interconnect technologies

Figure 2 compares the drawer interconnect technologies of recent z processors. The zEC12 (and prior models) provided point-to-point connectivity between all "books" (now called drawers), so remote cache could always be accessed directly.

But on the z13, depending on the node in which the data resides, remote accesses could require multiple hops, each costing hundreds of machine cycles. This is one key factor explaining why LPAR topology plays such a major role in RNI on the z13, and why some high RNI workloads had performance challenges migrating to z13 processors. The z14 design, unifying the L4 cache in the System Controller, restores point-to-point connectivity between all the L4 caches.
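To get a feel for why hop count matters, the following sketch models the expected cost of a remote access under a multi-hop topology versus point-to-point connectivity. The cycle costs and hop probabilities are illustrative assumptions chosen for the example, not IBM's published figures.

```python
# Illustrative model of remote-access cost: multi-hop (z13-style)
# versus point-to-point (zEC12/z14-style) drawer interconnects.
# All cycle counts and probabilities are assumed values chosen for
# illustration only, not IBM measurements.

DIRECT_ACCESS_CYCLES = 350  # assumed cost of a one-hop remote access
EXTRA_HOP_CYCLES = 250      # assumed additional cost per extra hop

def avg_remote_cost(hop_distribution):
    """Expected cycles per remote access, given {hops: probability}."""
    return sum(p * (DIRECT_ACCESS_CYCLES + (hops - 1) * EXTRA_HOP_CYCLES)
               for hops, p in hop_distribution.items())

# z13-style: depending on the node where the data resides, a remote
# access may need a second hop through an intermediate node.
z13_like = {1: 0.5, 2: 0.5}
# z14-style: the unified L4 restores direct connectivity everywhere.
z14_like = {1: 1.0}

print(f"multi-hop average:      {avg_remote_cost(z13_like):.0f} cycles")
print(f"point-to-point average: {avg_remote_cost(z14_like):.0f} cycles")
```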

PR/SM Algorithm Changes

Complementing the unified L4 cache, which reduced the cycles consumed by cross-drawer accesses, significant changes were made to the PR/SM algorithms to reduce the frequency of those remote accesses. Two algorithm changes are particularly noteworthy.

  1. One change prioritizes fitting an LPAR within a single drawer, seeking to reduce the frequency of expensive remote accesses. The increase from 8 to 10 cores per chip makes this more achievable by increasing the size of an LPAR that will now fit within a drawer (see the sketch after this list).
  2. A second change prioritizes placing General Purpose logical CPs (GCPs), both Vertical Highs (VHs) and Vertical Mediums (VMs), in close proximity on the same chip or cluster. This contrasts with the z13, which prioritized co-locating VHs for both GCPs and zIIPs, frequently at the expense of VM GCPs, which wound up in a separate drawer, especially on large LPARs.
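The arithmetic behind the first change is straightforward, as this sketch illustrates. The drawer geometry used here (six PU chips per drawer) is an assumption for illustration; substitute the actual configuration of your machine.

```python
# Back-of-the-envelope check of whether an LPAR's logical CPs can be
# packed into a single drawer.  The drawer geometry (chips per drawer)
# is an assumed value for illustration, not a configuration reference.

CORES_PER_CHIP_Z13 = 8
CORES_PER_CHIP_Z14 = 10
CHIPS_PER_DRAWER = 6  # assumption for illustration

def fits_in_one_drawer(logical_cps, cores_per_chip,
                       chips_per_drawer=CHIPS_PER_DRAWER):
    """True if the LPAR's logical CPs fit within one drawer."""
    return logical_cps <= cores_per_chip * chips_per_drawer

lpar_cps = 54  # hypothetical large LPAR
print("fits in one z13 drawer:",
      fits_in_one_drawer(lpar_cps, CORES_PER_CHIP_Z13))  # False
print("fits in one z14 drawer:",
      fits_in_one_drawer(lpar_cps, CORES_PER_CHIP_Z14))  # True
```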

Reducing the “RNI Penalty”: Case Study

When PR/SM placed VM logical CPs in a different drawer than the VHs for that LPAR (a common occurrence on z13 processors), it usually created a sizable RNI penalty for work executing on those VMs, because almost all their cache accesses traveled across drawers – very expensive in terms of machine cycles on the z13.

As we have seen, the unified L4 cache and PR/SM algorithm changes were designed to reduce the magnitude of that RNI penalty. This case study quantifies the impact of those changes at one site.
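For readers who want to connect the charts that follow to the underlying metrics: RNI is derived from SMF 113 (CPU MF) counters that report where L1 misses were sourced. The sketch below uses the formula IBM published for the z13; IBM adjusts the coefficients by processor generation, so treat them as indicative. The sourcing percentages in the example are hypothetical.

```python
# RNI (Relative Nest Intensity) estimated from CPU MF (SMF 113) data.
# Inputs are the percentages of L1 misses sourced from each level of
# the "nest": L3, local L4, remote L4 (another drawer), and memory.
# The coefficients follow IBM's published z13 formula; IBM revises
# the weights by processor generation, so treat them as indicative.

def rni_z13(l3p, l4lp, l4rp, memp):
    """z13 RNI from L1-miss sourcing percentages (each 0-100)."""
    return 2.3 * (0.4 * l3p + 1.6 * l4lp + 3.5 * l4rp + 7.5 * memp) / 100

# Hypothetical sourcing mixes: a VH sourcing mostly within its drawer,
# and a VM whose accesses travel cross-drawer (heavier remote L4 and
# memory sourcing), driving up its RNI.
print(f"VH-like RNI: {rni_z13(70, 20, 2, 8):.2f}")    # ~2.92
print(f"VM-like RNI: {rni_z13(45, 15, 25, 15):.2f}")  # ~5.57
```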


Figure 3. RNI by Logical CP – z13

Figure 3 presents the RNI by logical CP for a large LPAR when it was running on a z13. Note the very significant RNI penalty for the VMs (along the top of the chart) compared to the VHs: for this LPAR, the RNI for work executing on the VMs was 87% higher than on the VHs. Large LPARs on z13 processors frequently incurred a substantial RNI penalty, because the LPAR topology often located the VMs in a separate drawer from the large number of VHs.


Figure 4. RNI by Logical CP – z14

Figure 4 displays the significant change when this LPAR migrated to a z14. The RNI penalty for the VMs was dramatically reduced, from 87% on the z13 (above) to 23% on the z14.
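The penalty percentages quoted in this case study fall out of a simple ratio once RNI is averaged by logical CP polarity. In the minimal sketch below, the RNI values are placeholders chosen to match the case-study ratios, not the site's actual measurements.

```python
# VM-over-VH "RNI penalty": the percentage by which the average RNI of
# work on VMs exceeds that of work on VHs.  RNI inputs are placeholder
# values matching the case-study ratios, not measured data.

def rni_penalty_pct(vm_rni, vh_rni):
    return (vm_rni / vh_rni - 1.0) * 100

print(f"z13 penalty: {rni_penalty_pct(3.74, 2.0):.0f}%")  # 87%
print(f"z14 penalty: {rni_penalty_pct(2.46, 2.0):.0f}%")  # 23%
```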


Figure 5. LPAR Topology – z13

Diagrams of the underlying LPAR topologies on the z13 and z14 confirm that the previously described changes in the PR/SM algorithms had their desired outcomes.

When this LPAR was running on the z13 (Figure 5), the VMs (in orange) were in a separate drawer (Drawer 2) from all the VHs (in green, in Drawer 3). This led to frequent cross-drawer accesses by the VMs, causing a very significant RNI penalty.


Figure 6. LPAR Topology – z14

Figure 6 shows the changes to the topology after this LPAR migrated to the z14. Now the VMs (in orange) and VHs (in green) reside in the same drawer and even share the same L3 cache. In fact, PR/SM was able to configure the entire LPAR (GCPs and zIIPs) in a single drawer.

The outcome of this case study provides evidence that the PR/SM algorithm changes had their intended effect, reducing the RNI penalty for VMs dramatically because of improved proximity in the LPAR topology.

Increases in Selected z14 Cache Sizes

The z14 provides several cache size increases, captured in Figure 7. Level 1 instruction cache became one third larger, level 2 data cache doubled in size, and level 3 cache also doubled.


Figure 7. z14 cache size increases

As mentioned earlier, level 4 cache on the z14 has been unified, enabling point-to-point connectivity between all the level 4 caches. IBM's analysis indicates that the resulting reduction in latency for remote cache accesses more than offsets the reduction in level 4 cache size, from 960 MB spread across two separate nodes to a single 672 MB cache.

Constraints on processor chip real estate limit the opportunities for increased cache sizes. I expect IBM leveraged its machine-level instrumentation to deploy these increases in the tiers of the cache hierarchy where they would deliver the biggest overall reduction in cache misses requiring accesses to subsequent levels of cache or even memory.

Initial metrics from a z14 implementation show good results from these cache size increases. The Level 1 Miss Percentage (L1MP, one of the two primary variables, along with RNI, in IBM's LSPR methodology for classifying workloads as High, Medium, or Low complexity) decreased by approximately 15%, likely reflecting the increased size of the level 1 instruction cache as well as other z14 architectural improvements.

And the estimated lifetime of data in processor cache increased by at least 50% in levels 2 and 3 at this z14 site. It also improved slightly in level 4, despite the reduction in aggregate cache size at that level. These outcomes also translate into lower cache miss percentages and thus better RNI.

TLB Design Enhancements

The final design change listed earlier was that the level 1 Translation Lookaside Buffer (TLB1) was “merged” into level 1 cache. The TLB performs the critical and high-frequency function of translating virtual addresses into real addresses. On the z14, the level 1 cache contains all the data needed to perform that address translation function, eliminating the need for a separate TLB1 and the potential for any TLB1 misses.

Again, initial metrics from a z14 implementation show very good results from these TLB design enhancements. The changes reduced the total CPU consumed for TLB misses by more than half.
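The "CPU consumed for TLB misses" metric is derived from the CPU MF extended counters in SMF 113 records, which report the cycles spent resolving TLB1 misses. A short sketch follows; the counter names are illustrative placeholders, and the before/after values are hypothetical numbers shaped to show a greater-than-50% drop.

```python
# Estimated percentage of CPU consumed resolving TLB1 misses, derived
# from CPU MF extended counters (SMF 113).  Counter names are
# illustrative placeholders; the interval totals are hypothetical.

def tlb1_cpu_pct(tlb1i_miss_cycles, tlb1d_miss_cycles, total_cycles):
    return 100 * (tlb1i_miss_cycles + tlb1d_miss_cycles) / total_cycles

before = tlb1_cpu_pct(1.5e11, 2.5e11, 5.0e12)  # z13-era interval
after = tlb1_cpu_pct(0.6e11, 1.1e11, 5.0e12)   # z14 interval
print(f"z13 TLB1 miss CPU: {before:.1f}%")  # 8.0%
print(f"z14 TLB1 miss CPU: {after:.1f}%")   # 3.4%
```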

Summary

This series of articles has demonstrated that processor cache performance plays a more prominent role than ever before in the capacity delivered by z13 processors. And though z14 users can welcome the fact that many of the cache pain points experienced by high RNI workloads have been addressed by z14 design changes, processor cache behavior remains largely continuous between the two models.

For both z13 and z14 processors, unproductive cycles spent waiting for data to be staged into L1 cache continue to represent one third to one half of overall CPU. The era of the zEC12 and earlier processors, when processor speed "covered a multitude of sins" and mainframe performance analysts could get by with ignoring processor cache metrics, is gone, at least for the foreseeable future.

Another trend that continues unabated with the arrival of the z14 is that software expenses consume a growing percentage of the overall mainframe budget, while hardware costs represent an ever-smaller share. Considering these factors, these articles have encouraged readers to re-examine the traditional assumption of mainframe capacity planning that running mainframes at high utilizations is the most cost-effective way to operate. They have also advised readers to take the initiative to renegotiate ISV contracts to be usage-based, which places you in the enviable position of having the flexibility to select hardware configurations that achieve the lowest overall total cost of ownership.

Finally, this series of articles has sought to help performance analysts realize the importance of having a solid understanding of, and clear visibility into, key processor metrics. Armed with this understanding and visibility, there will often be opportunities to achieve significant CPU (and MLC) savings by leveraging those metrics to optimize processor cache. Hopefully, the methods presented to reduce RNI and CPU consumption, and the case studies confirming the effectiveness of these methods in real-life situations, have equipped readers to identify and implement optimizations that will improve the efficiency, competitiveness, and long-term viability of the mainframe platform in their environments.

Finding Hidden MLC Reduction Opportunities

For more information on finding opportunities to reduce MLC expenses in your environment even after capping, feel free to view my recorded technical webinar, Achieving CPU (& MLC) Savings on z13 and z14 Processors by Optimizing Processor Cache; previous iterations of this presentation won a 2016 SHARE Best Presentation award and a 2017 CMG Best Paper award.

And if you are interested in a no obligation MLC Reduction Assessment utilizing data from your own environment, you can request additional information about that here.

