Businesses often struggle to reduce mainframe and other IT infrastructure costs. Before you can get started reducing your mainframe costs, you should understand what drives those costs. To a large extent, how much you spend in total on software, hardware, and infrastructure management is driven by your consumption. What metric(s) does IBM provide to track and calculate how much you pay for software? How does that relate to hardware and infrastructure management?
Whether you are an experienced z/OS analyst, or someone new to the industry, you may be asked (or told) to recommend ways to reduce those costs. An important first principle is: KISS or Keep It Simple Stupid.
In this blog, I’ll cover the commonly used metric for IBM software licensing, and how you can track and report on your results using those measurements to help the business side of the house understand what’s behind the costs of what is often the largest line item in the IT budget. There are numerous ways to reduce your mainframe costs that we have covered in the past. I’ll link to those resources throughout this blog.
MSUs Consumed: The IBM Metric for Software Licensing
The primary metric for understanding mainframe costs is MSUs consumed. It provides a direct means to attribute software cost. While MSUs are not a direct measure of hardware, data center, and mainframe management costs, those costs correlate strongly with consumption. More consumption means bigger data centers, more power, and more people needed to manage the systems. I would argue there are economies of scale, but for near-term planning you won’t generally jump to a lower scale when it comes to costs.
The Sub-Capacity Reporting Tool (SCRT) is still the standard by which most sites pay IBM for z/OS software. To understand precisely how MSUs consumed drive your software expense, you need to know which pricing model you are currently under. The two primary models are:
Traditional 4-Hour Rolling Average (4HRA/R4HA)
Newer Tailored Fit Pricing Models (TFP)
Regardless of which pricing model you use, MSUs consumed is still the primary software pricing metric. The difference is mostly whether that consumption is counted for every hour covered in the licensing term (likely one or more years), or only for the monthly peak four-hour rolling average (4HRA).
4-Hour Rolling Average Cost Reporting
Under the traditional 4HRA/R4HA methodology, mainframe costs are attributed based on the peak 4-hour rolling average for each month.
The reduction strategy here often includes identifying which work can be moved from peak to non-peak periods, which would decrease your peak 4HRA and thus reduce your software bill. Not surprisingly, peak demand drives other costs too. It drives processor capacity allocations (hardware and software cost), and other hardware and infrastructure costs likely follow the same pattern.
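To make the rolling-average mechanics concrete, here is a minimal Python sketch. It assumes you already have one MSU value per hour; the sample values and function names are made up for illustration and are not part of SCRT, which works from its own interval data.

```python
from collections import deque

def rolling_4hr_average(hourly_msu, window=4):
    """Return the rolling average over the last `window` hourly MSU values.

    An average is produced only once a full window of hours is available.
    """
    averages = []
    recent = deque(maxlen=window)
    for value in hourly_msu:
        recent.append(value)
        if len(recent) == window:
            averages.append(sum(recent) / window)
    return averages

def peak_4hra(hourly_msu):
    """Return the highest rolling 4-hour average in the series."""
    return max(rolling_4hr_average(hourly_msu))

# Hypothetical hourly MSU values (illustrative only).
baseline = [400, 420, 500, 900, 950, 1000, 980, 600, 450]
# Same total work, with 100 MSU per hour moved out of the four busiest hours
# and into the first and last hours.
shifted = [600, 420, 500, 800, 850, 900, 880, 600, 650]

print("baseline peak:", peak_4hra(baseline))
print("shifted peak: ", peak_4hra(shifted))
print("total MSU unchanged:", sum(baseline) == sum(shifted))
```

With these sample numbers the peak drops from 957.5 to 857.5 MSU while total consumption stays the same, which is why time shifting helps under a peak 4HRA model but not under a model that charges for total consumption.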
MSUs Consumed Under Tailored Fit Pricing (TFP)
Under the newer TFP models, IBM charges based on total MSU consumption. This means that any MSU savings can have a material effect on your software costs regardless of when they occur, whether in the peak 4HRA or in the other 700+ hours of the month.
While each mainframe shop will have different agreements in terms of time frames, entitlement targets, and over/under target assessments, a reduction in ongoing MSU demand will eventually net reduced software expenses.
For more information regarding reducing software costs, refer to these resources:
Assuming you have a stake or responsibility for the mainframe budget, you need to know your organization’s budgeted annual z/OS software bill. Your focus can become more intense when you recognize the annual z/OS software cost (SC) and the pricing model used to attribute that cost (tailored fit pricing, 4HRA). Furthermore, ask for the total annual mainframe budget (AMB), while you are at it.
Once you have these budget lines, you can do the math using MSU consumptions, to determine a pretty good approximation of software cost and total costs.
If your demand is 10,000 MSU, at peak 4HRA, and a resource reduction effort saves 100 MSU consistently at that time, the annual savings going forward is 100/10,000*100 = 1%.
One percent may not seem like much, but in a shop where the annual software budget for z/OS is $10M, that’s $100,000 of annual savings (SC).
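The arithmetic above is easy to capture in a few lines. This is only a sketch of the KISS estimate described here, with a hypothetical function name; it assumes cost scales linearly with MSUs, which is the simplification this blog is deliberately making.

```python
def annual_savings(msu_saved, msu_baseline, annual_cost):
    """Estimate annual savings by scaling cost with the fraction of MSUs saved.

    Assumes cost is proportional to MSU consumption (a deliberate simplification).
    """
    return msu_saved * annual_cost / msu_baseline

# Figures from the example: 100 MSU saved out of 10,000 on a $10M software bill.
savings = annual_savings(msu_saved=100, msu_baseline=10_000, annual_cost=10_000_000)
print(f"Estimated annual software savings: ${savings:,.0f}")
```

The same function can be pointed at the total annual mainframe budget instead of the software bill if you choose to attribute other costs to MSU consumption as well.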
Let the accountants determine the value of your investment and the timing of booking the savings. While I have long forgotten my brief education in accounting, I believe any reduction in expenses generally goes directly to earnings. This seems to get the attention of executives, whose compensation is often based on earnings targets.
Without much effort, this estimate can be extended beyond software costs by assuming you have the whole mainframe budget, including IBM software broken out separately. If you similarly attribute other mainframe costs to the MSU consumption, the whole mainframe budget, not just software may be estimated. While this may be a gross generalization, there may be some benefits to doing so.
It’s not a stretch to say that software accounts for about 1/3 of the mainframe budget (SC/AMB = 1/3). With a $10M annual software bill, that puts the total budget near $30M, so a sustained 1% reduction could result in roughly $300k in annual savings. The math will be different for a different licensing model, but once you have MSUs (interval at peak 4HRA, or hourly consumption MSUs for TFP) you can get that percentage.
I’ll admit this is crude, and if you have more detail you want to include and use in your analysis, you can certainly do that. My tilt is toward the KISS principle so that I can spend my time on the real work of cost reduction, versus arguing over the accounting details. To put a finer point on this view, someone may contest that you haven’t considered capture ratio. Here’s my response:
If your capture ratio is around 95%, you are at most going to be 5% off with the reporting estimate, and the 5% difference won’t make or break the decision on whether to save 95% of what is on the table.
We should really focus our time on pursuing the savings rather than counting 100 out of 100 beans. Leave that for the bean counters.
How to Track Your MSU Consumption and Find Areas to Lower It
Since each IBM processor has a given MSU rating, attributing the consumption values over time is simply the central processor MSU rating * usage of the work in question/total usage.
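As a small illustration of that proportion, the sketch below apportions an MSU figure to a workload by its share of CPU usage. The function name and sample numbers are hypothetical, and `msu_basis` is an assumption on my part: the sentence above uses the processor's MSU rating, but you could equally pass the MSUs consumed in the interval if that is what your data provides.

```python
def workload_msu(msu_basis, workload_cpu_seconds, total_cpu_seconds):
    """Apportion an MSU figure to one workload by its share of CPU usage."""
    if total_cpu_seconds <= 0:
        raise ValueError("total_cpu_seconds must be positive")
    return msu_basis * workload_cpu_seconds / total_cpu_seconds

# Hypothetical interval: 1,000 MSU to apportion, CICS used 2,050 of 5,000 CPU seconds.
cics_msu = workload_msu(msu_basis=1000, workload_cpu_seconds=2050, total_cpu_seconds=5000)
print(f"CICS share: {cics_msu:.0f} MSU")
```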
Figure 1: Daily interval MSU consumption (CICS highlighted vs. total). CICS is over ½ of the total at the peak interval.
Figure 2 – Percentages CICS Full Day vs Peak 4HRA
Figure 2 presents the data from the area chart as two pie charts. These show the importance of knowing your MLC pricing model. For a full day, CICS accounts for approximately 41% of the total MSU consumption, whereas it represents 56% of the peak 4HRA for the same day.
If you were fortunate enough to achieve a 10% savings in MSU on the CICS workload, applying that savings at 41% vs. 56% of total consumption makes a big difference in your savings estimate. The gap between 4.1% and 5.6% savings on a $10M annual software bill is $150,000. Getting the fundamental assumptions aligned with the business side, starting with the software pricing model, will help build your credibility when you’re presenting to executives.
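Here is a quick sketch of that comparison, using the 41% and 56% shares and the $10M bill from the example. The function name is hypothetical, and the calculation again assumes cost is proportional to MSUs.

```python
def bill_savings(annual_bill, workload_share_pct, workload_reduction_pct):
    """Savings on the bill when one workload's MSUs drop by a given percentage.

    Percentages are passed as whole numbers (41 means 41%).
    """
    return annual_bill * workload_share_pct * workload_reduction_pct / 10_000

ANNUAL_BILL = 10_000_000   # $10M annual software bill from the example
CICS_REDUCTION_PCT = 10    # 10% MSU savings on the CICS workload

full_day = bill_savings(ANNUAL_BILL, 41, CICS_REDUCTION_PCT)  # consumption view
peak = bill_savings(ANNUAL_BILL, 56, CICS_REDUCTION_PCT)      # peak 4HRA view

print(f"Full-day share (41%): ${full_day:,.0f}")
print(f"Peak 4HRA share (56%): ${peak:,.0f}")
print(f"Difference: ${peak - full_day:,.0f}")
```

Which of the two numbers you present depends entirely on the pricing model you are under, so it is worth confirming that before quoting either one.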
Reporting on Mainframe Cost Savings
1) Identify Opportunities
The first step in optimization efforts is to identify opportunities. We’ve outlined some resources above to help you, but don’t stop there. Other resources include conference attendance (e.g., SHARE, GSE) and newsletters like IntelliMagic’s and Watson & Walker’s, which almost always contain ways to optimize your system configurations and workloads. Blogs from industry thought leaders in the Db2 and CICS arenas are also worth following.
2) Know Your Workloads and Configurations
Second, know your workloads and configurations. There’s no substitute for knowing your top 5 workloads, and if you’re on 4HRA, the top 5 workloads during the monthly peaks.
Reports like Figure 3 are essential. Tailor the views to what you want to focus on, and work with application teams on specific consumers, Db2 users, CICS transactions, and time frames.
Figure 3 – Top 5 MSU Consuming Db2 Authids
Configurations are relatively static, but the addition of processors, LPARs, and workload changes over time creates opportunities for Murphy to sneak in and create chaos.
3) Use Meetings to Highlight Optimization Opportunities
Third, use the regularly scheduled capacity / performance meetings to highlight your views of the most significant optimization opportunities. Several customers are using live dashboards during these meetings to highlight these opportunities and drill into the details.
Figure 4 – IntelliMagic Vision dashboard showing Top 10 MSU consumption drivers from several intervals
If you’re reading this blog, mainframe software costs constitute a significant portion of your IT budget. I hope this material provides you with some relatively simple steps to help your organization manage the mainframe costs of running the business. These include:
identifying the magnitude of mainframe application costs
knowing your mainframe budget details
knowing the differences of your mainframe software billing model
developing reporting and a cadence to the reporting that helps monitor and optimize these costs over time
Take the time to methodically go through your environment and make adjustments to lower your MSU consumption and software costs. Selecting the biggest opportunity is not always the best next step. Do a little qualification on the ‘size of effort’ to find a sweet spot and pursue it. Let us know how it goes…
Detailed Before-and-After Analysis of IBM z16 Upgrades
The z16 cache design marks a radical departure from previous IBM mainframes. Given the importance of efficient use of the cache subsystem to CPC performance and capacity, there has been great interest among the performance community in how z16s are performing in the ‘real world’. But there has been very little public information about actual customer experiences with these new models.
In this reprint from Cheryl Watson’s Tuning Letter 2023 No. 2, Todd Havekost provides detailed information and insights from his analysis of seven upgrades from both z14 and z15 CPCs to z16s.
From this analysis, mainframe sites interested in upgrading to a z16 will have an indication of what to expect in terms of how an upgrade to a z16 might impact their MSU consumption (and therefore, their software bills).
Software is the primary component in today’s world that drives mainframe expenses, and software expense correlates to CPU consumption in almost all license models. In this blog series, I am covering several areas where possible CPU reduction (and thus mainframe cost savings) can be achieved. The first two blogs (Part 1: Processor Cache, and Part 2: 4HRA, zIIP Overflow, XCF, and Db2 Memory) presented several opportunities that could be implemented through infrastructure changes.
This final blog in the series will cover potential CPU reduction opportunities that produce benefits applicable to a specific address space or application as listed here:
Write / rewrite programs in Java
Reduce job abends
Non-database data set tuning
Compile programs with current compiler versions and newer ARCH levels
Database and SQL tuning
Reduce HSM CPU consumption
As was also the case with the prior blogs, many items on these lists may not represent opportunities in your environment for one reason or another. But I expect you will find several items worth exploring which may lead to substantive CPU savings and reduced mainframe expense.
I will briefly address the first three items in this list, and then cover the last three in more detail.
Write or Rewrite Programs in Java
Since Java programs are eligible to execute on zIIP engines, and work that executes on zIIPs does not incur software expense, writing new programs or re-writing high CPU-consuming programs in Java can present a significant opportunity to reduce general purpose CPU and software expense.
Many sites find leveraging Java on the mainframe aligns well with their overall mainframe modernization initiatives. This was the case at one site with a successful initiative developing new RESTful APIs to access z/OS resources in Java. Leveraging z/OS Connect as their primary RESTful API provider, they have a 1700 MSU workload executing on zIIPs that would otherwise have been generating software expense on general-purpose central processors (GCPs).
Java programs also have an additional performance benefit for sites with sub-capacity processors. In contrast to GCPs running at the sub-capacity speed, zIIP engines always execute at full speed, which benefits Java programs executing on those zIIPs.
Reduce Job Abends
Reducing job abends that have to be rerun and thus consume additional CPU can represent an opportunity for sites operating under a consumption-based software licensing model.
The cost saving benefit of avoiding job reruns provides additional impetus to the traditional rationale for wanting to reduce abends, namely avoiding elongating batch cycle elapsed times and their potential impacts on online transaction availability or other business deliverables.
Non-Database Data Set Tuning
Non-database data set tuning represents another opportunity to apply the principle that reducing I/Os saves CPU.
Avenues to explore here include ensuring data sets are using system-determined blocksize and improving buffering for high-I/O VSAM data sets through approaches such as Batch LSR or System-Managed Buffering.
Compile Programs with Current Compiler Versions and Newer ARCH Levels
Since IBM Z machine cycle speeds have been more or less flat since the zEC12 and are likely to remain that way in the future, increases in capacity delivered by each new processor generation are increasingly dependent on other factors. One very important aspect of that is exploiting new instructions added to the architecture for each new model.
IBM hardware chip designers have been working closely with compiler developers for some time now to identify frequently used high-level language statements and add new CPC instructions to optimize the speed of those operations. As each processor model delivers a new set of hardware instructions, that becomes a new architecture (ARCH) level.
A simple way to remember the ARCH level is that it is “n-2” from the processor model, so for example a z15 CPC has an ARCH level of 13.
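The rule of thumb can be written as a tiny helper. This is an illustrative sketch with a hypothetical function name; as an assumption of the sketch, it only accepts z13 and later models, where the model names follow a simple numeric pattern.

```python
def arch_level(model: str) -> int:
    """Return the ARCH level for a model name like 'z15' using the n-2 rule.

    Only z13 and later are accepted (an assumption of this sketch).
    """
    generation = int(model.strip().lower().lstrip("z"))
    if generation < 13:
        raise ValueError("this sketch only covers z13 and later models")
    return generation - 2

for cpc in ("z14", "z15", "z16"):
    print(cpc, "-> ARCH level", arch_level(cpc))
```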
Figure 1: Compiling with Newer ARCH Levels
The pie chart in Figure 1 shows that almost 20% of the current Z instruction set has been introduced since the z13. And the z16 added 20 more instructions.
Programs that have not been recompiled recently with current compiler versions may be missing out on significant efficiencies.
The table in Figure 2 shows an “average” percent improvement for COBOL programs in IBM’s sample mix from one ARCH level to the next (of course your mileage will vary). The IBM graphic on the right of the figure details specific COBOL functionality that was improved for each ARCH level. We see that decimal floating point arithmetic has been a particular point of emphasis.
Figure 2: Efficiencies at Newer ARCH Levels
Bottom line, compiling and recompiling COBOL programs with current compiler versions and ARCH levels is probably one of the biggest potential CPU reduction opportunities. This is a message IBM’s David Hutton repeatedly emphasizes in his conference presentations on processor capacity improvements.
Database and SQL Tuning
Another area that has the potential to generate significant CPU efficiencies is database and SQL tuning.
The data manipulation capabilities of SQL are extremely powerful, but execution of the requested query can be very resource intensive, particularly since SQL’s advertised ease-of-use often puts coding of queries in the hands of inexperienced users.
To me that feels a bit like handing the keys to your Ferrari to your twelve-year-old. Sure you can do it, but what kind of outcome do you expect?
So giving attention to optimizing high-volume queries and the structures of their associated tables can be a source of major CPU-reduction opportunities. Let’s consider a real-life example of dramatic benefits from a Db2 SQL change.
Figure 3 is an example where visibility provided by a top 10 CPU report can help focus the analysis. Looking here at top DDF consumers, an opportunity was identified involving the second highest CPU consuming authorization ID.
Figure 3: Top 10 DDF Class 2 CPU – Before
CPU reduction efforts can represent a great opportunity for the performance optimization team to partner with Db2 teams in their buffer pool and SQL tuning efforts. The focus of the Db2 team is often on reducing response times by turning I/Os synchronous with units of work into memory accesses resulting from buffer pool hits. They typically seek to accomplish that in one of two ways, by increasing buffer pool sizes, or even better, by optimizing the SQL to eliminate the getpages and I/Os altogether.
The latter took place here – a SQL change that resulted in a dramatic reduction in I/Os (seen in Figure 4). The DBA SQL experts identified a table structure change. (It involved denormalized tables to avoid the need to run summarization calculations on child rows. Denormalization is the intentional duplication of columns in multiple tables to avoid CPU-intensive joins.)
This change reduced I/Os for the affected Db2 page set (red line) from peaks of 20K per second down to 500, a 97% reduction.
Figure 4: Db2 Sync I/Os (Top 10)
As you might expect, there was a similar magnitude of CPU reduction for the associated authorization ID (shown in Figure 5). This is another application of the principle we have seen across many of the tuning opportunities listed across all three blogs, namely, whenever you can eliminate I/Os, you achieve big wins for both CPU and response times.
Figure 5: Db2 Class 2 CP Time
Comparing before and after CPU for the affected authorization ID indicated a CPU reduction (15% of a CP) that translated into 175K consumption MSUs annually. And users experienced a 90% reduction in elapsed time per commit (Figure 6). So partnering together with the Db2 team you collectively become heroes not only for saving money but also with users for improved response times.
Figure 6: Db2 Response Time
The last tuning opportunity we will cover in this blog is HSM.
System-started tasks like HSM are workhorses, performing tasks essential to the functioning and management of z/OS environments. But they can also consume sizable quantities of CPU. I am highlighting HSM here because it is found very frequently in lists like Figure 7 of the “10 Most Wanted” CPU users among system tasks.
Figure 7: Consumption MSUs – Top Started Tasks
Figure 8: Section Headers from “Optimizing Your HSM CPU Consumption”, Cheryl Watson’s Tuning Letter 2018 #4.
Fortunately, Frank Kyne and IBM’s Glen Wilcock wrote an extensive article on this subject in the 2018 #4 issue of Cheryl Watson’s Tuning Letter that explained many suggested ways to reduce HSM CPU, as indicated by the section headings captured in Figure 8.
Achieving Mainframe Cost Savings Through CPU Optimization Opportunities
There is a pervasive focus today across the industry on achieving mainframe cost savings, which is primarily achieved through reducing CPU. This series of blog articles has sought to equip teams tasked with leading such initiatives by expanding awareness of common types of CPU optimization opportunities. It has presented a “menu” of opportunities, some broadly applicable across the infrastructure, and others that benefit a single address space or application.
Along with this awareness, optimization teams are aided in their mission by having good visibility into top CPU consumers and top drivers of CPU growth to help focus limited staff resources on the most promising opportunities. Several of the “top 10” views utilized through this blog series facilitate that.
And finally, effective analytical capabilities are essential in order to maximize CPU savings. The multitude of diverse workloads and subsystems in today’s mainframe environments combine to generate a mountain of data, making the capability to rapidly derive answers to identify and evaluate the multitude of potential CPU savings opportunities essential. Time consuming analysis reduces the set of opportunities that can be investigated and thus will inevitably result in missed savings opportunities. Or said another way, the site that can explore and close on ten lines of analysis in the time it takes another site to answer a single query will be able to identify and investigate and implement ten times as many savings opportunities.
The below resources will also be of use to you as you continue on your CPU reduction journey.
Software is the primary component in today’s world that drives mainframe expenses, and software expense correlates to CPU consumption in almost all license models. In this blog series, I cover several areas where possible CPU reduction (and thus mainframe cost savings) can be achieved. In the first blog, I covered Processor Cache optimization opportunities. In this blog, part 2 of Infrastructure Opportunities for CPU reduction, I will cover several additional Infrastructure opportunities:
Moving work outside the monthly peak R4HA interval
Reducing zIIP overflow
Reducing XCF message volumes
Leveraging Db2 memory to reduce I/Os
As discussed in the previous blog, optimization opportunities commonly found in the infrastructure often have two significant advantages over opportunities applicable to individual address spaces or applications:
They can benefit all work or at least a significant portion of work across the system.
They can often be implemented by infrastructure teams without requiring involvement of application teams, which understandably often have different priorities.
In the ideal situation, you can get IT-wide buy-in and commitment to succeed in identifying and implementing CPU reduction opportunities. But even if you don’t have that, you can still implement many of these infrastructure changes.
1) Move Work Outside Monthly Peak R4HA Interval
In addition to sites operating primarily under a monthly peak rolling 4-hour average (R4HA) interval license model, it is common for companies that have adopted consumption-based models to still have some software that is managed with peak R4HA terms. So many readers may find the potential of moving work outside the monthly peak interval to be a cost-saving opportunity worth investigating, particularly when those peak intervals occur at predictable times during the month.
Some sites have prime shift peaks set by online workloads, often on a particular day of the week. In other cases, monthly peaks occur during the night shift and are driven by batch, often at month-end. Opportunities for savings from workload shifting can arise when predictable monthly peaks combine with a sizable delta between that peak and typical activity levels (as shown in Figure 1).
Figure 1: R4HA MSUs
Started Task Maintenance Activities
One common opportunity for time shifting arises from regularly scheduled started task maintenance activities such as those performed by HSM or job schedulers.
The time band chart in Figure 2 captures one such scenario, with a narrow range of CPU between the 10th and 90th percentile days for the selected month indicating a highly repeatable daily surge in CPU consumed by the job scheduler during the 1300 hour. If this hour commonly falls in the monthly peak 4-hour interval, rescheduling that daily maintenance represents a potential savings opportunity.
Figure 2: CPU Usage – Job Scheduler
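If you want to look for this pattern in your own data, the following sketch shows one way to compute a 10th/90th percentile band per hour and flag hours that are both busy and highly repeatable. The data, thresholds, and function names are all hypothetical; it uses Python's `statistics.quantiles` with its default method.

```python
from statistics import median, quantiles

def time_band(daily_values):
    """Return the (10th percentile, 90th percentile) of one hour's daily values."""
    deciles = quantiles(daily_values, n=10)  # nine cut points: 10th ... 90th
    return deciles[0], deciles[-1]

def repeatable_surges(cpu_by_hour, min_median, max_relative_spread=0.2):
    """Return hours whose CPU is high (median) and consistent (narrow band)."""
    hits = []
    for hour, values in sorted(cpu_by_hour.items()):
        p10, p90 = time_band(values)
        mid = median(values)
        if mid >= min_median and (p90 - p10) / mid <= max_relative_spread:
            hits.append(hour)
    return hits

# Hypothetical CPU seconds for one started task, ten days per hour of day.
cpu_by_hour = {
    12: [10, 12, 9, 11, 30, 8, 10, 12, 11, 9],
    13: [80, 82, 79, 81, 80, 83, 78, 81, 80, 82],
    14: [11, 40, 9, 35, 10, 12, 38, 9, 11, 10],
}

print("Repeatable surge hours:", repeatable_surges(cpu_by_hour, min_median=50))
```

The relative-spread threshold is a judgment call; a narrower band simply means the surge is more predictable and therefore easier to reschedule with confidence.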
Batch During Online Peaks
In contrast to online transaction-oriented workloads that must be serviced immediately, batch workloads often have some flexibility in terms of when they execute or at least the CPU intensity they receive.
For sites with online monthly peaks, this can lead to opportunities to examine batch executing during those online hours. Since batch typically executes in service classes with lower WLM importance levels (e.g., 4, 5, Discretionary), the view shown in Figure 3 can help quantify the size of the potential opportunity.
Figure 3: CPU Usage by WLM Importance Level
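To size that opportunity from raw numbers, a sketch like the one below sums CPU by WLM importance level for the peak hours and reports the share consumed at the lower importance levels. The record layout, sample values, and function names are hypothetical; in practice the input would come from whatever reporting you already have on service class CPU usage.

```python
from collections import defaultdict

def cpu_by_importance(records):
    """Sum CPU seconds per WLM importance level from (level, cpu_seconds) pairs."""
    totals = defaultdict(float)
    for level, cpu_seconds in records:
        totals[level] += cpu_seconds
    return dict(totals)

def low_importance_share(totals, low_levels=("4", "5", "Discretionary")):
    """Fraction of total CPU consumed at the lower importance levels."""
    total = sum(totals.values())
    low = sum(totals.get(level, 0.0) for level in low_levels)
    return low / total if total else 0.0

# Hypothetical CPU seconds during the online peak, keyed by importance level.
peak_records = [
    ("1", 300.0), ("1", 200.0), ("2", 300.0),
    ("4", 120.0), ("5", 50.0), ("Discretionary", 30.0),
]

totals = cpu_by_importance(peak_records)
print(f"Low-importance share of peak CPU: {low_importance_share(totals):.0%}")
```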
Peaks During Month-End Batch Processing
For sites with monthly peaks occurring during month-end processing at night, analysis at the address space level will usually be required to identify jobs that could be candidates for being scheduled outside that peak interval.
Figure 4 represents one such view, initially filtered by address spaces executing in service classes classified as discretionary to WLM.
Figure 4: CPU Usage (Top 10) – WLM Importance Level of Discretionary
2) Reduce zIIP Overflow
Another infrastructure-driven area that can help achieve cost savings is reducing zIIP-eligible CPU that overflows onto general purpose CPs (GCPs) where it is chargeable for software expense.
As with other potential opportunities, an initial step is to quantify the size of the opportunity. The monthly view of zIIP overflow consumption MSUs by workload shown in Figure 5 indicates that the Db2 workload (in yellow) is the primary driver of the overflow, with an amount varying between 15K and 35K consumption MSUs per month.
Figure 5: Monthly zIIP Eligible Consumption MSUs by Workload
Assuming that is worth pursuing as a percentage of overall consumption, analysis would continue by drilling into the details of that Db2 workload. In this scenario, the overflow work is coming almost entirely from a single service class which consists of DDF work.
At that point, analysis would likely proceed by moving over into Db2 Accounting (SMF 101) data and viewing the DDF zIIP overflow CPU by Authorization ID (a common way to analyze DDF work in Db2). That view (Figure 6) shows five auth IDs that are responsible for almost all the spillover work, and also shows a very consistent pattern of big spikes around midnight each day.
Figure 6: CP Time Used for zIIP Eligible Work (Top 5) by Authorization ID
That raises the question as to whether this spillover is because the zIIP CPs are very busy or is due to a very spiky arrival rate. A view of zIIP utilization (not shown) indicates the zIIPs are not overly busy, so that points to arrival rate as the driver of this spillover.
One approach for further analysis within Db2 would be to examine the rate of getpages, a key measure of Db2 activity. Viewing that data for the DDF workload (Figure 7) shows a huge spike (over 2 million per second) every day at midnight for the top auth ID (in red), validating that the spillover from zIIPs to GCPs is indeed due to an extremely spiky arrival rate.
Figure 7: Getpages by Authorization ID
So if you are in a consumption-based software license model, or in a peak R4HA model with monthly peaks that include midnight, you would want to explore if the work driven by that auth ID could afford to wait the few milliseconds until a zIIP CP would likely become available. If so, you could reduce this spillover by assigning this work to a WLM service class defined with Honor Priority set to NO (the service class counterpart of the system-wide IIPHONORPRIORITY setting in IEAOPTxx), which keeps that work from overflowing to GCPs.
3) Reduce XCF Message Volumes
As sysplexes have evolved, the amount of communication and sharing across the systems has increased dramatically. This intra-sysplex communication is facilitated by two key components, the Coupling Facility (CF) that maintains shared control information (e.g., database locks), and XCF (Cross-System Coupling Facility) that facilitates point-to-point communication by sending messages to other members of an application group. CF metrics are relatively well known, but XCF metrics (from RMF 74.2 records) have typically received less visibility.
XCF is exceptionally good at reliably delivering messages at high volumes. I have seen volumes far exceeding one million messages per second. But it is so good at its job that it is often ignored.
The problem with that is that sending and receiving high volumes of messages drives CPU, both for XCFAS and the address spaces processing the messages. So system configuration decisions that generate unnecessarily high XCF message volumes can waste a significant amount of CPU.
4) Leverage Memory to Reduce I/Os – Db2 Use of Large Frames
One underlying driver of numerous CPU reduction opportunities across infrastructure areas is leveraging memory to reduce I/Os, since memory accesses are far more CPU efficient (not to mention far faster) than the many operations required to perform I/Os.
As the biggest mainstream exploiter of large quantities of central storage, Db2 may represent your biggest potential source of CPU reduction opportunities from trading memory for I/Os.
Buffer Pool Tuning
Db2 buffer pool tuning typically represents the biggest opportunity for large scale memory deployment. Db2 industry experts present various buffer pool tuning methodologies, which utilize various criteria for identifying buffer pools which are top candidates to receive more memory, including Total Read I/O Rate, Page Residency Time, and Random BP Getpage Hit Ratio.
Using Large Frames to Back Buffer Pools with High Getpage Rates
But there is another incremental memory-related CPU opportunity with Db2 that goes beyond pool sizes. It involves using fixed large 1MB or 2GB frames to back buffer pools with high getpage rates. This can save CPU because large page frames make translations of virtual storage to real storage addresses more CPU efficient, reducing the CPU cost of accessing pages that are cached in the buffer pool.
IBM indicates measurable savings can be achieved when large frames are used to back buffer pools with getpage rates exceeding 1000 per second.
When a Db2 buffer pool is defined to be page-fixed and z/OS has 1MB fixed frames available, that buffer pool will be backed with 1MB large frames, resulting in CPU savings from more efficient address translations. A view like Figure 8 can be used to identify the buffer pools with the highest getpage activity in a data sharing group, which are the top candidates to leverage available large frames.
Figure 8: Getpages for 4K Buffer Pools for selected Data Sharing Group
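The selection logic can be sketched in a few lines. The buffer pool names, rates, and the `BufferPool` structure below are hypothetical; the only number taken from the text is the 1000 getpages per second threshold.

```python
from dataclasses import dataclass

@dataclass
class BufferPool:
    name: str
    getpages_per_sec: float
    page_fixed: bool

def large_frame_candidates(pools, threshold=1000.0):
    """Return pools whose getpage rate exceeds the threshold, busiest first."""
    hot = [pool for pool in pools if pool.getpages_per_sec > threshold]
    return sorted(hot, key=lambda pool: pool.getpages_per_sec, reverse=True)

# Hypothetical buffer pools in one data sharing group.
pools = [
    BufferPool("BP1", 25_000.0, page_fixed=True),
    BufferPool("BP2", 800.0, page_fixed=False),
    BufferPool("BP8", 4_200.0, page_fixed=False),
]

for pool in large_frame_candidates(pools):
    note = "already page-fixed" if pool.page_fixed else "not page-fixed yet"
    print(f"{pool.name}: {pool.getpages_per_sec:,.0f} getpages/sec ({note})")
```

Pools that show up as candidates but are not yet page-fixed are the ones to discuss with the Db2 team first, since page-fixing is the prerequisite for being backed by 1MB frames.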
In z/OS 2.3, IBM made significant enhancements to the flexibility with which z/OS manages large frames. The big enhancement that relates directly to this subject is that the 1MB LFAREA (large frame area) is now managed dynamically by z/OS, up to the size specified in the IEASYSxx Parmlib member.
That storage is no longer reserved and set aside at IPL time as it was prior to 2.3. Thus, it can be allocated with some headroom and grown into as buffer pool sizes increase, which is particularly helpful since the LFAREA upper limit definition can only be changed with an IPL.
Since the LFAREA parameter can only be changed with an IPL, it needs to be managed carefully. It is a Goldilocks parameter; you want it to be “just right.” Too high can cause shortages of regular 4K pages for the rest of the system, likely seen initially in demand paging rate and possibly more severe symptoms. Too low and Db2 fails to gain the CPU efficiencies that large frames could provide.
The RMF metrics for managing fixed 1MB frames in the LFAREA are not complicated, but Db2 and z/OS teams will want to have good visibility into them. Figure 9 provides a view that brings together key metrics for a z/OS system:
Total central storage allocated to the LPAR (in yellow)
Size of the LFAREA (green)
Fixed 1MB frames in use (red)
Fixed 1MB frames available (blue)
Figure 9: Total Storage and 1MB Fixed Frames
So in this example, if some Db2 buffer pools have higher getpage rates and could make good use of additional 1MB frames, capacity exists without requiring an IPL.
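A simple headroom check along these lines is sketched below. The sizes are hypothetical, the function names are mine, and the sketch assumes that available 1MB frames can be approximated as the LFAREA size minus the fixed 1MB frames in use; check that against the metrics your own reporting provides.

```python
def lfarea_headroom_mb(lfarea_mb, fixed_1mb_in_use_mb):
    """Approximate unused LFAREA capacity in MB (never negative)."""
    return max(lfarea_mb - fixed_1mb_in_use_mb, 0)

def fits_without_ipl(extra_bufferpool_mb, lfarea_mb, fixed_1mb_in_use_mb):
    """True if the extra buffer pool storage fits in the current LFAREA headroom."""
    return extra_bufferpool_mb <= lfarea_headroom_mb(lfarea_mb, fixed_1mb_in_use_mb)

# Hypothetical system: 64 GB LFAREA with 40 GB of fixed 1MB frames in use.
LFAREA_MB = 64 * 1024
IN_USE_MB = 40 * 1024

print("Headroom (MB):", lfarea_headroom_mb(LFAREA_MB, IN_USE_MB))
print("8 GB more fits: ", fits_without_ipl(8 * 1024, LFAREA_MB, IN_USE_MB))
print("32 GB more fits:", fits_without_ipl(32 * 1024, LFAREA_MB, IN_USE_MB))
```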
Finding Mainframe Cost Savings Through Infrastructure-Wide Opportunities
CPU reduction is the primary means of attaining mainframe cost savings, and effective analysis is essential in identifying and evaluating potential CPU reduction opportunities. The first two blogs in this series have explained several items on this “menu” of common infrastructure-related CPU optimization opportunities.
Figure 10: Infrastructure Opportunities
Future iterations of this blog will present other avenues for potential savings applicable to specific address spaces. Many items on both lists may not represent opportunities in your environment for one reason or another. But I expect you will find several items worth exploring that may lead to substantive CPU savings and, with them, reduced mainframe expense.
Combining great visibility into top CPU consumers and top drivers of CPU growth along with awareness of common opportunities for achieving efficiencies positions you for success as you focus your scarce staff resources on the most promising opportunities.
Ways to Achieve Mainframe Cost Savings
This webinar is designed to expand your awareness of ways sites have reduced CPU and expenses so that you can identify ones potentially applicable in your environment.
Software is the primary component in today’s world that drives mainframe expenses, leading to a direct correlation between software expansion and CPU consumption. Whether it’s a one-time charge model, total CPU consumption, or monthly peaks in rolling four-hour average models, the primary opportunity to achieve cost savings is to reduce CPU. To accomplish this, effective analysis is essential to identify and evaluate potential CPU reduction opportunities.
However, with the multitude of diverse workloads on the mainframe, it becomes essential to have visibility into top CPU consumers and top drivers of CPU growth to focus on the most promising opportunities. One way this can be achieved is through dynamic dashboards designed to give team members a common view across the organization and to enable them to quickly identify and drill down to pursue more detailed analysis of metrics of interest. Figure 1 illustrates a dashboard that captures top 10 CPU consumers for several types of workloads.
Figure 1: IntelliMagic Vision Dashboard with “Top 10” reports
In addition to CPU reduction, supporting sizable business volume growth with minimal incremental CPU can be another primary avenue of savings for enterprises. For instance, supporting 30% growth in business volumes with a 7% increase in CPU could be considered a significant win.
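As a quick worked check of that example, the CPU consumed per unit of business volume can be derived from the two growth rates. This is a minimal sketch; the function name is illustrative and not part of any product.

```python
def cpu_per_unit_change(volume_growth: float, cpu_growth: float) -> float:
    """Return the fractional change in CPU consumed per unit of business volume.

    Both arguments are fractional growth rates (0.30 means 30% growth).
    A negative result means each unit of business now costs less CPU.
    """
    return (1.0 + cpu_growth) / (1.0 + volume_growth) - 1.0


# The example from the text: 30% more business volume on 7% more CPU.
change = cpu_per_unit_change(volume_growth=0.30, cpu_growth=0.07)
print(f"CPU per unit of business volume changed by {change:.1%}")
```

In that example each unit of business volume ends up costing roughly 18% less CPU than before, which is why it counts as a win even though total CPU went up.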
Timely analysis is also crucial in today’s world, with the growing prevalence of consumption-based software license models where all CPU is in scope. This increases the importance of having visibility into CPU drivers that may be executing at any time.
Furthermore, declining staffing levels leave less time available for analysis, which makes it critical to have capabilities that provide rapid answers for the limited amount of time available for analysis. The growing wave of retirements of 30-plus-year specialists further emphasizes the value of tooling that enables the remaining skilled staff to collaborate across disciplines and contribute in areas outside their specialty. Intuitive shared tooling helps facilitate this teamwork.
The lengthy time often required to answer even relatively simple analytical questions when using legacy tooling reduces the scope of availability exposures and efficiency opportunities that the team can investigate, and thus inevitably results in missed opportunities. Said another way, the site that can explore and close on ten lines of analysis in the time it takes another site to answer a single query will be able to identify, investigate, and implement ten times as many optimizations.
CPU Optimization Opportunities: Infrastructure Efficiencies
Potential CPU optimization opportunities can be grouped into two primary categories:
Ones that are applicable to individual address spaces or applications; and
Ones that apply more broadly across the infrastructure.
This blog will cover some significant optimization opportunities commonly found in the infrastructure. These opportunities often have two significant advantages.
They can benefit all work or at least a significant portion of work across the system.
They can often be implemented by infrastructure teams without requiring involvement of application teams, which understandably often have different priorities.
In the ideal situation, you can get IT-wide buy-in and commitment to succeed in identifying and implementing CPU reduction opportunities. In a very successful MLC reduction initiative at my previous employer, the executive over Infrastructure had also previously led Application areas, so he had organization-wide influence and credibility. But if you don’t have that, you can still implement many of the infrastructure changes we will now cover.
Processor Cache Efficiencies
With the architectural changes first introduced with the z13, processor cache efficiency became a major factor in CPU consumption. IBM has implemented cache-related architectural enhancements for each new processor generation, yet for many workloads opportunities to achieve processor cache efficiencies can still contribute toward significant CPU reductions. How can you determine whether this represents an opportunity for you?
First, identify the volume of work executing on Vertical Low (VL) logical CPs. This represents work exceeding the LPAR’s guaranteed share derived from its LPAR weight. Vertical High (VH) CPs have dedicated access to a physical CP and its associated cache. But work running on VLs is exposed to cross-LPAR contention for cache so that its data often gets flushed out of cache by data from other LPARs.
Figure 2 provides a view to answer the question “do you have a significant amount of work executing on VLs?” If not, you can look elsewhere. If so (as is the case in Figure 2), then you need to answer one other question.
Figure 2: MIPS Dispatched on Vertical Low CPs
“Is there a significant difference in Finite Cycles per Instruction (CPI) between your work executing on VLs and that running on VHs and VMs?” Finite CPI quantifies the machine cycles spent waiting for data and instructions to be staged into Level 1 cache so that productive work can be done. So another way to express this question is “do your VLs have a significant Finite CPI penalty?”
To answer that question, you first need to understand your Vertical CP configuration. In the example in Figure 3, the 10 logical CPs for this system are configured as 5 VLs (in red & blue) and a combination of 5 VHs and VMs (in green and yellow).
Figure 3: Vertical CP Configuration
Armed with this understanding, you can examine Finite CPI by logical CP to quantify the penalty for the VLs, as shown in Figure 4. A sizable gap between the two sets of lines as seen at the arrow here is a visual indicator of a substantial VL penalty.
Figure 4: Finite CPI by Logical CP
Here is how the math works. First, quantify the Finite CPI penalty for the VLs. To translate this into CPU savings, it needs to be combined with the contribution of Finite CPI (waiting cycles) to overall CPI, a value also provided by the SMF 113 data.
On this system, the work executing on VL logical CPs incurs a 39% Finite CPI penalty, and Finite CPI (waiting cycles) makes up 51% of total CPI. So multiplying those two numbers results in a total CPU penalty of 20% for work executing on VL CPs. Or expressed another way, on this system changes that reduce the amount of work executing on VLs would reduce the CPU for that work by 20%.
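The arithmetic above can be written out as a short sketch. The function and parameter names are illustrative only; the two inputs are the values quoted for this system.

```python
def vl_cpu_penalty(finite_cpi_penalty: float, finite_cpi_share: float) -> float:
    """Estimate the total CPU penalty for work running on Vertical Low CPs.

    finite_cpi_penalty: how much higher Finite CPI is on VLs than on VHs/VMs
                        (0.39 means 39% higher).
    finite_cpi_share:   fraction of total CPI made up of Finite CPI
                        (0.51 means waiting cycles are 51% of total CPI).
    """
    return finite_cpi_penalty * finite_cpi_share


# Values quoted in the text for this system.
penalty = vl_cpu_penalty(finite_cpi_penalty=0.39, finite_cpi_share=0.51)
print(f"Estimated CPU penalty for work on VLs: {penalty:.1%}")
```

The product is 0.1989, which the text rounds to 20%.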
If you have sizable cache efficiency opportunities like this, what kind of changes can you explore?
Processor Cache Efficiency: LPAR Weight Changes
One course of action could be LPAR weight changes to improve the vertical CP configuration. A good way to identify any such opportunities is through an Engine Dispatch Analysis view as seen in Figure 5, which compares CPU usage of an LPAR over time to its allocated share.
Figure 5: Engine Dispatch Analysis – SYS1
This view presents four variables.
The number of physical CPs on the processor (in yellow, 18 here)
The number of logical CPs for the selected LPAR (in blue, 9 here)
The LPAR guaranteed share (in green), which is a function of LPAR weight and the number of physical CPs
Note: The LPAR share is variable here because IRD (Intelligent Resource Director) is active causing PR/SM to dynamically change LPAR weights.
The interval CPU consumption in units of CPs (in red)
We will analyze two different LPARs that reside on the same processor. The arrows on Figure 5 highlight night shift intervals for this first LPAR (SYS1) where CPU usage far exceeds its share, resulting in lots of work executing on VLs.
Compare this to the situation for a second LPAR (SYS2) during those night shift intervals, shown in Figure 6. The guaranteed share (in green) is far higher than CPU usage (in red), reflecting that SYS2 has far more weight than it needs.
Figure 6: Engine Dispatch Analysis – SYS2
This is an example where automated LPAR weight changes between these two LPARs could help that night-shift work run more efficiently on SYS1 with no adverse impact to SYS2.
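The comparison behind this kind of view can be sketched in a few lines. The share formula used here (the LPAR's weight as a fraction of the total weight, multiplied by the number of physical CPs) is the commonly used calculation and is an assumption on my part, as are the function names, the weights, and the usage samples; only the 18 physical CPs comes from the example above.

```python
def guaranteed_share_cps(lpar_weight: float, total_weight: float, physical_cps: int) -> float:
    """Guaranteed share in units of CPs: this LPAR's fraction of total weight times physical CPs."""
    return lpar_weight / total_weight * physical_cps


def intervals_over_share(usage_cps: list, share_cps: float) -> list:
    """Return the indexes of intervals where CPU usage exceeded the guaranteed share."""
    return [i for i, used in enumerate(usage_cps) if used > share_cps]


# Hypothetical weights on an 18-way processor (all weights sum to 1000).
sys1_share = guaranteed_share_cps(lpar_weight=250, total_weight=1000, physical_cps=18)
sys2_share = guaranteed_share_cps(lpar_weight=500, total_weight=1000, physical_cps=18)

# Hypothetical interval CPU usage for SYS1, in units of CPs.
sys1_usage = [3.0, 6.5, 7.0, 4.0]
over = intervals_over_share(sys1_usage, sys1_share)

print(f"SYS1 share: {sys1_share} CPs, SYS2 share: {sys2_share} CPs")
print(f"SYS1 intervals running above share: {over}")
```

Intervals that show up in that list are the ones where work is likely spilling onto VLs, which makes them candidates for a scheduled weight change.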
Processor Cache Efficiency: Sub-capacity Models
A second approach to consider to potentially improve cache efficiency is sub-capacity (“sub-cap”) processor models. They were originally created to give smaller customers more granular upgrade options. But having more (slower) physical CPs means more CPs for the same purchased capacity and thus more processor cache. Since effective use of cache can play a big role in delivered capacity, sub-cap CPCs often deliver more capacity than expected.
Historically this was considered only of interest to sites with relatively small installed capacities. But today large sites often have a mix of larger production CPCs and smaller CPCs with development LPARs, and the latter may also be excellent candidates for a sub-cap model.
Also, the fact that 6xx models on the z16 now run at x% of full speed (vs. x% on prior z15 models) can expand the pool of sites that could benefit from sub-cap models. The legitimacy of this approach was reinforced earlier this year during the SHARE Performance Hot Topics session in Atlanta when Brad Snyder from the IBM Washington Systems Center also advocated for a growing role for sub-cap models.
CPU reduction is the primary means of attaining mainframe cost savings, and effective analysis is essential in identifying and evaluating potential CPU reduction opportunities. Visibility into top CPU consumers and top drivers of CPU growth can help focus scarce staff resources on the most promising opportunities.
There are a wide range of potential options on the “menu” of common CPU optimization opportunities. Future iterations of this blog will present many other avenues for potential savings. Many items may not represent opportunities in your environment for one reason or another. But I expect you will find several items worth exploring as potentially applicable which may lead to substantive CPU savings.
In my earlier blog on managing the gap in IT spending and revenue growth, I discussed capacity planning as a process that aims primarily to feed the budget process in your organization, and I emphasized the importance of planning for an efficient system. I also acknowledged the difficulty of forecasting the future. This point is emphasized in some recent news. Glacier National Park is removing signs posted throughout the park because the predicted timing of the disappearance of many glaciers turned out to be incorrect.1
Previously, (again in my first blog referenced above) I introduced one specific opportunity to draw out RNI improvements. There I demonstrated how this example helps close the gap between revenue growth and higher MSU demand growth and demonstrated the value in planning for an efficient system. In this blog I will unpack a few more.
Reducing Demand for IT Savings and Investment
You have a unique opportunity as a mainframe performance and capacity planning analyst to evaluate the performance of your systems and the major consumers of their resources. This gives you an opportunity to enumerate recommendations for IT savings and investment. Thankfully, a few authors acknowledge this important aspect of capacity management.2
When you deliver that 18-month or longer projection for the next budget cycle, try to include some things that bend the projection downward. The forecast might still be wrong, but you will demonstrate your value to the organization with efficiency saving ideas. These will energize you and your team to consider other options for reducing demand so that the company can invest in higher growth areas.
What are some other avenues to reduce demand?
As with many things, speed is important and lower latency per unit means more throughput, better customer satisfaction, and more ‘think time’ to sell into. Look into CICS transaction waits, DB2 waits, Coupling Facility delays, and Disk Delays.
Below are some specific examples:
Evaluate Your Database for Indexes
In extreme cases, improvements close to an order of magnitude might be present in your DB2 CPU / transaction performance. Remember the last time a critical index was dropped? Act on that answer and proactively provide your database experts some reports to encourage them to review the indexes that may provide improvements.
Use your resources to look for the most expensive SQL statements (top 5). Is the expensive SQL causing a huge number of getpages unnecessarily? While you may need more than the following chart of getpage activity to aid your analysis, you may find clues to help you seize opportunities and enjoy showing your colleagues and leaders the improvements you can quantify!
The IntelliMagic Vision reports below show the getpages dropping while the overall business goes up! Telling your boss they are getting more for less with reports like these is always a win-win-win!
Before and After Index Compare of Plan Getpages for a Specific AuthID
Before and After Index Registration Transaction CPU and Registration Rate (transactions)
Explore New Technology Offerings
Exploit compression outside the system (tape system compression and network appliances), at the processor level (zEDC), and now on board with the z15 coprocessors. The investment in newer technology can reduce latency and increase efficiency by reducing general CP demand in your system.
Subtle usage patterns in your application and/or business models may provide avenues to prioritize the work more effectively or simply reduce demand.
Identify Opportunities to Segment Lower Importance Work
Work with the business to identify opportunities to segment lower importance business work and employ strategies to limit demand from those workloads vs. more important work.
Use WLM and software capping to limit the impact on your “loved ones,” the workloads that matter most. These are well-known but underutilized methods for reducing and gating more discretionary work.
For example, digital shopping transaction rates/booking rates vary greatly for some online travel sites. Some sites drive over 1,000 shopping transactions/booking, whereas an airline site might drive fewer than 20 shopping transactions/booking.
If your backend systems segmented the importance of that work rather than treating each shopping call with equal priority, a very likely result would be more profit from less IT infrastructure demand. Making these changes will require a new mindset with some folks in your business, but it has great potential. Furthermore, there are easier paths to take for systems that have true workload management (WLM) capabilities.
It’s an old idea, but it still holds true. Evaluating the big IO drivers related to your primary CPU consumers and evaluating ways to reduce IO will always improve performance and lower CPU demand per transaction. Use buffer pool residency times and Buffer pool Hit % to help you evaluate buffer pool sizing. There are many drilldowns available within IntelliMagic Vision to help you pursue efficiencies with your database counterparts.
Data Sharing Page Residency and Buffer Pool Hit %
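For readers who want to reproduce those two sizing metrics from raw counters, here is a minimal sketch. The definitions used (hit percentage as the share of getpages satisfied without reading a page, and residency time as pool size divided by the page read rate) are commonly used forms and are assumptions here, as are the function names and sample numbers; your reporting tool may calculate them differently.

```python
def buffer_pool_hit_pct(getpages: int, pages_read: int) -> float:
    """Percentage of getpages satisfied from the buffer pool without reading a page in."""
    if getpages == 0:
        return 0.0
    return 100.0 * (getpages - pages_read) / getpages


def page_residency_seconds(pool_buffers: int, pages_read_per_second: float) -> float:
    """Approximate time a page stays in the pool: pool size divided by the page read rate."""
    if pages_read_per_second == 0:
        return float("inf")  # nothing is being read in, so pages are never displaced
    return pool_buffers / pages_read_per_second


# Hypothetical interval values for one buffer pool.
hit_pct = buffer_pool_hit_pct(getpages=1_000_000, pages_read=50_000)
residency = page_residency_seconds(pool_buffers=200_000, pages_read_per_second=400)

print(f"Hit %: {hit_pct:.1f}, residency: {residency:.0f} seconds")
```

A pool with a high getpage rate, a low hit percentage, and a short residency time is usually the first candidate for more buffers.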
Do you have a quick view to evaluate coupling facility false lock contention? The cost of lock manager arbitration of the contention is 100:1 compared to a regular coupling facility lock request. Sometimes, additional coupling facility memory for a structure will reduce false lock contention by increasing the number of lock entries, so that more unique hash values are used.
Coupling Facility Lock Usage
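To see why more lock entries reduce false contention, consider this small simulation. It is an illustration only: the SHA-256 hash, the resource names, and the table sizes are all assumptions, and the real lock structure hashing is different. The point is simply that distinct resources hashed into a larger table collide less often.

```python
import hashlib


def lock_entry(resource_name: str, lock_entries: int) -> int:
    """Map a resource name to a lock table entry using a generic hash (illustrative only)."""
    digest = hashlib.sha256(resource_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % lock_entries


def false_contention_count(resources: list, lock_entries: int) -> int:
    """Count distinct resources that land on an entry already used by a different resource."""
    used_entries = {lock_entry(name, lock_entries) for name in resources}
    return len(resources) - len(used_entries)


# 1,000 distinct (hypothetical) resources held at the same time.
resources = [f"PAGE.{i:06d}" for i in range(1000)]

small_table = false_contention_count(resources, 1024)
large_table = false_contention_count(resources, 1024 * 1024)

print(f"Collisions with 1,024 entries:     {small_table}")
print(f"Collisions with 1,048,576 entries: {large_table}")
```

With the same set of resources, the larger table leaves far fewer of them sharing an entry, which is the effect that additional structure memory is buying.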
Know Your Top Workloads.
For example, a quick modification of one of the IntelliMagic Vision out-of-the-box reports provides a top 5 list of CP usage for a given day.
Qualified Top 5 List of CP Usage for a Given Day
You may already know what those are, but do you know what’s driving them to grow? Perhaps a new one has made it into the top 5 and you have other concerns. The time of day and other details are easily accessed with drilldowns, and report editing features will help you take action.
Recognize that when evaluating the new software pricing models for z/OS, every CPU cycle matters with Tailored Fit Pricing. Finding savings in those big buckets first is a good place to begin.
Review Abended Jobs
What about the accumulated CPU consumed by jobs that abend, resulting in wasted CPU time and re-runs?
Here’s a quick view (out of the box) of resource consumption for abended jobs.
Resource Consumption for Abended Jobs
Reviewing this type of information and details from available drilldowns can help you act on preventable re-work and will help drive out unnecessary re-runs and reduce overall demand.
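If you want to build a similar summary yourself, the aggregation is straightforward. This is a minimal sketch that assumes you have already extracted job-level records into dictionaries; the field names and sample jobs are hypothetical and do not reflect any particular record layout.

```python
from collections import defaultdict


def wasted_cpu_by_job(job_records: list) -> list:
    """Total the CPU seconds consumed by abended runs, per job name, largest first."""
    totals = defaultdict(float)
    for record in job_records:
        if record["abended"]:
            totals[record["job_name"]] += record["cpu_seconds"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical job-level records for one day.
records = [
    {"job_name": "PAYROLL1", "cpu_seconds": 120.0, "abended": True},
    {"job_name": "PAYROLL1", "cpu_seconds": 125.0, "abended": False},
    {"job_name": "EXTRACT7", "cpu_seconds": 40.0, "abended": True},
    {"job_name": "EXTRACT7", "cpu_seconds": 35.0, "abended": True},
    {"job_name": "REPORT02", "cpu_seconds": 10.0, "abended": False},
]

for job_name, cpu_seconds in wasted_cpu_by_job(records):
    print(f"{job_name}: {cpu_seconds:.1f} CPU seconds consumed by abended runs")
```

Sorting by wasted CPU rather than by abend count keeps the focus on the re-runs that actually cost something.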
Conclusion: Bending the IT Demand Curve Down
Mainframe capacity planning is an important budgetary input. If you ensure your performance is in order and include some efficiency recommendations, you will help keep your organization lean and productive.
Look for opportunities to improve performance (efficiency) as you produce the artifacts necessary for the budget process.
We discussed several ways to reduce latency and lower demand in this brief overview. There are likely other ways that may be specific to your organization. Ask questions of your business and application colleagues, and of partners like IntelliMagic.
We would welcome an invitation from you and your team to help you demonstrate your value in the organization and bend that IT demand curve down.
How does your MSU growth compare to your revenue growth? As the new year begins, how are your goals aligning with the business?
While this is purely a hypothetical example that plots typical compute growth rates (~15%) against business revenue growth (~4%), many of you will recognize the widening gap. Whether it is MSUs or midrange server costs vs. revenue, this graphic is commonplace.
The simplification above ignores seasonality and other variables that may also impact growth, but it demonstrates the problem nearly every CIO has. How can the gap between IT spending and revenue and/or profit be managed so that cash is freed up for the other key components of the business that build revenue and profit?
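The compounding behind that widening gap is easy to reproduce. This sketch indexes both series to 100 in year 0 and applies the hypothetical ~15% and ~4% annual rates mentioned above; the five-year horizon and the function name are my own choices for illustration.

```python
def indexed_growth(annual_rate: float, years: int) -> list:
    """Return values indexed to 100 at year 0, compounding at annual_rate each year."""
    return [100.0 * (1.0 + annual_rate) ** year for year in range(years + 1)]


compute = indexed_growth(annual_rate=0.15, years=5)
revenue = indexed_growth(annual_rate=0.04, years=5)

for year, (c, r) in enumerate(zip(compute, revenue)):
    print(f"year {year}: compute {c:6.1f}  revenue {r:6.1f}  gap {c - r:5.1f}")
```

After five years at those rates, compute demand has roughly doubled while revenue is up by a little over a fifth, and the gap grows every year.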
Marginal Capacity Plans
While the overall unit cost of computing continues to drop due to the fantastic advances in the technology over the years, few involved in performance and capacity can claim a long term drop in peak demand ‘resource costs’ over time due to changes your organization has made. In fact, most would share something similar to the above graphic over the last few years. Is your latest projection going to look different?
The actual values observed in 2-5 years for resource demand and IT costs, as compared to prior forecasts, are often off by large margins. These deviations led a colleague to make the following statement in reference to a new capacity plan: “the only thing I can be certain of with this plan is that it will be wrong”. While there are many possible reasons that are outside your control, it is a cold reality.
And it might ‘hurt’ our pride a bit, but there is some truth there.
Some advice to capacity planners
Don’t take it personally.
Planning is useful primarily as a budgeting tool – treat it as such. Don’t expect a beautiful mathematical modeling exercise that predicts the March Madness bracket winners correctly – because it won’t!
The primary drivers influencing changes are typically outside your control. Don’t ignore them; list them, and try to obtain some valuable insights from your business and application partners. Focus on the top two or three (80/20 rule) and lump the rest into an ‘organic growth’ bucket. It’s a long tail, and quantifying all of it will cost you more in time and money than you can save by budgeting better for 25 different microservices that are less than 1% of the budget each.
Tell the story with good graphics and succinct insights.
Identify useful alternatives to just buying more CPUs, memory, disks and network bandwidth.
Good performance is primary.
What is Good Performance?
Sometimes we in the mainframe arena take “good” performance for granted. We have WLM, Capacity on Demand, zIIPs, PAVs, and on down the list.
“Good performance” is meeting all of those response time goals while being efficient. Two people can drive from Denver to San Francisco with similar comfort in an SUV at 8 MPG with some bad spark plugs or in a 48 MPG hybrid vehicle.
Planning your trip does involve the workload and capacity of your current vehicle, but given your situation, our focus is on efficiency. We want to help you stretch that 8 miles per gallon (MPG) to 25 MPG for the SUV. What should the focus be in mainframe performance before we produce the plan?
Some efficiency focused metrics worth pursuing include things like:
Just a quick refresh on RNI: a lower value indicates improved processor efficiency for a constant workload. A lower RNI drives lower MSU for the same workload, and the results can be significant!
There are several ways to drive for lower RNI, and the reference above gives you several ideas on where to start. Look at how a small change in a performance metric can alter the long-term capacity plan!
Performance Led Capacity Plan
While you don’t often receive a gift like this in your performance options, keep informed.
Part of your capacity planning regimen should be working with your colleagues in systems, DB2, CICS, and applications to solicit changes that might deliver a welcome demand drop and slower future growth. A small change in the rudder can move a mighty ship!
System Efficiency and Your Capacity Planning Process
Capacity planning is an important input to the budget process. Efficiency recommendations will help you keep your organization lean and productive. Look for some opportunities to improve efficiency as you produce the artifacts necessary for the budget process.
In this first blog, I have provided you one efficiency check to consider. Are there better ways to configure and manage your systems to reduce your RNI? In my next blog, I will open the door for some other ideas to evaluate efficiency as you prepare your capacity plan for the future mainframe growth. Feel free to reach out to us for more insights as we develop part two of this post.