This doesn’t happen often, but let’s assume you have been assigned to find some cost savings, and you have tools to employ. Finding ‘gold’ in a productive mine is fun! Your experience with other systems, in this industry and others, has given you some pretty good ideas of what to look for.
Frequently, CICS ‘tuning’ exercises focus on unnecessary TCB switches. Without getting into what these are and why they might be happening, what makes a good candidate for improvement?
This report has a clear winner for biggest consumer, but after #1, the results tail off quickly. There’s always the possibility it might be ‘fool’s gold’, but you can’t ignore the size of the first item compared to the others.
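The Pareto view behind that observation is easy to reproduce. Below is a minimal sketch; the transaction names and CPU totals are illustrative, not taken from the actual report:

```python
# Hypothetical SMF 110 roll-up: total CPU seconds by transaction ID.
# Names and numbers are illustrative only.
cpu_by_tran = {
    "CLM1": 5400, "ORD2": 900, "INQ3": 700, "PAY4": 500, "MISC": 400,
}

def pareto(totals):
    """Rank items by consumption, largest first, with cumulative share."""
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    grand = sum(totals.values())
    cum = 0.0
    rows = []
    for name, value in ranked:
        cum += value
        rows.append((name, value, round(100 * cum / grand, 1)))
    return rows

for name, secs, cum_pct in pareto(cpu_by_tran):
    print(f"{name:5} {secs:6} CPU-sec  cumulative {cum_pct}%")
```

With numbers shaped like these, the top item alone carries well over half the total, which is exactly the “clear winner, then a fast tail-off” pattern worth mining.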
There’s sometimes a debate about whether TCB switch time or TCB switch count is the key item of focus for tuning efforts. Since both metrics are derived from the same SMF field, it’s not surprising that there is a strong correlation between them, as seen below. It’s also important to recognize that the big hill in the middle of the report is the daytime peak in this transaction’s rate. Analysts should always be mindful of the distinction between correlation and causation: it may not be the what, but the how, that’s important for TCB switches.
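That correlation claim is easy to check directly once you have interval data. A small sketch, assuming hourly samples for one transaction (the numbers are made up to mimic a daytime peak):

```python
# Hourly interval samples (illustrative): TCB switch counts and the
# corresponding switch wait times in seconds for one transaction.
switch_count  = [1200, 5400, 9800, 14500, 15200, 9100, 4200, 1500]
switch_time_s = [0.9,  4.1,  7.6,  11.2,  11.9,  7.0,  3.3,  1.2]

def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson_r(switch_count, switch_time_s)
print(f"r = {r:.3f}")  # near 1.0 when the two track each other closely
```

A high r here confirms the two metrics move together, but, as noted above, it says nothing about *why* the switches occur.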
Figure 2: Transaction Rate compared with TCB Switch count and TCB switch time
If TCB switch time is a strong correlate of CPU consumption, then the next Pareto chart can help us with some early verification of the quality of the nuggets we are looking for. The top CPU consumer, call it Claims, from the first Pareto chart may only be fourth or fifth in the TCB Switch Wait Time Pareto chart, so that keeps us on track for a more in-depth view of what TCB switch wait is occurring, and how.
Figure 3: Pareto Chart-CICS transaction TCB Switch Time
The CICS and application experts typically apply their knowledge and experience to find several options for reducing CPU time. I’ll dig deeper into that in the paragraph below, but the next report shows some remarkable improvements!
Figure 4: Before-and-after results show a dramatic drop in CP Processor Time and TCB Switch Wait Time
After the change, TCB switches were greatly reduced, saving up to 2,000 CPU seconds at peak for essentially the same transaction workload.
CICS Transaction Details and CPU drivers
Sometimes a fresh set of experienced eyes can spot things that should be done and are easily done. With hundreds of CICS SMF metrics, what are some that we haven’t covered yet that may drive CPU and/or TCB switches?
Some that were involved in these savings include VSAM strings, buffers, and just plain I/O. As this IBM reference suggests, some of this is configuration-related and may have been relatively easy to implement. I/O is often overlooked as a driver of CPU, but it does drive CPU, and avoiding I/O through application and configuration changes should always be on the list of savings ideas. Trading memory for MSUs $ave$.
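One simple way to scan for string-constrained files is to compare peak active strings against the defined STRINGS value and look for string waits. The sketch below uses invented field names, not actual SMF 110-2 statistics labels:

```python
# Sketch: flag VSAM files whose peak active strings approach the defined
# STRINGS value -- a common precursor to string waits and extra I/O.
# File names and dictionary keys here are illustrative assumptions.
files = [
    {"file": "CLMMAST", "strings_defined": 5,  "peak_active_strings": 5, "string_waits": 240},
    {"file": "POLMAST", "strings_defined": 20, "peak_active_strings": 6, "string_waits": 0},
]

def at_risk(stats, threshold=0.8):
    """Return files using >= threshold of defined strings, or with waits."""
    return [
        f["file"] for f in stats
        if f["peak_active_strings"] >= threshold * f["strings_defined"]
        or f["string_waits"] > 0
    ]

print(at_risk(files))
```

Files flagged this way are the candidates for raising STRINGS or adding data/index buffers, i.e., trading memory for MSUs.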
Below are a few reports highlighting the changes made over time. Note the critical dates around December 2022 and May 9, 2023. The first report shows a change in mid-December where the number of CICS active strings begins climbing and progressively grows worse until early May.
Figure 5: CICS File Control Statistic – Active Strings (current active updates against a file)
The application changes included CICS changes intended to protect performance, but there aren’t always easy ways to ensure performance is protected and costs are controlled. The report below shows that buffer changes went in with the application changes.
Figure 6: Defined file control data buffers by region
Big Changes in May
As with many efforts, there’s often more than one cause, and more than one action required to remedy them. In this situation, a few more actions may still be required, but action was taken to reduce VSAM browse requests. This large drop coincides with the drop in CPU and TCB switch time.
Figure 7: CICS VSAM File Browse Requests
There are more interesting details, and with so many CICS metrics and additional insights, it’s hard to keep things concise. The intent here is to keep it brief and highlight a few of the interesting points. The upshot for the system is lower processor demand. In addition, a few key transactions are much more responsive: the response time dropped from a peak of 700 ms down to below 10 ms, while the CPU time went from nearly 40 ms down to under 6 ms per transaction.
Figure 8: Claims transaction profile
Sharing the Success
The tendency of experts can be to share so many details on what was done that we lose track of the important message: how the changes reduced TCB switch time, which in turn saved a ton of CPU, up to 2,000 seconds. This part of the story was intended to share some of the details and provide some incentive to proactively look for changes in workloads before problems arise. I look forward to sharing a bit more on effectively sharing the success in an upcoming blog. We would love to help you find the ‘nuggets’ quicker and would welcome an opportunity to share the kinds of insights that will help you deliver more for less!
Most of us can identify with situations where we are deluged with data, and we face challenges wading through it all and pulling out the value. A scenario that comes to mind is the proliferation of TV shows now available. As if the shows on the hundreds of networks available on cable TV weren’t enough, the challenge is compounded by the plethora of programming now available through streaming services.
Current cable TV “GUIDE” functionality (where you scroll page by page through a list of hundreds of shows) certainly doesn’t cut it. The streaming services attempt to provide some help through lists of shows divided into categories like drama or comedy, but unless you know specifically what you are looking for in advance, their search capabilities still leave a lot to be desired.
Analyzing CICS Performance
Mainframe performance analysts can consider themselves fortunate to have access to more extensive metrics than on any other platform, but often find themselves facing a similar uphill battle trying to sift through gigabytes if not terabytes of SMF data produced daily at their sites. This task of deriving value from massive quantities of data can be particularly challenging for analysts charged with effectively analyzing CICS performance from the SMF 110 CICS Monitoring Facility data.
It is clear this cannot be accomplished by analysts sifting through tabular data presented in countless static reports. To successfully achieve this, there are conceptually two ways you need to put the power of the computer to work for you.
One is for the computer to be automatically checking key metrics associated with the thousands of elements in your infrastructure to identify hidden risks to availability.
A second is to exploit a powerful graphical user interface to dynamically interact with the data, leveraging context-sensitive drilldowns enabling you to determine based on the current chart what the next logical analytical step should be.
In this blog I am going to focus on that second aspect, the analytical value provided as a highly flexible user interface makes data easily accessible.
Analyzing Time Measurements for CICS Transactions
Of the nearly 400 fields in the SMF 110-1 records, approximately 100 represent various time measurements (e.g., CPU, response, wait), so that is clearly an important focus for analysis. As a starting point, one might choose top transactions by total CPU consumed (typically called “CPU intensity”) or by transaction volume. The latter appears in the screenshot below generated using IntelliMagic Vision.
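Both starting points can be expressed as simple rankings over per-transaction summaries. A minimal sketch, with invented transaction IDs and numbers (not taken from the screenshot):

```python
# Two common starting points for a "top transactions" view:
# by volume, and by CPU intensity (count x average CPU per transaction).
# All names and values below are illustrative assumptions.
trans = [
    {"tran": "CLM1", "count": 50_000,  "avg_cpu_ms": 38.0},
    {"tran": "INQ9", "count": 400_000, "avg_cpu_ms": 1.5},
    {"tran": "UPD2", "count": 90_000,  "avg_cpu_ms": 6.0},
]

def top_by(metric, rows):
    """Sort transaction summaries by the given metric, largest first."""
    return sorted(rows, key=metric, reverse=True)

by_volume    = top_by(lambda r: r["count"], trans)
by_intensity = top_by(lambda r: r["count"] * r["avg_cpu_ms"], trans)

print("top by volume   :", by_volume[0]["tran"])
print("top by intensity:", by_intensity[0]["tran"])
```

Note how the two rankings can disagree: a high-volume lightweight transaction tops one list, while a heavier, lower-volume transaction tops the other, which is why both views are worth a look.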
From here, we can drill down to examine the response time of a specific transaction over the indicated time interval, divided into the commonly-defined categories for CICS response times.
The ability to select other metrics for potential correlation can also facilitate analysis. In the following example, the report was customized to add “transaction rate” (the blue-green line) on the right axis (any variable could be selected).
Further analysis into the “Total I/O Wait Time” category may be warranted and can be achieved with a single click. In this example, “Wait for Control at end of MRO” constitutes almost all the wait time.
CICS also reports on the components of CPU in great detail, as reflected in the legend of this chart.
“Before and after” analysis (e.g., of the impact of an application release) on CPU per transaction can quickly be determined by converting this into an area chart and selecting any interval in the database for comparison (requiring only a few clicks).
It may be advantageous to begin analysis from the perspective of response times for high volume transactions (as in this example).
Analysis for a selected transaction could continue through drilldowns into response times by CICS regions and/or by z/OS systems (pictured below) to determine if response times varied significantly across the sysplex (in this case response times were generally similar).
Analysis could also begin from the viewpoint of CICS regions, with similar drilldowns including by transaction, response time categories and components, CPU time, z/OS system, etc. (not shown here due to space limitations).
Making CICS Data Manageable
In addition to the time-based metrics previously considered, there are twenty other categories of metrics provided in the SMF 110-1 records, as seen in the column on the right.
Easy visibility to explore this data and context-sensitive drilldowns for detailed analysis make these metrics readily accessible and positions the analyst to quickly derive value from them.
Whether you’re searching for the right TV show or the response time of certain CICS transactions, you need an easy and effective way to filter out the noise and rapidly find what you’re looking for.
From high level overviews to detailed, intuitive drilldowns, IntelliMagic Vision takes the previously unmanageable quantity of CICS data and makes it manageable. Rather than sifting through page after page or report after report trying to identify the significant data points, your analysis can proceed in a direct and expeditious manner to find key insights and actionable information.
This is one of the key reasons why over the course of my decades-long career spent primarily as a performance analyst I found IntelliMagic Vision to be a game-changing tool enabling me to perform analysis at the highest level.
How does your MSU growth compare to your revenue growth? As the new year begins, how are your goals aligning with the business?
While this is purely a hypothetical example that plots a typical compute growth rate (~15%) against business revenue growth (~4%), many of you would agree with the widening gap. Whether it is MSUs or midrange server costs vs. revenue, this graphic is commonplace.
The simplification above ignores seasonality and other variables that may also impact growth, but it demonstrates the problem nearly every CIO has: how can the gap between IT spending and revenue and/or profit be managed, so that cash can instead be deployed on the other key components of the business that build revenue and profit?
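Compounding those two hypothetical rates makes the widening gap concrete. A quick sketch, indexing both curves to 100 in year zero:

```python
# Compound the hypothetical growth rates from the chart: ~15%/yr compute
# demand vs ~4%/yr revenue, both indexed to 100 in year 0.
def project(start, rate, years):
    """Return a compound-growth series rounded to one decimal."""
    return [round(start * (1 + rate) ** y, 1) for y in range(years + 1)]

msu     = project(100, 0.15, 5)
revenue = project(100, 0.04, 5)

for y, (m, r) in enumerate(zip(msu, revenue)):
    print(f"year {y}: compute {m:6.1f}  revenue {r:6.1f}  gap {m - r:5.1f}")
```

After only five years the compute index is roughly double while revenue has grown about a fifth, which is the shape of the graphic above.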
Marginal Capacity Plans
While the overall unit cost of computing continues to drop due to the fantastic advances in the technology over the years, few involved in performance and capacity can claim a long-term drop in peak-demand resource costs due to changes their organization has made. In fact, most would share something similar to the above graphic over the last few years. Is your latest projection going to look different?
The actual values observed in 2-5 years for resource demand and IT costs, as compared to prior forecasts, are often off by large margins. These deviations led a colleague to make the following statement in reference to a new capacity plan: “the only thing I can be certain of with this plan is that it will be wrong.” While there are many possible reasons that are outside your control, it is a cold reality.
And it might ‘hurt’ our pride a bit, but there is some truth there.
Some advice to capacity planners
Don’t take it personally.
Planning is useful primarily as a budgeting tool – treat it as such. Don’t expect a beautiful mathematical modeling exercise that predicts the march madness bracket winners correctly – because it won’t!
The primary drivers influencing changes are typically outside your control. Don’t ignore them; list them, and try to obtain some valuable insights from your business and application partners. Focus on the top two or three (80/20 rule) and lump the rest into an ‘organic growth’ category. It’s a long tail, and quantifying all of it will cost you more in time and money than you can save by budgeting better for 25 different microservices that are each less than 1% of the budget.
Tell the story with good graphics and succinct insights.
Identify useful alternatives to just buying more CPUs, memory, disks and network bandwidth.
Good performance is primary.
What is Good Performance?
Sometimes we in the mainframe arena take “good” performance for granted. We have WLM, Capacity on Demand, zIIPs, PAVs, and on down the list.
“Good performance” is meeting all of those response time goals while being efficient. Two people can drive from Denver to San Francisco with similar comfort in an SUV at 8 MPG with some bad spark plugs or in a 48 MPG hybrid vehicle.
Planning your trip does involve the workload and capacity of your current vehicle, but given your situation, our focus is on efficiency. We want to help you stretch the SUV’s 8 miles per gallon (MPG) to 25 MPG. What should the focus be in mainframe performance before we produce the plan?
Some efficiency-focused metrics worth pursuing include things like:
Just a quick refresher on RNI (Relative Nest Intensity): a lower value indicates improved processor efficiency for a constant workload. A lower RNI drives lower MSUs for the same workload, and the results can be significant!
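For reference, IBM publishes RNI formulas derived from CPU MF (SMF 113) counters. The sketch below uses the z13/z14-era coefficients as an illustration; the weights differ by machine generation, so check the formula for your hardware before relying on the numbers:

```python
# Illustrative RNI calculation using the z13/z14-era CPU MF formula.
# Coefficients are generation-specific; treat these weights as examples.
def rni(l3p, l4lp, l4rp, memp):
    """l3p..memp: percentage of L1 misses sourced from shared L3,
    local L4, remote L4, and memory, respectively."""
    return 2.3 * (0.4 * l3p + 1.6 * l4lp + 3.5 * l4rp + 7.5 * memp) / 100

# A workload resolving more of its misses from memory scores a higher RNI:
print(round(rni(85, 10, 2, 3), 2))
print(round(rni(70, 12, 6, 12), 2))
```

The weighting tells the story: a miss satisfied from memory costs far more than one satisfied from nearby cache, so anything that improves cache residency pulls RNI, and MSUs, down.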
There are several ways to drive for lower RNI, and the reference above gives you several ideas on where to start. Look at how a small change in a performance metric can alter the long-term capacity plan!
Performance Led Capacity Plan
While you don’t often receive a gift like this among your performance options, stay informed.
Part of your capacity planning regimen should be working with your colleagues in systems, DB2, CICS, and applications to solicit changes that might deliver a welcome demand drop and slower future growth. A small change in the rudder can move a mighty ship!
System Efficiency and Your Capacity Planning Process
Capacity planning is an important input to the budget process. Efficiency recommendations will help you keep your organization lean and productive. Look for some opportunities to improve efficiency as you produce the artifacts necessary for the budget process.
In this first blog, I have provided you one efficiency check to consider. Are there better ways to configure and manage your systems to reduce your RNI? In my next blog, I will open the door to some other ideas for evaluating efficiency as you prepare your capacity plan for future mainframe growth. Feel free to reach out to us for more insights as we develop part two of this post.