How A Db2 Newbie Quickly Spotlights the Root Cause of a Db2 Slowdown
Jack Opgenorth - 15 June 2023
Leveraging Familiarity with a Solution
I’ll be watching my grandkids closely this summer as they jump in our backyard pool. The older ones have the confidence to jump off the diving board because they have learned to swim on their own and have been in the pool a few times. However, the younger ones still approach the whole experience with real caution. I’m thankful for that. The main difference for them is experience-based confidence. This is a bit like performance analysts when we are asked to analyze problems that may be outside our expertise or comfort zone.
A z/OS systems performance person has a high-level understanding of the CICS, MQ, and Db2 subsystems, but when asked to find the root cause of something that looks like a Db2 delay, you may feel uncomfortable performing a deep dive into unfamiliar waters.
Fortunately, when recently asked by a customer during our monthly status update to look at a specific time frame and identify the culprit of a recent Db2 slowdown, I wasn’t afraid to take the dive because I had access to super powerful ‘performance goggles.’ There were only 5 minutes left in the call, and it was related to a subsystem, Db2, that I’m familiar with but am by no means an expert in.
First Step: Approach the Problem
The questions (when, how long, what, and so on) were quickly answered by the customer as they related to the issue. Equipped with a couple of answers, an ad hoc report is just a few clicks away with a modern Db2 performance solution.
Db2 has one of the most voluminous sets of SMF data available for z/OS subsystems. For most of our customers, it is easily equal to all of the other SMF data combined. Some Db2 customers generate one terabyte/day of compressed data.
The big split between statistics (SMF 100) and accounting (SMF 101) data gives two primary dimensions in which to focus your analysis. Starting down a statistics path first, visual cues of expert-selected metrics and thoughtful sets of reports (thumbnails) with subtitles helped me quickly navigate through the details.
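As a rough illustration of that split, here is a minimal Python sketch of routing parsed records into the two analysis dimensions by SMF record type. The record structure is hypothetical; a real parser must follow the documented SMF record layouts.

```python
# Sketch: routing parsed SMF records into the two primary Db2 analysis
# dimensions. Type 100 = Db2 statistics (subsystem-wide), type 101 =
# Db2 accounting (per-thread/application). Record fields are illustrative.
from collections import defaultdict

def split_db2_records(records):
    """Group parsed records by SMF type into statistics vs. accounting."""
    buckets = defaultdict(list)
    for rec in records:
        if rec["smf_type"] == 100:
            buckets["statistics"].append(rec)
        elif rec["smf_type"] == 101:
            buckets["accounting"].append(rec)
        else:
            buckets["other"].append(rec)
    return buckets

records = [
    {"smf_type": 100, "member": "DB2A"},
    {"smf_type": 101, "member": "DB2A", "plan": "PLANX"},
    {"smf_type": 110, "member": "CICS1"},   # e.g. a CICS record, ignored here
]
groups = split_db2_records(records)
print(len(groups["statistics"]), len(groups["accounting"]))  # 1 1
```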
I had a hunch that the description fit a potential lock wait or something along those lines. By the way, I don’t think I could define a latch wait from a lock wait accurately, but a wait is a delay, and I was interested in delays.
Choosing the IntelliMagic health insights report for locking and latching (seen in Figure 1) first helped focus the diagnosis, with the additional benefit that the customer representatives could see the visual cues through a screen share.
When a list of 20 data sharing groups showed up for a particular time frame, I could immediately focus on two or three hot spots, noted by the red exception bubbles, which aligned with the details provided. With solutions that only have a tabular view to review the data, it’s easy to miss something that should be standing out.
Figure 1: Finding the best path for quick analysis
Drilling down into time frames, Sysplex, systems, and further details, I was able to quickly narrow the analysis from hundreds of options down to just a few. The highest-rated Db2 infrastructure metric showed some stress for one particular data sharing group in the latch conflict suspends.
Both members of the data sharing group were impacted. Since we had little time left in the call, it was important to take a quick look at the details from the database application side, so I poked into the Db2 accounting data reports.
Application Orientation of Response Delays
Db2 accounting data has a number of dimensions to dive into. The statistics had me focused on latch suspends, so I chose to start with some Db2 transaction response times and compare them to last week. Again, the flexibility to navigate and control output greatly simplified the activity so that we could remain focused on what might be the cause of the delay.
The view that kept us on track showed a precipitous drop in the commit rate and, at the same time, a heavy spike in response time: a 10-fold increase in the 15-minute average at the time in question.
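That kind of week-over-week comparison can be sketched as follows. The interval values and the 10-fold threshold are illustrative, not the customer's actual numbers.

```python
# Sketch: flag a response-time spike by comparing each 15-minute interval
# to the same interval one week earlier. Data values are illustrative.

WEEK = 7 * 24 * 4  # number of 15-minute intervals in one week

def find_spikes(series, threshold=10.0):
    """Return indices where a value is at least `threshold` times the
    value from the same 15-minute interval one week earlier."""
    spikes = []
    for i in range(WEEK, len(series)):
        baseline = series[i - WEEK]
        if baseline > 0 and series[i] / baseline >= threshold:
            spikes.append(i)
    return spikes

last_week = [2.0] * WEEK           # ~2 ms average suspension last week
this_week = [2.1] * WEEK
this_week[100] = 25.0              # one interval spikes more than 10-fold
spikes = find_spikes(last_week + this_week)
print(spikes)  # [772], i.e. WEEK + 100
```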
Figure 2: Commit Rate drops & Db2 Suspension times skyrocket
This was looking promising, so by having automatic options to drill into the last bit of detail (and there is plenty of it), we could see that latch contention due to data sharing index splits was the primary cause of delay (suspension) during the spike.
Quick navigation and easy-to-choose options helped me quickly identify for the customer a major contributor to a problem that had already consumed many hours of analysis. Links to the reports and a short email after the call gave the customer real evidence of the delay causes for the post-mortem call that was starting in minutes.
Figure 3: Homing in on the Primary Issue
Simplicity Searching the Complex
The complexity of Db2 normally requires serious experience to understand the details when looking into delays. However, having experience with an intuitive and intelligent analysis solution significantly improves the productivity of your analysis by offering good starting points, helpful filtering options, and easy navigation.
There are over 20 categories for Db2 class 3 suspend time; there are approximately 30 different categories of latch suspend time. Visualizing many of these together in one tabular report can quickly become overwhelming.
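One way to see why a top-N view helps: collapse the many categories to the few dominant contributors before reading any detail. The category names and interval totals below are made up for illustration.

```python
# Sketch: rank suspend-time categories so the dominant contributors stand
# out instead of being buried in a ~30-column tabular report.

def top_suspends(suspend_seconds, n=3):
    """Rank suspend categories by total time and report each one's share."""
    total = sum(suspend_seconds.values())
    ranked = sorted(suspend_seconds.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, secs, secs / total) for name, secs in ranked[:n]]

# Illustrative category names and interval totals (not real data).
suspends = {
    "index split (data sharing)": 840.0,
    "log write latch": 95.0,
    "EDM pool latch": 40.0,
    "other latch classes": 25.0,
}
for name, secs, share in top_suspends(suspends):
    print(f"{name}: {secs:.0f}s ({share:.1%})")
# index split (data sharing): 840s (84.0%), followed by the minor categories
```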
Solutions that provide separate reporting in thumbnails that can be quickly scanned visually (such as IntelliMagic Vision) are much more efficient. Good solutions make the investigation process easier, more approachable, and more productive for your team. This also provides more satisfying work for those doing the analysis.
I much prefer getting to the point where I can research what the index split latch suspend means, rather than creating and combing through thousands of tabular reports looking for variances.
So… What is an Index Page Split?
Robert Catterall (IBM) has a nice blog on a very similar issue. He is a Db2 performance expert. If you are experiencing some of these types of suspends in your environment, he also has some recommendations on actions you could take. Quoting directly from his blog here:
“An index page split occurs when Db2 has to insert an entry in an index page because of an insert (or an update of an indexed column) and that page is full. In that situation, a portion of the entries in the page will be moved to what had been an empty page in the index, so that there will be room in the formerly-full page for the new entry. What does Db2 data sharing have to do with this (and in the environment about which I’m writing, Db2 is running in data sharing mode)? In a data sharing system (versus a standalone Db2 subsystem), an index page split action has a greater impact on throughput because it forces a log-write operation.”
The logical next step actions take a bit more space than we have here, and since Robert has already outlined them well in his blog, I would refer to that if you need to investigate further.
Experiential Learning
A familiar toolset encourages one to explore. This is true in all kinds of things but is also true for performance analysis. While I’m not close to being a Db2 expert, my familiarity with a flexible solution that is easy to navigate has enabled me to learn and explore other subsystems within the z/OS infrastructure much more quickly and seamlessly. This encourages me to deepen my understanding of less familiar subsystems within z/OS and helps me become more skilled in these areas.
Rather than using my time to develop expertise in a subsystem specific solution, I can invest that time in reading more details about complex subsystem infrastructure details that drive performance. This is a great way to stay fresh and focused on the next challenge that arises.
You May Also Be Interested In:
News
IntelliMagic Vision Adds SMF Field Name Mapping to Live Report Editor
February 27, 2023 | IntelliMagic Vision version 12.1 provides the capability to easily reference and add variables by SMF field names.
News
IntelliMagic Vision Version 12.0 Enhances Collaboration and Training with New Shareable Dashboard Templates
February 6, 2023 | By introducing shareable Dashboard Templates, a platform is created for exchanging technical knowledge on the various z/OS components within the IntelliMagic Vision expert user community.
Blog
Profiling zHyperLink Performance and Usage
In this blog, we demonstrate how to profile zHyperLink performance and usage by reviewing one mainframe site’s recent production implementation of zHyperLink for reads, for Db2.
Db2 GETPAGE Efficiency Integrating Dataset Performance (42.6) and Db2 IFCID 199 Data
This video shows how integrating Db2 IFCID 199 and SMF 42 data can be leveraged to create a view of overall Db2 GETPAGE efficiency, reflecting the percentages of GETPAGES that are resolved with a buffer pool hit, that result in an I/O but are satisfied by a disk cache hit, and the remaining percentage that generate an actual disk I/O.
More Integrated Visibility Resources
- Leveraging XCF Message Activity for CPU Efficiency
- Troubleshooting WLM Missed Goals with CICS Transaction Data
- Address Space and Db2 Accounting Data
- Dataset Performance (42.6) and Db2 Buffer Pools & Databases (SMF 102/IFCID 199)
- Db2 GETPAGE Efficiency – Dataset Performance (42.6) and Db2 IFCID 199
- Elapsed Time Profiles by Correlation ID: CICS Transaction (110.1) and Db2 Accounting (101) Part 1
- Analysis of CPU By Plan: CICS Transaction (110.1) and Db2 Accounting (101) Part 2
- Insights You Can Gain from Integrated Visibility Across Types of SMF Data
Video Transcript
Okay, another way to leverage integrating this data between the IFCID 199 and the SMF 42 data is this view of overall buffer efficiency, GETPAGE efficiency. In other words, what percentage of my GETPAGES are resolved with a buffer pool hit, in red here, right? That’s my ideal situation. What percentage result in an IO, but it’s a cache hit, that’s in blue. And then what percentage actually end up being an IO and it’s a miss, so I actually have to do a disk IO.
So, let’s go ahead and look at that information here. Break it out by buffer pool size first, and then now let’s start out by buffer pool. So, you know, here’s that profile across the buffer pools and initially this is sorted as it comes in the order of most GETPAGES.
And so if we wanted to, another way to look at this data would be to look at it by IO rate. Often in buffer pool tuning methodology. Sometimes the focus is on, you know, where am I doing the most IO’s today? Cuz I’m trying to minimize those. So if I do that, I come with a similar view, just kind of the first two pairs of buffer pools, just kind of swapped places when we did that.
All right, so let’s take this again, we saw earlier buffer pool 20 was the one that we’re doing the most IO’s to. So let’s look at that over time. And I’m gonna, I’m gonna remove the buffer pool hits for right now. So I’m, now, I’m just looking at the times I had to do an IO, the percentage of the total GETPAGES that resulted in an IO.
And then also, for context, let’s put the IO rate on here so we have an idea how many IO’s we’re talking about. And then finally, I’m gonna just expand this. Instead of looking at a single day, I’m gonna look at a full week. I’ve got a week’s worth of data in this demo database. So here we can see the pattern in terms of the number of IO’s and, for the total GETPAGES, what percentage are cache hits and cache misses. Much of the time around 10 to 12% were misses at the buffer pool, and then I had to do an IO, but in most cases here those were cache hits.
Okay, and then another way to look at that data, instead of by buffer pool, we could look at it by Db2 database. And same, you know, same kind of outlook. And again, we saw earlier this database here, this fourth one, was the one that had by far the most IO’s of any database. And we can see here from a percentage point of view that slightly more than 10% of the total GETPAGES end up turning into an IO.
Okay, let’s again, let’s look at that profile over time. And again, I’m gonna remove the buffer hits so I can see the picture a little bit better. And then again, I’m gonna add in the IO rate here. And again, if I look at this data across the week, I see there was a dramatic drop during the week in the IO rate. From other sources, we learned that there was a SQL tuning change that happened during the week that significantly reduced GETPAGES and thus IO’s, so the percentage of GETPAGES that turned into an IO is a little bit higher, but the rate of IO’s had decreased dramatically.
And then let’s actually look at the GETPAGE rate here. So now I see the GETPAGES and the IO’s kind of together. And the last, I’ve got midnight spikes going on here, so I’m gonna just kind of zero in on the day shift. So we can see here that the rate of GETPAGES dropped dramatically as a result of that SQL tuning change, which is goodness. And it had lots of other good downstream effects that we do not have opportunity to look at today. And the IO rate is also down, not quite as much, which is just a reflection of the fact that there’s a little bit higher percentage of GETPAGE misses in the buffer pool.
Okay, so we’ll just look at this dashboard as an example of the kinds of views that we walked through and again, brought together buffer pool data and database data and IO performance and cache data. And being able to bring all those together then gives us, again, additional insights into what’s happening. And even could potentially, as we talked about, inform buffer pool tuning in ways that just one set of data would not provide.
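The percentage breakdown the video walks through is simple arithmetic once the interval totals are in hand. A sketch, with illustrative GETPAGE and I/O counts:

```python
# Sketch: split total GETPAGEs into the three shares described in the
# video: resolved in the buffer pool, satisfied by disk cache, and true
# disk misses. Counts are illustrative, not from the demo database.

def getpage_breakdown(getpages, sync_ios, cache_hits):
    """Return the buffer-hit / cache-hit / disk-miss percentages."""
    buffer_hits = getpages - sync_ios        # resolved in the buffer pool
    disk_misses = sync_ios - cache_hits      # I/Os that also missed disk cache
    return {
        "buffer_hit_pct": 100.0 * buffer_hits / getpages,
        "cache_hit_pct": 100.0 * cache_hits / getpages,
        "disk_miss_pct": 100.0 * disk_misses / getpages,
    }

result = getpage_breakdown(getpages=1_000_000, sync_ios=120_000, cache_hits=100_000)
print(result)
# {'buffer_hit_pct': 88.0, 'cache_hit_pct': 10.0, 'disk_miss_pct': 2.0}
```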
Managing the Gap in IT Spending and Revenue Growth in your Capacity Planning Efforts
Jack Opgenorth - 15 June 2023
How does your MSU growth compare to your revenue growth? As the new year begins, how are your goals aligning with the business?
While this is purely a hypothetical example that plots typical compute growth rates (~15%) against business revenue growth (~4%), many of you would agree with the widening gap. Whether it is MSUs or midrange server costs vs. revenue, this graphic is commonplace.
The simplification above ignores seasonality and other variables that may also impact growth, but it demonstrates a problem nearly every CIO has: how can the gap between IT spending and revenue and/or profit be managed so that additional spending on other key components of the business can deploy the cash needed to build revenue and profit?
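The gap is simple compounding. A sketch using the ~15% and ~4% rates from the hypothetical example, with both series indexed to 100:

```python
# Sketch: project compute demand (~15%/year) against revenue (~4%/year),
# both indexed to 100, to show the widening gap. Rates are the
# hypothetical ones from the example above, not any customer's data.

def project(start, rate, years):
    """Compound `start` at `rate` per year for `years` years."""
    return [start * (1 + rate) ** y for y in range(years + 1)]

YEARS = 5
compute = project(100, 0.15, YEARS)   # MSU demand index
revenue = project(100, 0.04, YEARS)   # revenue index
for y in range(YEARS + 1):
    print(f"year {y}: compute {compute[y]:6.1f}  revenue {revenue[y]:6.1f}  "
          f"gap {compute[y] - revenue[y]:5.1f}")
# By year 5 compute has roughly doubled (~201) while revenue is ~122.
```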
Marginal Capacity Plans
While the overall unit cost of computing continues to drop thanks to fantastic advances in the technology over the years, few involved in performance and capacity can claim a long-term drop in peak-demand ‘resource costs’ over time due to changes their organization has made. In fact, most would share something similar to the above graphic over the last few years. Is your latest projection going to look different?
The actual values observed in 2-5 years for resource demand and IT costs, as compared to prior forecasts, are often off by large margins. These deviations led a colleague to say of a new capacity plan: “the only thing I can be certain of with this plan is that it will be wrong.” While there are many possible reasons outside your control, it is a cold reality.
And it might ‘hurt’ our pride a bit, but there is some truth there.
Some advice to capacity planners
- Don’t take it personally.
- Planning is useful primarily as a budgeting tool; treat it as such. Don’t expect a beautiful mathematical modeling exercise that predicts the March Madness bracket winners correctly, because it won’t!
- The primary drivers influencing changes are typically outside your control. Don’t ignore them; list them, and try to obtain some valuable insights from your business and application partners. Focus on the top two or three (80/20 rule) and lump the rest into an ‘organic growth’ bucket. It’s a long tail, and quantifying all of it will cost you more in time and money than you can save by budgeting better for 25 different microservices that are each less than 1% of the budget.
- Tell the story with good graphics and succinct insights.
- Identify useful alternatives to just buying more CPUs, memory, disks and network bandwidth.
- Good performance is primary.
What is Good Performance?
Sometimes we in the mainframe arena take “good” performance for granted. We have WLM, Capacity on Demand, zIIPs, PAVs, and on down the list.
“Good performance” is meeting all of those response time goals while being efficient. Two people can drive from Denver to San Francisco with similar comfort in an SUV at 8 MPG with some bad spark plugs or in a 48 MPG hybrid vehicle.
Planning your trip does involve the workload and capacity of your current vehicle, but given your situation, our focus is on efficiency. We want to help you stretch that SUV’s 8 miles per gallon (MPG) to 25 MPG. Where should the focus be in mainframe performance before we produce the plan?
Some efficiency focused metrics worth pursuing include things like:
- latency (response time)
- CPU time per transaction
- Cache Hit %
- Relative Nest Intensity (RNI)
- and so on.
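A minimal sketch of deriving a few of the metrics above from interval totals; the field names here are illustrative, not actual SMF field names.

```python
# Sketch: compute per-transaction efficiency metrics from interval totals.
# The input dictionary's field names are illustrative placeholders.

def efficiency_metrics(interval):
    """Derive response time, CPU per transaction, and cache hit %."""
    txns = interval["transaction_count"]
    return {
        "avg_response_ms": 1000.0 * interval["total_response_sec"] / txns,
        "cpu_ms_per_txn": 1000.0 * interval["total_cpu_sec"] / txns,
        "cache_hit_pct": 100.0 * interval["cache_hits"] / interval["cache_accesses"],
    }

metrics = efficiency_metrics({
    "transaction_count": 50_000,     # transactions in the interval
    "total_response_sec": 600.0,     # summed response time
    "total_cpu_sec": 75.0,           # summed CPU time
    "cache_hits": 9_200_000,
    "cache_accesses": 10_000_000,
})
print(metrics)
# {'avg_response_ms': 12.0, 'cpu_ms_per_txn': 1.5, 'cache_hit_pct': 92.0}
```

Tracking these ratios over time, rather than raw totals, is what separates an efficiency trend from a mere volume trend.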
One very visible example from our own Todd Havekost demonstrates the value of lowering your RNI (Relative Nest Intensity) using some configuration changes.
Just a quick refresher on RNI: a lower value indicates improved processor efficiency for a constant workload. A lower RNI drives lower MSU consumption for the same workload, and the results can be significant!
There are several ways to drive for lower RNI, and the reference above gives you several ideas on where to start. Look at how a small change in a performance metric can alter the long-term capacity plan!
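As a rough illustration of that long-term effect, here is the deferral arithmetic for a one-time efficiency gain against steady demand growth. The 10% reduction and 15% growth rate are illustrative numbers, not a claim about any specific RNI change.

```python
# Sketch: how long a one-time MSU reduction (e.g. from a lower RNI)
# defers the point at which demand reaches a capacity limit.
# Solve (1 + g)^t = 1 / (1 - reduction) for t, then convert to months.
import math

def months_deferred(reduction, annual_growth):
    """Months of growth 'bought back' by a one-time fractional reduction."""
    years = math.log(1.0 / (1.0 - reduction)) / math.log(1.0 + annual_growth)
    return 12.0 * years

# A 10% MSU reduction at 15% annual growth buys roughly 9 months.
print(f"{months_deferred(0.10, 0.15):.1f} months")
```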
Performance Led Capacity Plan
While you won’t often receive a gift like this in your performance options, stay informed.
Part of your capacity planning regimen should be working with your colleagues in systems, Db2, CICS, and applications to solicit changes that might deliver a welcome demand drop and slower future growth. A small change in the rudder can move a mighty ship!
System Efficiency and Your Capacity Planning Process
Capacity planning is an important input to the budget process. Efficiency recommendations will help you keep your organization lean and productive. Look for some opportunities to improve efficiency as you produce the artifacts necessary for the budget process.
In this first blog, I have provided one efficiency check to consider: are there better ways to configure and manage your systems to reduce your RNI? In my next blog, I will open the door to some other ideas for evaluating efficiency as you prepare your capacity plan for future mainframe growth. Feel free to reach out to us for more insights as we develop part two of this post.