Profiling zHyperLink Performance and Usage
Dennis Moore - 21 August 2023
Although zHyperLink has been available for several years now, many mainframe sites have yet to implement the technology in their production environment. In this blog, we’ll review one site’s recent production implementation of zHyperLink for Db2 reads. We will provide examples of reporting at the PCIe adapter, volume, and data set levels, which you may find helpful in analyzing your own results.
Introduction to zHyperLink
Over the years, there have been many significant advances in I/O technology. Today, I/O is really, really fast. So how can we gain further improvement? One approach is to take a subset of I/Os for data that resides in close proximity to the processor, add a new higher-speed link, and make them synchronous in order to avoid interrupt processing and task switching. This is what zHyperLink does.
Figure 1 shows the physical connectivity between the CPC and the Storage Control Unit.
Figure 1: zHyperLink physical connectivity (Source: IBM Redbook “Getting Started with IBM zHyperLink for z/OS” Page 20)
zHyperLink provides a short-distance direct connection of up to 150 meters. I/Os are synchronous, eliminating the overhead of z/OS dispatcher delays, I/O interrupt processing, and processor cache reload. Latency improvements of up to 4.5x are possible with current hardware (z15 processors and DS8950F storage systems in this example). Think 20-microsecond I/O times, as we’ll see below.
There is a CPU time price to be paid, since the I/Os are synchronous and the processor waits for them to complete. And there are limitations on which data sets and I/Os are eligible. However, the benefits outweigh the costs for most customers. If you need assistance analyzing your workloads for zHyperLink candidates and assessing the anticipated impacts, please contact us at IntelliMagic.
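The CPU-versus-latency trade-off described above can be sketched with some back-of-envelope arithmetic. All numbers below are invented for illustration (real latencies and overheads vary widely by workload and hardware); the point is only the shape of the trade: a synchronous read burns its full latency as CPU while the processor waits, but saves the elapsed time and CPU overhead of the asynchronous dispatch/interrupt path.

```python
# Back-of-envelope trade-off for synchronous vs. asynchronous reads.
# All figures are hypothetical, for illustration only.
SYNC_LATENCY_US = 20.0        # CPU waits for the whole synchronous I/O
ASYNC_LATENCY_US = 130.0      # elapsed time incl. queueing and interrupt
ASYNC_CPU_OVERHEAD_US = 10.0  # dispatcher + interrupt + cache-reload cost

def elapsed_saved_us() -> float:
    """Elapsed-time benefit per read from going synchronous."""
    return ASYNC_LATENCY_US - SYNC_LATENCY_US

def extra_cpu_us() -> float:
    """Net CPU cost per read: busy-wait minus avoided async overhead."""
    return SYNC_LATENCY_US - ASYNC_CPU_OVERHEAD_US

print(f"elapsed time saved per read: {elapsed_saved_us():.0f} us")
print(f"extra CPU per read: {extra_cpu_us():.0f} us")
```

With these made-up numbers, each synchronous read costs 10 extra CPU microseconds but returns 110 microseconds sooner, which is why the benefits outweigh the costs for most customers.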
zHyperLink Connectivity Best Practices
Let’s look at data from a site that recently implemented zHyperLink and review the kinds of reports you can produce for your own analysis.
We’ll begin by verifying that our configuration adheres to best practices for connectivity. Figure 2 below shows a single LPAR and a single DS8950F subsystem. We see there are two Physical Channel IDs for PCIe, each with four PCIe Function IDs.
Figure 2: Best practice configuration for zHyperLink connections
Per best practices, you should be configured this way throughout your environment – 4 PFIDs per LPAR per zHyperLink connection (CEC, adapter ID, port #). We can also see the even distribution of I/O across PFIDs and ports. Including your other LPARs would also allow you to identify the busier LPARs.
Profile zHyperLink Usage and Performance
Now let’s look at our mix of zHyperLink and FICON I/O to start to profile our usage and performance.
Total I/O Rate
Figure 3 shows I/O rates over the course of a single day, from midnight to midnight.
Figure 3: Total I/O rate including breakdown of synchronous I/Os
Most of the I/Os are native FICON as usual, but you now also see spikes of successful synchronous reads and some synchronous read cache misses. Note that only synchronous reads are currently enabled in this data, so no writes are shown.
Synchronous I/Os
Figure 4 examines the synchronous I/O without the native FICON, for easier visibility into patterns.
Figure 4: Synchronous I/O breakdown including hits and cache misses
We see there are three main spikes of activity, at 0100, 0700 and 1600, with more than 10,000 synchronous I/Os at peak. We also see the read cache misses are a small percentage of the synchronous I/Os and follow the peaks, as you might expect. Read cache misses are re-driven as native FICON I/O which incurs overhead, so you’ll want to keep an eye on misses going forward.
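As a sketch of how you might keep an eye on misses, the snippet below computes the miss percentage per interval and flags anything above a threshold. The interval data, field names, and threshold are all hypothetical, not actual SMF/RMF field names; your reporting tool would supply the real counts.

```python
# Hypothetical interval counts: successful synchronous reads vs. cache
# misses re-driven as native FICON. All names and values are illustrative.
intervals = [
    {"time": "01:00", "sync_read_hits": 10400, "sync_read_misses": 310},
    {"time": "07:00", "sync_read_hits": 9800,  "sync_read_misses": 240},
    {"time": "16:00", "sync_read_hits": 11200, "sync_read_misses": 560},
]

MISS_PCT_THRESHOLD = 5.0  # arbitrary example threshold

def miss_pct(hits: int, misses: int) -> float:
    """Misses as a percentage of all synchronous read attempts."""
    total = hits + misses
    return 100.0 * misses / total if total else 0.0

for iv in intervals:
    pct = miss_pct(iv["sync_read_hits"], iv["sync_read_misses"])
    flag = "  <-- investigate" if pct > MISS_PCT_THRESHOLD else ""
    print(f"{iv['time']}  miss% = {pct:5.2f}{flag}")
```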
Synchronous Latency
Figure 5 compares latency to I/O rate for the synchronous read requests.
Figure 5: Synchronous latency versus I/O rate
The top red line shows latency, or response time, varying from 16-22 microseconds (way fast!) over the course of the day. The blue line, with the three peaks that should look familiar by now, shows the request rate. Note that as the request rate goes up, the response time comes down, staying consistently under 20 microseconds.
Synchronous Read Hits
We can also report synchronous hits and misses by data set name. Figure 6 shows a daily summary by data set name, but you could display any of these over time as well.
Figure 6: Synchronous read hits and misses by data set
This site turned on synchronous read processing for Db2, so we’re looking at the hit percentage for production Db2 data sets.
Time Pattern for Synchronous Requests
With any new implementation, it can be helpful to understand the time patterns that apply to the workload involved. Figure 7 shows that the single day we’ve been looking at is typical.
Figure 7: Time pattern for synchronous requests
This Db2 workload drives most of its synchronous I/O in the 0100-0300 hours consistently, with little change by day of the week.
Data Set Level Performance
Another way to look at data set level performance is shown in Figure 8.
Figure 8: Times and latency for individual data sets
Figure 8 shows the top 30 linear data sets with ‘DBP1’ in their name, with response times and the time of day when they were accessed. Reports like this, along with the related reporting in IntelliMagic Vision, provide a strong capability to profile your workload and to understand technology benefits or diagnose problems or changes.
Demonstrating the Benefit of Improved Technology
zHyperLink utilizes improved technology including the PCIe bus and synchronous I/O to significantly improve latency for applications. This improvement can be shown explicitly from the RMF/SMF data and is key to justifying new technologies and new projects.
IntelliMagic Vision gives you an easy and comprehensive way to verify the benefit, not just for zHyperLink but for the many enhancements your teams implement each year.
Interactive FICON Topology Viewer Spotlights Configuration Errors
21 August 2023
Having an accurate picture of FICON topology is essential for identifying configuration errors or sub-optimal configuration within the z/OS infrastructure. With the release of 12.5.0, IntelliMagic Vision introduced an interactive FICON Topology Viewer. Using the FICON Topology Viewer, performance analysts can now visualize, and interact with, their entire FICON infrastructure.
Until now, mainframe analysts hoping to achieve this visualization have relied upon manually printed, static visualizations of their topology – often taped to a wall – in order to evaluate their FICON configuration and spot errors. IntelliMagic Vision users are now able to save countless hours of manual examination in their analysis.
Use Cases
The FICON Topology Viewer helps analysts identify configuration errors, ensure that the infrastructure is configured correctly, and reveal undesirable infrastructure changes.
Example use cases include:
- FICON Channel Speed is Auto-Negotiated To A Lower Speed: This issue often occurs when a component in the FICON infrastructure cannot run at the faster speed. Using the FICON Topology view, analysts can look at the individual ports/connections for the CEC, FICON Switch and Disk, and determine where the problem is. This may be a microcode issue, a hardware component problem or simply a configuration issue.
- LPARs Running at a Different FICON Speed to the Same Disk/Tape Units: The FICON Topology view allows you to see if all of the LPARs and connections are running at the same and desired speed.
- Verify All LPARs Have Desired Number Of FICON Connections to the Specific Device: Typically, FICON connections to Disks and Tapes are defined for not only performance and throughput, but also availability. The FICON Topology Viewer allows analysts to verify that all LPARs have the desired number of FICON connections to the specific device.
- Verify the Infrastructure is Correctly Defined For Emergency Site Fail-Over: The FICON Topology view allows analysts to verify that the primary and secondary disk systems have the same infrastructure configuration on both sites.
- Verify the Configuration and Optimize Component Usage: Drilldowns are available in the FICON Topology Viewer to show charts for a specific component. For example, drilling down on a Disk can show the front-end adapter utilization in an intuitive min/max/average chart for all of the adapters. This allows analysts to verify that all the components are being used and have similar utilization.
- Identify System Outages or Offline FICON Channels: The time-selection and compare feature within the FICON Topology Viewer allows analysts to identify issues such as FICON channels being put offline (possibly due to error conditions) or system outages (LPAR IPLs).
The video below demonstrates the FICON Topology Viewer and how access to interactive data with the FICON topology enables analysts to easily spot changes and assess their configuration.
Managing the Gap in IT Spending and Revenue Growth in your Capacity Planning Efforts
Jack Opgenorth - 21 August 2023
How does your MSU growth compare to your revenue growth? As the new year begins, how are your goals aligning with the business?
While this is purely a hypothetical example that plots typical compute growth rates (~15%) against business revenue growth (~4%), many of you would agree with the widening gap. Whether it is MSUs or midrange server costs vs. revenue, this graphic is commonplace.
The simplification above ignores seasonality and other variables that may also impact growth, but it demonstrates the problem nearly every CIO has. How can the gap for IT spending vs. revenue and/or profit be managed so that additional spending on other key components of the business can deploy the cash needed to build revenue and profit?
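The arithmetic behind that widening gap is just compounding at different rates. The sketch below uses the blog's illustrative ~15% compute growth and ~4% revenue growth, both indexed to 100, to show how quickly the two curves diverge; the rates are the hypothetical ones from the example above, not measured figures.

```python
# Compounding the gap: ~15% annual compute (MSU) growth vs. ~4% revenue
# growth, both indexed to 100. Rates are the blog's hypothetical example.
COMPUTE_GROWTH = 0.15
REVENUE_GROWTH = 0.04

def indexed(rate: float, years: int, base: float = 100.0) -> float:
    """Value of an index growing at `rate` per year for `years` years."""
    return base * (1 + rate) ** years

for year in range(6):
    msu = indexed(COMPUTE_GROWTH, year)
    rev = indexed(REVENUE_GROWTH, year)
    print(f"year {year}: compute {msu:6.1f}, revenue {rev:6.1f}, "
          f"ratio {msu / rev:4.2f}")
```

After five years the compute index is roughly two thirds larger than the revenue index, which is the gap the CIO has to manage.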
Marginal Capacity Plans
While the overall unit cost of computing continues to drop thanks to fantastic advances in technology over the years, few involved in performance and capacity can claim a long-term drop in peak-demand resource costs resulting from changes their organization has made. In fact, most would share something similar to the above graphic over the last few years. Is your latest projection going to look different?
The actual values observed in 2-5 years for resource demand and IT costs, as compared to prior forecasts, are often off by large margins. These deviations led a colleague to say of a new capacity plan: “the only thing I can be certain of with this plan is that it will be wrong”. While there are many possible reasons outside your control, it is a cold reality.
And it might ‘hurt’ our pride a bit, but there is some truth there.
Some advice to capacity planners
- Don’t take it personally.
- Planning is useful primarily as a budgeting tool – treat it as such. Don’t expect a beautiful mathematical modeling exercise that predicts the march madness bracket winners correctly – because it won’t!
- The primary drivers influencing changes are typically outside your control. Don’t ignore them; list them, and try to obtain some valuable insights from your business and application partners. Focus on the top two or three (the 80/20 rule) and lump the rest into an ‘organic growth’ bucket. It’s a long tail, and quantifying all of it will cost you more in time and money than you can save by budgeting better for 25 different microservices that each amount to less than 1% of the budget.
- Tell the story with good graphics and succinct insights.
- Identify useful alternatives to just buying more CPUs, memory, disks and network bandwidth.
- Good performance is primary.
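The 80/20 advice above can be sketched as a toy forecast: model the top two or three drivers explicitly and lump everything else into a single organic-growth bucket. All workload names, MSU figures, and growth rates below are invented for illustration.

```python
# Toy budget-style forecast: explicit top drivers + one organic bucket.
# Every name and number here is hypothetical.
top_drivers = {
    "core_banking":   {"msu": 400, "growth": 0.12},
    "new_mobile_app": {"msu": 150, "growth": 0.30},
}
organic = {"msu": 250, "growth": 0.05}  # the long tail, lumped together

def forecast_msu(years: int) -> float:
    """Total projected MSU after `years` years of compound growth."""
    total = sum(d["msu"] * (1 + d["growth"]) ** years
                for d in top_drivers.values())
    total += organic["msu"] * (1 + organic["growth"]) ** years
    return total

print(f"today: {forecast_msu(0):.0f} MSU, in 3 years: {forecast_msu(3):.0f} MSU")
```

The payoff of this structure is that the conversation with business partners stays focused on the two or three lines that actually move the total, while the tail costs almost nothing to maintain.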
What is Good Performance?
Sometimes we in the mainframe arena take “good” performance for granted. We have WLM, Capacity on Demand, zIIPs, PAVs, and on down the list.
“Good performance” is meeting all of those response time goals while being efficient. Two people can drive from Denver to San Francisco with similar comfort in an SUV at 8 MPG with some bad spark plugs or in a 48 MPG hybrid vehicle.
Planning your trip does involve the workload and capacity of your current vehicle, but given your situation, our focus is on efficiency. We want to help you stretch that 8 miles per gallon (MPG) toward 25 MPG for the SUV. Where should the mainframe performance focus be before we produce the plan?
Some efficiency focused metrics worth pursuing include things like:
- latency (response time)
- CPU time per transaction
- Cache Hit %
- Relative Nest Intensity (RNI)
- and so on.
One very visible example from our own Todd Havekost demonstrates the value of lowering your RNI (Relative Nest Intensity) using some configuration changes.
Just a quick refresher on RNI: a lower value indicates improved processor efficiency for a constant workload. A lower RNI drives lower MSU consumption for the same workload, and the results can be significant!
There are several ways to drive for lower RNI, and the reference above gives you several ideas on where to start. Look at how a small change in a performance metric can alter the long-term capacity plan!
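To see why a small efficiency change alters the long-term plan, consider the arithmetic below. It is a hypothetical illustration only: the efficiency gain, baseline MSU, and growth rate are invented, and the actual RNI formula and its MSU impact vary by machine generation and workload. The point is that an efficiency improvement reduces MSU proportionally for the same work, and that saving compounds against future growth.

```python
# Hypothetical illustration: an efficiency gain (e.g., from lower RNI)
# reduces MSU for the same workload, and the saving persists as the
# workload grows. All numbers are invented.
baseline_msu = 1000.0
efficiency_gain = 0.08   # assume an 8% CPU-per-unit-of-work improvement
annual_growth = 0.10     # assumed workload growth rate

def projected_msu(years: int, gain: float = 0.0) -> float:
    """MSU after `years` of growth, optionally with the efficiency gain."""
    return baseline_msu * (1 - gain) * (1 + annual_growth) ** years

for y in (0, 3):
    print(f"year {y}: without change {projected_msu(y):7.1f} MSU, "
          f"with change {projected_msu(y, efficiency_gain):7.1f} MSU")
```

Under these assumptions, the one-time 8% improvement is still worth over 100 MSU at the three-year mark, which is exactly the kind of lever a performance-led capacity plan looks for.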
Performance Led Capacity Plan
While you don’t often receive a gift like this among your performance options, stay informed.
Part of your capacity planning regimen should be working with your colleagues in systems, Db2, CICS, and applications to solicit changes that might deliver a welcome drop in demand and slower future growth. A small change in the rudder can move a mighty ship!
System Efficiency and Your Capacity Planning Process
Capacity planning is an important input to the budget process. Efficiency recommendations will help you keep your organization lean and productive. Look for some opportunities to improve efficiency as you produce the artifacts necessary for the budget process.
In this first blog, I have provided you one efficiency check to consider. Are there better ways to configure and manage your systems to reduce your RNI? In my next blog, I will open the door for some other ideas to evaluate efficiency as you prepare your capacity plan for the future mainframe growth. Feel free to reach out to us for more insights as we develop part two of this post.