What can you do if your Cache Hit Percent drops too low and affects your tape performance?
What can you do if some of your tape jobs seem to be running too long and you observe that your Cache Hit Percent for your TS7740 virtual tape system is low?
First, let’s define Cache Hit Percent. When host systems mount a tape, there are 3 possibilities:
- A scratch mount – this is always a cache hit as long as the TS7700 fast ready categories are configured correctly. These are called Fast Ready mounts by the TS7700.
- A specific mount where the volser being requested is already in the TS7700’s cache. These are called Cache Read Hit mounts by the TS7700.
- A specific mount where the volser being requested is not in the TS7700’s cache. These are called Cache Read Miss mounts by the TS7700.
The first two mount types usually result in very quick tape mount times, on the order of 1-2 seconds. The mount time for a Cache Read Miss mount is much longer, since the data for the volser must be recalled from a physical tape cartridge into the cache before the host mount can complete.
Cache Hit Percent for an interval is the sum of the Fast Ready mounts and the Cache Read Hit mounts, divided by the total virtual mounts (that is, Fast Ready + Cache Read Hit + Cache Read Miss).
So, a low Cache Hit Percent means that more tape mounts are encountering longer mount times associated with a Cache Read Miss mount and may mean that jobs take longer. A general rule of thumb is the Cache Hit Percent should be at least 80%, but greater than 90% is preferable.
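The definition and the rule of thumb can be written out as a minimal Python sketch. The function names and the labels "low", "acceptable" and "preferred" are illustrative assumptions, not TS7700 or IntelliMagic Vision terminology; only the formula and the 80% and 90% levels come from the text.

```python
from typing import Optional


def cache_hit_percent(fast_ready: int, read_hit: int, read_miss: int) -> Optional[float]:
    """Cache Hit Percent for one interval, or None if there were no mounts."""
    total = fast_ready + read_hit + read_miss
    if total == 0:
        return None  # no virtual mounts, so the percentage is undefined
    return 100.0 * (fast_ready + read_hit) / total


def rate_hit_percent(pct: float) -> str:
    """Apply the rule of thumb: at least 80%, preferably above 90%."""
    if pct > 90.0:
        return "preferred"
    if pct >= 80.0:
        return "acceptable"
    return "low"


if __name__ == "__main__":
    pct = cache_hit_percent(fast_ready=60, read_hit=25, read_miss=15)
    print(pct, rate_hit_percent(pct))
```

For example, an interval with 60 Fast Ready, 25 Cache Read Hit and 15 Cache Read Miss mounts has 85 hits out of 100 mounts, which meets the 80% minimum but not the 90% preference.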
IntelliMagic Vision for zOS Tape can help us look at various aspects of the TS7700 to determine what actions (if any) need to be taken.
1) First we need to examine whether this is really a significant problem. Using IntelliMagic Vision’s thresholds and dashboards, we can determine when the Cache Hit Percentages are low. But the question is, “Are these having a major impact?” Perhaps the Cache Hit Percentage is low for a short period because there are very few mounts and/or very few scratch mounts during that time. We could examine the Cache Miss mounts chart to determine the absolute number of Cache Read Miss mounts. A small number, say 1-2 per minute, in the interval is probably not a concern. We should be concerned when the number of Cache Miss mounts is high, meaning a larger impact on tape mount performance.
Sample IntelliMagic Vision graphs are on the next page showing Cache Hit Percentage and Cache Miss Mounts for the same time period. Note that although the Cache Hit Percentage drops below 80% for a number of intervals, the number of Cache Miss mounts is below 20 (per 15-minute interval) in all but a few of those intervals.
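The idea of checking the absolute number of misses before reacting to a low percentage can be sketched as follows. This is an illustration only: the `Interval` record and `needs_attention` function are hypothetical, the data would have to come from your own exported interval statistics, and the default of 2 misses per minute is an assumption taken from the "1-2 per minute" guideline above.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    label: str          # e.g. the interval start time
    fast_ready: int     # scratch mounts
    read_hit: int       # specific mounts satisfied from cache
    read_miss: int      # specific mounts that needed a recall
    minutes: int = 15   # interval length


def needs_attention(iv: Interval,
                    min_hit_pct: float = 80.0,
                    max_miss_per_min: float = 2.0) -> bool:
    """True only if the hit percent is low AND the miss rate is significant."""
    total = iv.fast_ready + iv.read_hit + iv.read_miss
    if total == 0:
        return False
    hit_pct = 100.0 * (iv.fast_ready + iv.read_hit) / total
    miss_rate = iv.read_miss / iv.minutes
    return hit_pct < min_hit_pct and miss_rate > max_miss_per_min


if __name__ == "__main__":
    quiet = Interval("02:00", fast_ready=5, read_hit=5, read_miss=10)
    busy = Interval("09:00", fast_ready=50, read_hit=50, read_miss=60)
    for iv in (quiet, busy):
        print(iv.label, needs_attention(iv))
```

The quiet interval has a hit percent of only 50%, but with 10 misses in 15 minutes it is not flagged. The busy interval combines a low hit percent with 4 misses per minute, so it is.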
2) Now, let’s assume that there is a concern or problem. What else should we look at?
The first place to look is the average age of the oldest volume in cache. These graphs are under the TS7700 Cache focal point in the Pref. Group set. We want to look at the ages for the PG1 preference group. There are graphs for 35-day averages, 48-hour averages and 4-hour averages. I usually look at the 48-hour numbers. We want the average age to be at least 24 hours, and preferably closer to 48 hours or more. Why is this? Many production batch tape jobs read back a tape that was written 24 hours earlier, in the previous day’s production batch cycle. With an age of 24 hours or more, these should end up as Cache Read Hit mounts.
If our average ages are less than 24 hours, then we should look at licensing or adding more cache within the TS7740. Perhaps the tape workloads have grown over time and the existing cache is no longer adequate for them.
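As a small illustration, the 24-hour and 48-hour guidelines can be turned into a check on the PG1 average age. The function name and the three result labels are assumptions made for this sketch; the input is the average age in hours as read from the 48-hour graph.

```python
def assess_pg1_cache_age(avg_age_hours: float) -> str:
    """Classify the PG1 average cache age against the 24h / 48h guidelines."""
    if avg_age_hours < 0:
        raise ValueError("age cannot be negative")
    if avg_age_hours < 24:
        # Yesterday's batch output has likely left cache before it is re-read.
        return "consider more cache"
    if avg_age_hours < 48:
        return "adequate"
    return "comfortable"


if __name__ == "__main__":
    for age in (18, 30, 60):
        print(age, assess_pg1_cache_age(age))
```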
3) Let’s assume our Cache Average Ages look good. What else can we do?
Perhaps there are tape workloads within the TS7740 that are not very cache friendly and should be moved to another platform. We can use IntelliMagic Vision to examine the Miss mounts chart to determine if there are certain patterns of periods when the Miss mounts are higher. Then, we can use the IntelliMagic Vision Datasets focal point to isolate those periods and determine what datasets are being read.
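Looking for recurring periods of high Miss mounts amounts to grouping the interval counts by day of week and hour. A minimal sketch, assuming the interval data has been exported as (timestamp, miss count) pairs; the function names and the sample values are hypothetical:

```python
from collections import defaultdict
from datetime import datetime

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def miss_mounts_by_slot(samples):
    """Sum Cache Read Miss mounts per (weekday, hour) slot."""
    totals = defaultdict(int)
    for timestamp, misses in samples:
        slot = (WEEKDAYS[timestamp.weekday()], timestamp.hour)
        totals[slot] += misses
    return dict(totals)


def busiest_slots(samples, top=3):
    """Return the slots with the most miss mounts, highest first."""
    totals = miss_mounts_by_slot(samples)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top]


if __name__ == "__main__":
    samples = [
        (datetime(2024, 1, 6, 9, 0), 40),    # a Saturday morning
        (datetime(2024, 1, 6, 9, 15), 35),
        (datetime(2024, 1, 8, 14, 0), 5),    # a Monday afternoon
    ]
    print(busiest_slots(samples, top=2))
```

Once a slot stands out, the dataset-level view for that period shows which jobs and datasets are behind the recalls.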
Example 1: A high number of cache miss mounts occur on Saturday morning from 09:00 AM to 11:00 AM. We determine that this is due to a weekly job reading in a number of tapes created the previous week, or a consolidation job reading and consolidating some daily logs into a weekly log.
It is unlikely that we want to buy enough cache to keep these volumes in cache for 7 days; that would probably be too expensive if this were the only issue. Instead, there are several options:
- We could accept that this is going to happen and adjust the thresholds in IntelliMagic Vision so that it no longer shows as an exception.
- We could move these jobs to another platform such as native tape drives or a TS7720 platform.
- We could change the consolidation jobs to run daily, accumulating a week’s worth of logs by appending each day’s logs to the end of the tape (i.e. DISP=MOD). This eliminates the reads of 7-day-old volumes.
Example 2: The high number of cache miss mounts occurs on weekdays from 08:00 to 17:00. This is a common occurrence. Perhaps it is because we have one or more Archive/Retrieval applications (e.g. HSM ML2, SAR, Mobius ViewDirect, Job History System, etc.) being heavily accessed during the prime shift. Archive/Retrieval applications are typically not cache friendly, as they usually access data sets randomly.
If the number of Cache Miss mounts is not large or the mount time for these applications is not critical, then they could be left as-is with exception rules customized in IntelliMagic Vision to allow for this. Otherwise, these might be good applications to move to another tape platform such as native drives or the TS7720. The TS7720 is an excellent platform for these Archive/Retrieval applications as all reads should be cache hits in the TS7720 with its large amount of disk cache.
4) Finally, if none of the above can reduce the number of Cache Miss mounts, then we need to make sure that we have enough back-end tape drives to satisfy all of the Tape Recalls that these Cache Miss mounts create. Contention and queuing for these tape drives can elongate the Cache Read Miss mount times even further than normal.
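A rough way to reason about drive demand is Little's Law: the average number of recalls in progress equals the recall arrival rate multiplied by the average recall duration. The sketch below is not from any IBM or IntelliMagic sizing method. The function names, the mean recall time (which you would take from your own measurements) and the 50% target utilization are all assumptions. The back-end drives are also used for premigration and reclamation, so the result is at best a lower bound for recall work alone.

```python
import math


def avg_drives_busy(recalls_per_hour: float, mean_recall_minutes: float) -> float:
    """Little's Law: average concurrent recalls = arrival rate x service time."""
    return recalls_per_hour * mean_recall_minutes / 60.0


def drives_for_recalls(recalls_per_hour: float,
                       mean_recall_minutes: float,
                       target_utilization: float = 0.5) -> int:
    """Drives needed so recall work alone stays below the target utilization."""
    if not 0.0 < target_utilization <= 1.0:
        raise ValueError("target_utilization must be in (0, 1]")
    busy = avg_drives_busy(recalls_per_hour, mean_recall_minutes)
    return math.ceil(busy / target_utilization)


if __name__ == "__main__":
    # Hypothetical peak: 60 Cache Read Miss mounts per hour, 2 minutes per recall.
    print(avg_drives_busy(60, 2))
    print(drives_for_recalls(60, 2))
```

With these hypothetical numbers, two drives are busy with recalls on average, and keeping recall utilization at or below 50% would take four drives. Running close to 100% utilization is what produces the queuing delays described above.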
In summary:
1) Determine if the absolute number of Cache Misses requires action.
2) Determine if the Average Cache Age is still adequate.
3) Determine if any workload changes are needed.
4) Make sure adequate resources (e.g. back-end tape drives) are in place to support the Cache Misses.
This, of course, cannot cover all of the various possibilities.
Have you taken any other actions when your Cache Hit Percent has been low?