Small Changes Can Lead to Big Problems If Not Addressed
You may think that seismology and storage don’t have a lot in common, but I found a correlation between the two while I was looking at some Dell EMC VMAX data for one of our customers.
One of the purposes of a seismograph is to detect and trend tremors caused by stressors in underground rocks that can be precursors to a full-blown earthquake. Stressors in your storage infrastructure can also be harbingers of a performance issue. The difference is that we can address storage tremors before they become a full-blown performance quake.
Statistics, statistics, statistics
Statistical change detection in your SAN performance solution can compare the current day’s performance to that of the previous thirty days. Thirty days is a good baseline for comparison, as it captures weekends, workdays, and end-of-month workloads. Once you know what ‘normal’ is, you can then spot and trend deviations from that baseline.
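As a rough illustration of that comparison, here is a minimal Python sketch. It assumes one average read response time per day and expresses today's value as a number of standard deviations from the 30-day average. The function name and the sample values are hypothetical, and this is not a description of how any particular product implements change detection.

```python
from statistics import mean, stdev

def deviation_from_baseline(baseline, today):
    """Return how many standard deviations `today` lies from the baseline mean."""
    avg = mean(baseline)
    spread = stdev(baseline)  # sample standard deviation of the baseline
    if spread == 0:
        # A perfectly flat baseline: any difference at all is off the scale.
        if today == avg:
            return 0.0
        return float("inf") if today > avg else float("-inf")
    return (today - avg) / spread

# Hypothetical daily average read response times (ms) for the previous 30 days
baseline_ms = [2.0, 2.2, 1.8, 2.1, 1.9] * 6
today_ms = 2.5

score = deviation_from_baseline(baseline_ms, today_ms)
print(f"today is {score:+.1f} standard deviations from the 30-day average")
```

A positive score means today is slower than the baseline, a negative score means it is faster, and the size of the score tells you how unusual the day is.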
A standard deviation is a number that describes how widely measurements are spread around the average. The lower the standard deviation, the less variance from the average. Any measurement that lies more than 1 standard deviation above or below the average is notable. Figure 1 shows the standard deviation bell curve for a normal distribution.
Note that 68.2% of all samples in a normal distribution fall within ±1 standard deviation of the average and 95.4% fall within ±2 standard deviations. Only the remaining 4.6% of samples lie more than 2 standard deviations from the average, and only about 0.3% lie 3 or more standard deviations away; those are the exceptional ones.
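The percentages for a normal distribution can be checked with a few lines of Python. This sketch uses the standard result that the share of samples within ±k standard deviations is erf(k/√2). The helper name share_within is just an illustrative choice, and the results agree with the figures quoted above to within rounding.

```python
from math import erf, sqrt

def share_within(k):
    """Fraction of a normal distribution within ±k standard deviations of the mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    inside = share_within(k)
    print(f"within ±{k} sd: {inside:.1%}, beyond: {1 - inside:.1%}")
```

Real response-time data is rarely perfectly normal, so treat these percentages as a guide to how rare a deviation is rather than as exact odds.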
Was that a tremor?
While looking at a customer’s Dell EMC VMAX data with the statistical change detection built into IntelliMagic Vision, I noticed a change of almost 3 standard deviations in both the overall front-end response time and the front-end read response time, as shown in Figure 2. That is a significant deviation that should be investigated further.
When plotted across the past month, you can see that the response time on this array has increased markedly over the past week, as seen in Figure 3.
The yellow line shows the average read response time for the month, the blue line shows the daily average, and the red line shows the read response time over the past 30 days.
When you look at the storage array for the day in question, there are peaks when the read response time exceeds the warning threshold, but overall the front-end read performance for this storage array is good, as indicated by the green border of Figure 4. You can clearly see, however, the periods of increased response time.
This Dell EMC array has two storage pools, only one of which is showing the increased read response time, as shown in Figure 5.
Within Pool_2, there is only one volume that is showing a significant increase in front-end response time, volume 00306, as shown in Figure 6.
When we look at cluster read and write throughput, we see the same I/O pattern in read and write throughput that corresponds with the increased front-end response time for cluster SQL_CLS_001 as shown in Figure 7.
Furthermore, when we look at I/O intensity, a measure of a workload's impact on response time calculated as I/O rate × response time, we see the same pattern on the volume we looked at earlier, volume 00306, as shown in Figure 8.
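To make the I/O intensity calculation concrete, here is a small Python sketch that multiplies I/O rate by response time and ranks a few volumes by the result. The volume names and numbers are invented for illustration and are not taken from the customer data discussed here.

```python
def io_intensity(io_rate_per_s, response_time_ms):
    """I/O intensity as defined in the text: I/O rate multiplied by response time."""
    return io_rate_per_s * response_time_ms

# Hypothetical samples: (volume, I/Os per second, response time in ms)
samples = [
    ("vol_A", 4000, 0.5),
    ("vol_B", 1500, 4.0),
    ("vol_C", 200, 12.0),
]

# Rank volumes from highest to lowest intensity
ranked = sorted(samples, key=lambda s: io_intensity(s[1], s[2]), reverse=True)
for name, rate, rt in ranked:
    print(f"{name}: intensity {io_intensity(rate, rt):,.0f} ms of I/O time per second")
```

In this made-up data, the busiest volume by I/O rate (vol_A) is not the one with the highest intensity, which is why the combined measure is useful for finding the workload that contributes most to response time.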
Finally, Figure 9 compares the I/O intensity for cluster SQL_CLS_001 to that of a week ago. You can see how dramatically the workload has increased.
Although this increased workload isn’t yet affecting other hosts on the storage array, it is a dramatic change that is increasing over time. You now have the opportunity to work with the owners of SQL_CLS_001 to see what has changed to cause the dramatic workload increase and proactively adjust the environment before it causes an issue.
Finding the epicenter
Using multiple seismographs, a scientist can triangulate to find the epicenter of an earthquake. Similarly, using the right diagnostic equipment can allow you to look at different aspects of your storage array’s performance to find the source of the changing workload.
Manually comparing current statistics to a baseline is labor-intensive and cumbersome. IntelliMagic Vision automatically performs the statistical change detection analysis and points out deviations in the workload. Unlike seismologists, who cannot stop an earthquake, we can address a storage tremor before it grows into a quake that causes poor performance or an outage. Wait… Did you just feel that?