Encoding errors: are they distracting noise or important information that shouldn’t be ignored? A big fabric generates so much noise that it can be hard to know what should be investigated. In this blog we explain a few of the common errors we see, the risk of ignoring them, and the best way to address them.
Encoding Errors in Your SAN Fabric
Encoding errors are low-level errors that indicate encoding disparity inside frames. They occur at the FC-1 layer of the Fibre Channel standard, which encodes 8 bits into 10 and back (8b/10b) or, with 10G and 16G FC, 64 bits into 66 and back (64b/66b).
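As a rough illustration of what "encoding disparity" means, the sketch below counts ones versus zeros in a 10-bit code group. Valid 8b/10b code groups always have a disparity of 0, +2, or -2, so anything else cannot be a valid code group. The function names are hypothetical, and this is only a necessary check, not a full decoder: real hardware also tracks running disparity and looks code groups up in tables. The last helper shows why the two schemes differ in overhead.

```python
def disparity(code_group: str) -> int:
    """Return (number of 1 bits) - (number of 0 bits) for a 10-bit string."""
    if len(code_group) != 10 or set(code_group) - {"0", "1"}:
        raise ValueError("expected a 10-character string of 0s and 1s")
    ones = code_group.count("1")
    return ones - (10 - ones)


def plausible_8b10b(code_group: str) -> bool:
    """Necessary (not sufficient) condition for a valid 8b/10b code group."""
    return disparity(code_group) in (-2, 0, 2)


def line_efficiency(data_bits: int, coded_bits: int) -> float:
    """Fraction of transmitted bits that carry data."""
    return data_bits / coded_bits


balanced = "0101010101"      # five ones, five zeros
slightly_heavy = "0011111010"  # six ones, four zeros
corrupted = "1111111100"     # eight ones: cannot be a valid code group

eff_8b10b = line_efficiency(8, 10)     # 20% of line bits are overhead
eff_64b66b = line_efficiency(64, 66)   # roughly 3% overhead
```

A receiver that sees a group like `corrupted` knows a bit was damaged on the link, which is the kind of event an encoding error counter records.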
Because these errors occur on bits that are part of a data frame, they are counted as encoding errors. If the Cyclic Redundancy Check (CRC) errors increase as well, then there is likely a physical link issue, which could be resolved by cleaning connectors, replacing a cable, or replacing the small form-factor pluggable (SFP) and/or Host Bus Adapter (HBA). If there are no CRC errors, then it is likely an issue with the cable only.
Figure 1 below shows the top 20 ports with Encoding Errors. The offending port, slot4 port6, is highlighted in the legend in bold.
In this case, we received an alert about the encoding errors as well as the CRC errors. While looking at the chart of the errors over time as shown in Figure 2, we noticed the same pattern.
Figure 2 shows the number of CRC errors. Encoding errors might lead to a CRC error; however, this metric counts frames that have been marked as invalid because of a CRC error earlier in the datapath.
According to the FC specifications, it is up to the implementer whether to discard the frame right away or to mark it as invalid and send it to the destination anyway. There are pros and cons to both approaches.
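To make the two options concrete, here is a minimal sketch of a receiver that checks a CRC and then either discards a bad frame or forwards it marked as invalid. It uses Python's `zlib.crc32` as a stand-in for the frame CRC, and the frame layout (payload followed by a 4-byte CRC), the function names, and the policy names are all assumptions made for illustration, not the actual Fibre Channel frame format or any switch vendor's behavior.

```python
import zlib


def build_frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC to the payload (toy layout, not real FC framing)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def crc_ok(frame: bytes) -> bool:
    """Recompute the CRC over the payload and compare it with the trailer."""
    payload, trailer = frame[:-4], frame[-4:]
    return zlib.crc32(payload) == int.from_bytes(trailer, "big")


def receive(frame: bytes, policy: str = "discard"):
    """Return (frame_or_None, marked_invalid) according to the chosen policy."""
    if policy not in ("discard", "mark_invalid"):
        raise ValueError("unknown policy: " + policy)
    if crc_ok(frame):
        return frame, False
    if policy == "discard":
        return None, False   # drop the bad frame right away
    return frame, True       # forward it, flagged as invalid


good = build_frame(b"example payload")

# Flip one bit to simulate corruption somewhere on the link.
damaged = bytearray(good)
damaged[3] ^= 0x01
damaged = bytes(damaged)
```

With the `mark_invalid` policy, a downstream port still sees the bad frame, which is why a CRC error counter can increase on a port whose own link is healthy.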
Essentially, if you see CRC errors, the port has received a frame with an incorrect CRC, but the corruption occurred further upstream. If the Encoding Errors increase as well, there is a physical link issue, which could be resolved by cleaning connectors, replacing a cable, or (in rare cases) replacing the SFP and/or HBA.
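The rules of thumb above can be sketched as a small triage function. The inputs are assumed to be the increase in each counter over some monitoring interval, and the function name and return strings are hypothetical. It is a summary of the guidance in this post, not a diagnostic tool.

```python
def triage(encoding_delta: int, crc_delta: int) -> str:
    """Map counter increases for one port to the suggested next step."""
    if encoding_delta > 0 and crc_delta > 0:
        # Both counters rising points at the physical link on this port.
        return "physical link: clean connectors, replace cable, then SFP/HBA"
    if encoding_delta > 0:
        # Encoding errors without CRC errors: most likely the cable alone.
        return "likely cable only: inspect or replace the cable"
    if crc_delta > 0:
        # CRC errors without encoding errors: the frame was damaged upstream.
        return "corruption upstream: check ports earlier in the datapath"
    return "no action"


both_rising = triage(encoding_delta=120, crc_delta=15)
encoding_only = triage(encoding_delta=40, crc_delta=0)
crc_only = triage(encoding_delta=0, crc_delta=7)
quiet = triage(encoding_delta=0, crc_delta=0)
```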
Fibre Channel design best practice stipulates that each host maintain at least two paths to the data so that redundancy is preserved at all times. If one path fails, the redundancy is gone and the host is vulnerable to losing its connection to the data.
In this case, we started to receive these errors along with a few other errors, and we alerted the customer to check the cable and SFP. Between 10:00 AM on 11/11/2019 and 11:00 AM on 11/12/2019, while the host initiator port was throwing these errors, the host had only a single working path to the fabric. Fortunately, that remaining path stayed functional, but during this period the availability risk for this server was high.
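As a quick worked example, the exposure window in this incident can be computed from the two timestamps above. The sketch assumes the host normally has exactly two paths (the text mentions only "the other path") and uses hypothetical helper names.

```python
from datetime import datetime

# Timestamps taken from the incident described above.
errors_start = datetime(2019, 11, 11, 10, 0)
errors_end = datetime(2019, 11, 12, 11, 0)

# How long the host ran without path redundancy.
exposure_hours = (errors_end - errors_start).total_seconds() / 3600


def healthy_paths(total_paths: int, failed_paths: int) -> int:
    """Number of paths still usable by the host."""
    return total_paths - failed_paths


def lacks_redundancy(total_paths: int, failed_paths: int) -> bool:
    """True when one more path failure would cut the host off from its data."""
    return healthy_paths(total_paths, failed_paths) <= 1


# Assumption: two paths configured, one degraded by the loose cable.
at_risk = lacks_redundancy(total_paths=2, failed_paths=1)
```

Here `exposure_hours` comes to 25, so the server spent just over a day one failure away from losing access to its data.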
On 11/12 the cable connection was inspected in the data center and found to be loose. Someone may have been working on the fabric and bumped the cable. It was reconnected firmly at around 11:00 AM, after which the errors stopped and datapath redundancy was restored.
Avoid Risks in your Fabric Environment
In this blog we looked at some of the key indicators of whether your fabric is behaving well. It is essential that you monitor your fabric and alert when there are errors, as the errors are often a warning that something is going to fail soon or a sign that something has already failed.
It is important to understand the errors so you can treat them in accordance with their severity and impact. In order to effectively do this, you must collect the right data and set up the appropriate alerting mechanisms.
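One simple way to separate noise from a real problem is to alert on how fast a counter grows instead of on its absolute value. The sketch below assumes port error counters are cumulative and sampled at a fixed interval. The function names, the sample values, and the threshold of 100 are illustrative assumptions, not recommended settings.

```python
def error_deltas(samples):
    """Per-interval increases from a list of cumulative counter samples."""
    deltas = []
    for prev, curr in zip(samples, samples[1:]):
        # A lower value than before is treated as a counter reset.
        deltas.append(curr - prev if curr >= prev else curr)
    return deltas


def intervals_to_alert(samples, threshold):
    """Indices of samples whose increase over the previous sample exceeds threshold."""
    return [
        i for i, delta in enumerate(error_deltas(samples), start=1)
        if delta > threshold
    ]


# Hypothetical encoding error counter for one port, sampled six times.
port_samples = [0, 0, 3, 250, 900, 900]
deltas = error_deltas(port_samples)
alerts = intervals_to_alert(port_samples, threshold=100)
```

In this made-up series the small increase of 3 is ignored, while the jumps at the fourth and fifth samples would raise an alert.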
Are you monitoring your fabric for these types of issues? Do you have a way to filter out the false positives? Do you understand what all the errors signify, and which ones can be ignored? If you would like a review of the health of your SAN fabric, or would like to engage IntelliMagic for proactive monitoring of your fabric, please let us know how we can help by sending an email to email@example.com.