Encoding errors: are they distracting noise or important information that shouldn’t be ignored? There’s so much noise in a big fabric that it can be hard to know what should be investigated. In this blog we will explain the meaning of a few of the common errors we see, the risk of ignoring them, and the best way to address them.
Encoding Errors in Your SAN Fabric
Encoding errors are low-level errors that indicate an encoding disparity inside frames. They occur in the Fibre Channel FC-1 layer, which encodes data from 8 bits to 10 bits and back (8b/10b) or, with 10G and 16G FC, from 64 bits to 66 bits and back (64b/66b).
Because encoding errors occur on bits that are part of a data frame, they are counted in the Encoding Errors column. If the Cyclic Redundancy Check (CRC) Errors increase as well, then there is likely a physical link issue, which could be resolved by cleaning connectors, replacing a cable, or replacing the small form-factor pluggable (SFP) and/or Host Bus Adapter (HBA). If there are no CRC errors, then it is likely an issue with the cable only.
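To put the two encoding schemes side by side, the arithmetic below works out how much of the line rate carries data under each one. It is only an illustration of the 8-to-10 and 64-to-66 ratios mentioned above; the function name is our own and not part of any Fibre Channel tooling.

```python
from fractions import Fraction

def encoding_efficiency(data_bits, line_bits):
    """Fraction of transmitted bits that carry data for a given encoding."""
    return Fraction(data_bits, line_bits)

# The two encodings named in the text: 8 data bits sent as 10, and 64 sent as 66.
schemes = {"8b/10b": (8, 10), "64b/66b": (64, 66)}

for name, (data_bits, line_bits) in schemes.items():
    eff = encoding_efficiency(data_bits, line_bits)
    print(f"{name}: {float(eff):.1%} data, {float(1 - eff):.1%} encoding overhead")
```

The extra bits are what let the receiver detect a corrupted transmission word, which is where encoding errors come from.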
Figure 1 below shows the top 20 ports with Encoding Errors. The offending port, slot4 port6, is highlighted in the legend in bold.
Figure 1: Encoding Errors (top 20)
In this case, we received an alert about the encoding errors as well as the CRC errors. While looking at the chart of the errors over time as shown in Figure 2, we noticed the same pattern.
Figure 2: Cyclic Redundancy Check (CRC) Errors
Figure 2 shows the number of CRC errors. Encoding errors might lead to a CRC error; however, this metric counts frames that have been marked as invalid because of a CRC error earlier in the datapath.
According to the FC specifications, it is up to the implementation whether to discard such a frame right away or to mark it as invalid and send it to the destination anyway. There are pros and cons to both approaches.
Essentially, if you see CRC errors, it means the port has received a frame with an incorrect CRC, but the corruption occurred further upstream. If the Encoding Errors increase as well, there is a physical link issue, which could be resolved by cleaning connectors, replacing a cable, or (in rare cases) replacing the SFP and/or HBA.
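The rules of thumb above can be written down as a small triage helper. This is a minimal sketch of the reasoning in this post, not a vendor algorithm: the function name, argument names, and return strings are our own, and the inputs are assumed to be the increase in each counter on one port over some interval.

```python
def triage_port(encoding_errors: int, crc_errors: int) -> str:
    """Suggest a next step from the error increases seen on one port."""
    if encoding_errors > 0 and crc_errors > 0:
        # Both counters rising: treated in the text as a physical link issue.
        return "physical link: clean connectors, replace cable, then consider SFP/HBA"
    if encoding_errors > 0:
        # Encoding errors without CRC errors: the text points at the cable.
        return "likely cable only: inspect or replace the cable"
    if crc_errors > 0:
        # CRC errors alone: the frame was already bad when it arrived.
        return "corruption upstream: check ports earlier in the datapath"
    return "no action"

# Hypothetical counts for illustration only.
print(triage_port(encoding_errors=120, crc_errors=35))
```

A real fabric will have more counters and more nuance than this, so treat the output as a starting point for investigation.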
Fibre Channel design best practice stipulates that each host maintain at least two paths to its data so that redundancy is preserved at all times. If one path fails, the redundancy is gone and the host is vulnerable to losing its connection to the data.
In this case, we started to receive these errors along with a few others, and we alerted the customer to check the cable and SFP. From 10:00 AM on 11/11/2019 to 11:00 AM on 11/12/2019, while the host initiator port was throwing these errors, the host effectively had only a single path to the fabric. Fortunately, the other path was functional, but during this period the availability risk for this server was high.
On 11/12 the cable connection was inspected in the data center and was found to be loose. Someone may have been working on the fabric and bumped the cable. It was reconnected tightly at around 11:00 AM, after which the errors stopped and datapath redundancy was restored.
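The two-path rule described above is easy to check once you have path status per host. The sketch below assumes a hypothetical inventory, a dictionary mapping each host to its paths and whether each path is healthy; the host and fabric names are made up for illustration.

```python
MIN_PATHS = 2  # best practice from the text: at least two paths per host

def hosts_at_risk(paths_by_host):
    """Return hosts with fewer than MIN_PATHS healthy paths, sorted by name."""
    at_risk = []
    for host, paths in paths_by_host.items():
        healthy = sum(1 for ok in paths.values() if ok)
        if healthy < MIN_PATHS:
            at_risk.append(host)
    return sorted(at_risk)

# Hypothetical inventory: host-b has lost one of its two paths.
inventory = {
    "host-a": {"fabric-A": True, "fabric-B": True},
    "host-b": {"fabric-A": False, "fabric-B": True},
}
print(hosts_at_risk(inventory))
```

A host flagged here is still online, as in the case above, but one more failure would cut it off from its data.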
Avoid Risks in Your Fabric Environment
In this blog we looked at some of the key indicators that your fabric is behaving well. It is essential that you monitor your fabric and alert when there are errors, as the errors are often a warning that something is going to fail soon or an alert that something has already failed.
It is important to understand the errors so you can treat them in accordance with their severity and impact. In order to effectively do this, you must collect the right data and set up the appropriate alerting mechanisms.
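As a rough illustration of what such alerting might look like, the sketch below compares two samples of a cumulative error counter and flags the ports where it grew. It assumes counters are cumulative per port and that a value lower than the previous sample means the counter was reset; the function name, threshold, and sample values are hypothetical.

```python
def rising_counters(previous, current, threshold=0):
    """Return {port: increase} for ports whose counter grew by more than threshold."""
    flagged = {}
    for port, now in current.items():
        before = previous.get(port, 0)
        # A lower value than before suggests a counter reset, so count from zero.
        delta = now - before if now >= before else now
        if delta > threshold:
            flagged[port] = delta
    return flagged

# Hypothetical samples of encoding error counters taken one interval apart.
earlier = {"slot4 port6": 100, "slot1 port2": 5}
later = {"slot4 port6": 450, "slot1 port2": 5}
print(rising_counters(earlier, later))
```

Alerting on the increase rather than the absolute value is one way to cut down on false positives from ports that logged errors long ago and have been clean since.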
Are you monitoring your fabric for these types of issues? Do you have a way to filter out the false positives? Do you understand what all the errors signify, and which ones can be ignored? If you would like a review of the health of your SAN fabric, or would like to engage IntelliMagic for proactive monitoring of your fabric, please let us know how we can help by sending an email to info@intellimagic.com.