
[HCI-VN] Node goes offline after a cable is plugged into a host network port (NIC can send packets but cannot receive them)

Problem Description

On a Node HCI host with an X722 NIC, plugging a network cable into a port causes the Node to go offline.
The Node does not always go offline. In some cases a packet capture on the network port shows the outgoing request packets but no reply packets.

Warning Information

none

Effective troubleshooting steps

1. Capture packets on site to confirm the symptom
After plugging in the cable, ping an external address such as the gateway; any test that sends packets out will do.
Then capture packets on the port and check whether only the request packets sent from HCI appear. If so, read on.
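
The capture check in step 1 can be sketched as a small shell helper. This is only a minimal sketch: it assumes the capture is taken with tcpdump and that its text output contains the usual "ICMP echo request" and "ICMP echo reply" strings. The function name classify_icmp_capture is hypothetical and is not part of HCI.

```shell
# Hypothetical helper (not an HCI tool): reads tcpdump text output on stdin
# and reports whether only ICMP requests were seen, which is the symptom
# described in this case.
classify_icmp_capture() {
    awk '
        /ICMP echo request/ { req++ }
        /ICMP echo reply/   { rep++ }
        END {
            printf "requests=%d replies=%d\n", req, rep
            if (req > 0 && rep == 0)
                print "only requests seen: continue with step 2"
            else if (req > 0)
                print "replies seen: not this case"
            else
                print "no ICMP traffic captured"
        }'
}
```

A possible use while the ping is running, assuming the port is eth2: tcpdump -i eth2 -nn -c 20 icmp | classify_icmp_capture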
2. View network port statistics
Packets can be lost in two places:

  • Lost in the physical network
  • Lost in the physical NIC

To confirm whether they are lost in the physical NIC:
  1. Check the network port statistics for drop/err counts. If there are none, continue with the next check.
  2. Check whether the sent and received packet counters (tx packets/rx packets) are increasing.
    If tx packets does not increase, the packets are lost in the TX direction of the NIC.
    If rx packets does not increase, the request packet was sent and the reply packet was lost in the RX direction of the NIC.
In this case, the rx packets counter did not increase.

If the network port is taken over by dp, check:
echo -e "show interface eth2"|cli|grep -Ei "drop|err"

If it is not taken over, check:
ifconfig eth2
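
The counter comparison in step 2 can be sketched as follows. This is a rough illustration under stated assumptions: the function names diagnose_counters and read_counter are hypothetical, and reading /sys/class/net/<port>/statistics only works for ports managed by the Linux kernel, not for ports taken over by dp.

```shell
# Hypothetical helper: decide which direction is stuck from two samples of
# the packet counters, taken a few seconds apart while a ping is running.
# Arguments: rx_before tx_before rx_after tx_after
diagnose_counters() {
    rx_delta=$(( $3 - $1 ))
    tx_delta=$(( $4 - $2 ))
    if [ "$tx_delta" -le 0 ]; then
        echo "tx not increasing: packets lost in the TX direction of the NIC"
    elif [ "$rx_delta" -le 0 ]; then
        echo "rx not increasing: replies lost in the RX direction of the NIC"
    else
        echo "rx and tx both increasing: check the physical network"
    fi
}

# Read one counter from sysfs (assumption: kernel-managed port only).
# Arguments: port name, counter name (rx_packets or tx_packets)
read_counter() {
    cat "/sys/class/net/$1/statistics/$2"
}

# Example usage (not run here), assuming the port is eth2:
#   rx1=$(read_counter eth2 rx_packets); tx1=$(read_counter eth2 tx_packets)
#   sleep 10
#   rx2=$(read_counter eth2 rx_packets); tx2=$(read_counter eth2 tx_packets)
#   diagnose_counters "$rx1" "$tx1" "$rx2" "$tx2"
```

The sketch checks TX first, because if nothing is being sent, no reply can be expected and the RX counter alone says little.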


3. Dump the network port registers for further confirmation
The previous step is usually enough to confirm the problem. To confirm it further, dump the NIC registers and check whether the NIC itself is at fault.
PS: Perform the dump below twice, with the second run about 1 minute after the first.

If the NIC is taken over by dpdk, for example to dump the eth4 registers:
echo -e "dpdk dump reg-info 5 /sf/data/local/eth4_reg_info.txt" |cli

If it is not taken over, for example to dump the registers of kernel port eth8:
realethtool -d eth8 > eth8_reg_info.txt

Run the same command again after about 1 minute, saving the output to a different file so that the first dump is not overwritten.

Then open the two files and compare them (a tool such as Beyond Compare can help), mainly checking whether the RX values are increasing.
If they are not increasing, there is a problem with the registers, which is consistent with the problem in this case.
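
Where no graphical diff tool is available, the comparison of the two dumps can be approximated on the command line. This is a minimal sketch only: the function name check_rx_registers is hypothetical, and filtering on the string "rx" is an assumption about how RX registers are named in the dump, so adjust the pattern to the actual register names.

```shell
# Hypothetical helper: compare two register dump files and report the lines
# of the second dump that changed and mention "rx" (case-insensitive).
# Arguments: first dump file, second dump file
check_rx_registers() {
    # diff marks lines from the second file with a leading ">".
    if changed=$(diff "$1" "$2" | grep '^>' | grep -i 'rx'); then
        echo "RX registers changed between dumps:"
        echo "$changed"
    else
        echo "no RX register changed: consistent with this case"
    fi
}
```

For example, with the two dumps saved under the illustrative names eth8_reg_info_1.txt and eth8_reg_info_2.txt: check_rx_registers eth8_reg_info_1.txt eth8_reg_info_2.txt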

Root Cause

There is a problem with the NIC

Solution

In this case, the only fix is to restart the Node to reset the NIC registers.
(Remember to migrate the business before restarting the host.)

Operation Impact Scope

Restarting the Node affects the services on the current host

Is this a temporary solution?

no

Suggestions and Conclusion

Pay attention to the NIC packet statistics rx_packets and tx_packets

Original Link https://support.sangfor.com.cn/cases/list?product_id=33&type=1&category_id=27933&isOpen=true