Worker node deployment fails and the cilium container fails to start
Problem Description
Cluster creation failed: the worker node deployment failed and the cilium container did not start normally.
Alert Information
The interface reports a deployment failure, and the cilium component is not running.
Effective Troubleshooting Steps
- When deployment fails, you can log into the node directly with the password k8sadmin to investigate. After deployment succeeds, the password is randomized.
- Run crictl ps on the node to check the container status, then use crictl logs -f to find out why cilium failed.
- The logs show that cilium failed to reach the apiserver address. This address is the first address in the service network segment, and iptables forwards requests for it to the IP address of the VIP node.
- Run iptables -t nat -S | grep -w ${cilium failure reported IP} (assuming the reported IP is 10.96.0.1) to find the relevant chain, then use grep on that chain to find the actual IP address it maps to (the original case includes a screenshot of this step).
- That IP turns out to be the business port IP of the master node. The master node was deployed successfully, so it should be providing services.
- Running curl https://${master-business port IP}:6443 -k shows that the address is not accessible, while curl https://${VIP}:6443 can reach the service.
- Run arping ${master-business port IP}. The MAC address in the reply does not match that of the master node, which likely indicates an IP conflict. Check whether this IP is already configured elsewhere in the environment, for example on routes, hosts, or elastic IPs.
- Finally, the customer inspected the environment and confirmed that two IPs were indeed already in use. The customer had previously mounted an elastic IP on the router and forgotten about it, then assigned that already-used elastic IP to a new host, which caused the unexpected worker node deployment failure.
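
The chain-following step with iptables and grep can also be scripted. The sketch below is a minimal Python illustration and not part of the original case: it parses text in the format printed by iptables -t nat -S, finds the rules whose destination is the service IP, and follows the jumps until it reaches DNAT rules. The function name dnat_targets, the chain names, and the sample addresses are made up for illustration, and the sketch assumes the rules follow the usual kube-proxy layout.

```python
import re
from collections import defaultdict, deque

# Illustrative sample only; chain names and addresses are hypothetical.
SAMPLE_RULES = """\
-A KUBE-SERVICES -d 10.96.0.1/32 -p tcp -m tcp --dport 443 -j KUBE-SVC-EXAMPLE
-A KUBE-SVC-EXAMPLE -j KUBE-SEP-EXAMPLE
-A KUBE-SEP-EXAMPLE -p tcp -m tcp -j DNAT --to-destination 192.0.2.10:6443
-A KUBE-SERVICES -d 10.96.0.10/32 -p udp -m udp --dport 53 -j KUBE-SVC-DNS
-A KUBE-SVC-DNS -j KUBE-SEP-DNS
-A KUBE-SEP-DNS -p udp -m udp -j DNAT --to-destination 10.244.0.5:53
"""


def dnat_targets(rules_text, service_ip):
    """Return the DNAT destinations reachable from rules that match service_ip."""
    # Group every "-A <chain> <rule>" line by its chain name.
    chains = defaultdict(list)
    for line in rules_text.splitlines():
        match = re.match(r"-A (\S+) (.*)", line.strip())
        if match:
            chains[match.group(1)].append(match.group(2))

    # Whole-address match, similar in spirit to grep -w.
    dest = re.compile(r"-d %s(?:/32)?(?:\s|$)" % re.escape(service_ip))
    queue = deque(r for rules in chains.values() for r in rules if dest.search(r))

    targets, seen = [], set()
    while queue:
        rule = queue.popleft()
        jump = re.search(r"(?:^|\s)-j (\S+)", rule)
        if not jump:
            continue
        name = jump.group(1)
        if name == "DNAT":
            to = re.search(r"--to-destination (\S+)", rule)
            if to and to.group(1) not in targets:
                targets.append(to.group(1))
        elif name in chains and name not in seen:
            # Follow the jump into a user-defined chain once.
            seen.add(name)
            queue.extend(chains[name])
    return targets


print(dnat_targets(SAMPLE_RULES, "10.96.0.1"))
```

On a real node the input text would come from running iptables -t nat -S as root instead of SAMPLE_RULES. The returned address is the one to test next with curl and arping.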

Root Cause
An IP conflict prevented cilium on the node from reaching the apiserver through the iptables DNAT mapping, so access to the service was rejected or timed out.
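
To tell a rejected connection from a timeout when testing an address on port 6443, a small TCP probe can help. The sketch below is an illustration under assumptions, not a tool from the original case: the function name probe_tcp and the default timeout are hypothetical, and it only checks the TCP handshake, not TLS or the apiserver response itself.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Classify a TCP connection attempt as 'open', 'refused', 'timeout' or 'error: ...'."""
    try:
        # create_connection performs the TCP handshake and raises on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Something answered with a reset: the host is there but nothing listens,
        # or a different host owns the address.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout.
        return "timeout"
    except OSError as exc:
        # Any other network error, for example "no route to host".
        return f"error: {exc}"


# Example usage (addresses are placeholders to fill in):
# probe_tcp("<master business port IP>", 6443)
# probe_tcp("<VIP>", 6443)
```

If the VIP probe reports open while the business port IP reports refused or timeout, the result is consistent with the curl behaviour seen in this case.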
- First verify that the apiserver service is really running. The node's /etc/hosts file usually directs access through the VIP. If access through the VIP works but cilium still has issues, continue with the checks below.
- Check whether ping gets through, and whether large packets can be sent.
- Check whether arping returns multiple MAC addresses, or a MAC address that does not match.
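
The arping check can be made less error-prone by extracting the MAC addresses from the output and comparing them with the expected one. The following Python sketch is a hedged illustration: the function names reply_macs and diagnose are hypothetical, the sample output uses made-up addresses, and it assumes that MAC addresses appear only in the reply lines of the arping output.

```python
import re

MAC_RE = re.compile(r"\b(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b")

# Illustrative sample only; addresses are hypothetical.
SAMPLE_OUTPUT = """\
ARPING 192.0.2.10 from 192.0.2.20 eth0
Unicast reply from 192.0.2.10 [02:00:00:AA:BB:01]  0.654ms
Unicast reply from 192.0.2.10 [02:00:00:AA:BB:02]  0.812ms
Received 2 response(s)
"""


def reply_macs(arping_output):
    """Return the distinct MAC addresses in the output, lowercased and sorted."""
    return sorted({mac.lower() for mac in MAC_RE.findall(arping_output)})


def diagnose(arping_output, expected_mac):
    """Compare the MACs that answered with the MAC the address should have."""
    macs = reply_macs(arping_output)
    if not macs:
        return "no reply"
    if len(macs) > 1:
        return "conflict: multiple MACs answer"
    if macs[0] != expected_mac.lower():
        return "conflict: unexpected MAC"
    return "ok"


print(diagnose(SAMPLE_OUTPUT, "02:00:00:aa:bb:01"))
```

The expected MAC can be read from the master node itself, for example from the output of ip link show for the business port interface.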
Solution
Refer to the troubleshooting steps above to locate and resolve the network issue.
Scope of Impact
NA
Is This a Temporary Solution
NA
Recommendations and Summary
NA
Troubleshooting Content
NA
Original Link
https://support.sangfor.com.cn/cases/list?product_id=37&type=1&category_id=29046&isOpen=true

