Node deployment failed: containerd is not started on the node and its status is disabled
Problem Description
Node deployment fails. containerd is not running on the node, and its service status is disabled (it is not started automatically).
The log /sf/log/today/ske-vm-disk-handler.sh.log shows that the partition is already mounted and that formatting failed.
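The containerd symptom described above can be confirmed with a small check. The following is a minimal sketch, assuming the node uses systemd and that the service unit is named containerd (the unit name and the helper names are assumptions, not taken from the product). It only wraps the standard systemctl is-enabled and systemctl is-active queries.

```python
import subprocess


def unit_state(unit, query, run=subprocess.run):
    """Return the text printed by `systemctl <query> <unit>`.

    `query` is "is-enabled" or "is-active". systemctl exits non-zero for
    states such as "disabled" or "inactive", so check=False is used.
    """
    result = run(["systemctl", query, unit],
                 capture_output=True, text=True, check=False)
    return result.stdout.strip()


def containerd_matches_symptom(run=subprocess.run, unit="containerd"):
    """True when the unit is disabled and not running (assumed unit name)."""
    enabled = unit_state(unit, "is-enabled", run)
    active = unit_state(unit, "is-active", run)
    return enabled == "disabled" and active != "active"
```

Calling containerd_matches_symptom() on the affected node should return True if the node shows the state described in this case. The run parameter exists only so the check can be exercised without a real systemd.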
Alert Information
Formatting failed, unable to proceed with subsequent initialization processes.
Note: The log under /sf/log/today/ may not correspond to the actual current date, because the today soft link is switched by nglogman. When containerd fails to start, nglogman and other components cannot start either, so log rotation and soft link switching are not performed.
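Because the today link may be stale, it can help to see which directory it actually points to before reading the log. A minimal sketch, assuming /sf/log/today is an ordinary symbolic link (the function name is hypothetical, and no assumption is made about how the target directory is named):

```python
import os


def resolve_log_link(link_path="/sf/log/today"):
    """Return (is_symlink, real_path) for the log soft link.

    real_path is the fully resolved target, so it can be compared by eye
    with the current date on the node.
    """
    return os.path.islink(link_path), os.path.realpath(link_path)
```

For example, print(resolve_log_link()) on the node shows the directory the log was really written to.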
Effective Troubleshooting Steps
When the log reports that the partition is already mounted and that formatting failed, check whether the /boot/flag_init_internal_data_disk_success flag exists.
If this disk initialization flag is absent, the /sf/cfg/ske-vm-disk-handler/ing/create-vdb-partition-table-succeed flag exists, and the /sf/cfg/ske-vm-disk-handler/ing/create-vdb1-succeed flag does not exist, the re-entry (the re-execution of disk handling after an interrupted run) has probably failed.
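The flag checks above can be expressed as a short script. This is a minimal sketch: the three flag paths come from this article, while the function name, the root parameter, and the returned labels are hypothetical and used only for illustration.

```python
import os

# Flag paths from this article, written relative to the filesystem root.
INIT_DONE_FLAG = "boot/flag_init_internal_data_disk_success"
PARTITION_TABLE_FLAG = "sf/cfg/ske-vm-disk-handler/ing/create-vdb-partition-table-succeed"
VDB1_FLAG = "sf/cfg/ske-vm-disk-handler/ing/create-vdb1-succeed"


def diagnose_disk_flags(root="/"):
    """Classify the node state from the three flag files.

    Returns one of the illustrative labels:
      "initialized"               - the disk initialization flag exists
      "reentry-failure-suspected" - partition table flag exists, vdb1 flag missing
      "inconclusive"              - any other combination
    """
    def exists(relative_path):
        return os.path.exists(os.path.join(root, relative_path))

    if exists(INIT_DONE_FLAG):
        return "initialized"
    if exists(PARTITION_TABLE_FLAG) and not exists(VDB1_FLAG):
        return "reentry-failure-suspected"
    return "inconclusive"
```

Running print(diagnose_disk_flags()) on the node gives a quick answer. The root parameter allows the same check against a mounted copy of the node's filesystem.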
Root Cause
During deployment, a restart occurred before the ske-vm-disk process had completed. The disk had already been partitioned, formatted, and mounted, but containerd had not yet been started or enabled. As a result, containerd did not start on subsequent reboots, and the remaining initialization never ran.
On the next startup, because the /sf/cfg/ske-vm-disk-handler/ing/create-vdb1-succeed flag does not exist, the process runs again. Formatting then fails because the partition is already mounted, and the process never reaches the step that starts containerd.
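The "partition already mounted" condition can be verified directly from the kernel's mount table. A minimal sketch, assuming the data partition is /dev/vdb1 (inferred from the create-vdb1-succeed flag name, not confirmed by this article) and that the node exposes /proc/mounts; the function names are hypothetical.

```python
def find_mount_point(device, mounts_text):
    """Return the mount point of `device` in /proc/mounts-style text, or None."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # Each line is: device mount_point fs_type options dump pass
        if len(fields) >= 2 and fields[0] == device:
            return fields[1]
    return None


def read_mounts(path="/proc/mounts"):
    """Read the current mount table from the kernel."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()
```

On the node, find_mount_point("/dev/vdb1", read_mounts()) returning a path rather than None would be consistent with the root cause described above.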
Solution
Since the node has not been initialized successfully, handle this as a deployment failure. Scale down the node and recreate it to quickly resolve the issue. Scaling down a failed deployment node will not affect the operation of the existing cluster.
Scope of Operation Impact
None
Is this a Temporary Solution?
Yes
This issue is addressed in version 1.1, but not in versions 1.0 and 2.0.
Recommendations and Summary
Deployment failures can have many root causes. If the /sf/log/today/ske-vm-disk-handler.sh.log log repeatedly reports that the partition is already mounted and that formatting failed, the failure can be attributed to this issue.
Troubleshooting Content
http://docs.sangfor.org/pages/viewpage.action?pageId=386150116
Original Link
https://support.sangfor.com.cn/cases/list?product_id=37&type=1&category_id=29262&isOpen=true

