Replicated Pod stuck in ContainerCreating state

Symptom

The Replicated Kubernetes install script can hang with a spinner at the Await Replicated Ready step while the Replicated pod stays in the ContainerCreating state. This may indicate that the pod's PersistentVolume cannot be mounted, which happens when the kubelet service has failed to detect that the FlexVolume plugins have been added to the /usr/libexec/kubernetes/kubelet-plugins/volume/exec directory.
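To check that the plugins are actually present on disk (so that the problem is detection rather than installation), a minimal sketch along these lines may help. The function name list_flexvolume_drivers is hypothetical, and the sketch assumes the conventional FlexVolume layout of one vendor~driver subdirectory per plugin.

```shell
# Hypothetical helper: list executables under the FlexVolume plugin directory.
# Assumes the conventional layout <plugin-dir>/<vendor~driver>/<driver>.
list_flexvolume_drivers() {
    dir="${1:-/usr/libexec/kubernetes/kubelet-plugins/volume/exec}"
    for f in "$dir"/*/*; do
        # Skip the unexpanded glob (empty directory) and non-executable files.
        if [ -f "$f" ] && [ -x "$f" ]; then
            echo "$f"
        fi
    done
}

# Usage on the affected node:
#   list_flexvolume_drivers
```

If this prints driver paths while the mount still fails, the plugins are installed and kubelet most likely has not picked them up.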

Fix

Run systemctl restart kubelet on the affected node. When kubelet restarts, it probes the volume plugin directory again and can then mount the PersistentVolume to the Replicated pod.
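As a sketch, the restart can be wrapped so that it also confirms the service came back up. The function name restart_kubelet is hypothetical, and the sketch assumes it is run as root on the affected node.

```shell
# Hypothetical helper: restart kubelet so it re-probes the volume plugin
# directory, then confirm the service is running again.
restart_kubelet() {
    systemctl restart kubelet || return 1
    # is-active prints the unit state and exits non-zero unless it is "active".
    systemctl is-active kubelet
}

# Usage (as root on the affected node):
#   restart_kubelet && kubectl get pods
```

After a successful restart, the Replicated pod should eventually leave the ContainerCreating state.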

Investigation

There are a couple of ways to confirm that the problem stems from kubelet’s dynamic volume plugin discovery mechanism.

  1. Run journalctl -u kubelet | grep desired_state_of_world_populator. You should see error logs containing the message Failed to add volume "replicated-persistent".

  2. Use kubectl to get the logs of the Rook agent pod running in the rook-ceph-system namespace. The last line of the logs should be agent-cluster: start watching cluster resources, indicating that the agent has never been called by the FlexVolume binary to mount a PersistentVolume to a Pod. Note that you will have to find the agent running on the node with the failed mount. During install there will be only one node and therefore only one agent.
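The two checks above can be sketched as small shell helpers. The function names are hypothetical, and the agent pod name has to be looked up first, since it differs per cluster.

```shell
# Hypothetical helper: print kubelet log lines that show the volume plugin
# lookup failing. Reads log text on stdin, so it works with journalctl
# output or a saved log file.
find_volume_plugin_errors() {
    grep 'desired_state_of_world_populator' | grep 'Failed to add volume'
}

# Hypothetical helper: print the last log line of a Rook agent pod.
# $1 is the agent pod name.
rook_agent_last_line() {
    kubectl -n rook-ceph-system logs "$1" --tail=1
}

# Usage on the affected node:
#   journalctl -u kubelet | find_volume_plugin_errors
#   kubectl -n rook-ceph-system get pods -o wide   # find the agent on this node
#   rook_agent_last_line <agent-pod-name>
```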

For searchability, here are example logs that demonstrate the symptoms.

From Kubelet:

Mar 14 00:00:39 ip-172-31-4-181 kubelet[6811]: E0314 00:00:39.176406    6811 desired_state_of_world_populator.go:309] Failed to add volume "xxx" (specName: "pvc-aaaa-3f8a-11e9-a2d7-0a5db2fb36a4") for pod "aaaa-45ec-11e9-a2d7-0a5db2fb36a4" to desiredStateOfWorld. err=failed to get Plugin from volumeSpec for volume "pvc-aaaa-3f8a-11e9-a2d7-0a5db2fb36a4" err=no volume plugin matched

From kubectl get events, or from scheduler/kubernetes/resources/events/resource.json in a support bundle:

timeout expired waiting for volumes to attach or mount for pod \"default\"/\"replicated-shared-fs-snapshotter-aaaaaaaa-dw5qx\". list of unmounted volumes=[rook-shared-fs]. list of unattached volumes=[replicated-sidecar-tls-vol rook-shared-fs default-token-9bbs]
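To search for this event on a live cluster or in a support bundle, a small filter such as the following could be used. The function name find_mount_timeouts is hypothetical; it only matches the message text shown above.

```shell
# Hypothetical helper: print lines containing the volume mount timeout
# message. Reads from stdin, so it works on kubectl output or a bundle file.
find_mount_timeouts() {
    grep -F 'timeout expired waiting for volumes to attach or mount'
}

# Usage:
#   kubectl get events | find_mount_timeouts
#   find_mount_timeouts < scheduler/kubernetes/resources/events/resource.json
```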