CAPMVM
Understanding common CAPMVM errors.
Both CAPMVM and CAPI logs can be found by querying the management cluster.
To see the CAPMVM controller logs, look for the pod called capmvm-controller-manager-XXXXX in the capmvm-system namespace. In those logs you will be able to see the controller reconcile MicrovmMachine types and connect to the given flintlock host(s) to create MicroVMs.
Various CAPI controllers are also running:
- The logs of capi-controller-manager-XXXX in capi-system will show you the overall orchestration of the workload cluster.
- The logs of capi-kubeadm-control-plane-controller-manager-XXXX in capi-kubeadm-control-plane-system will show the bootstrapping of the first created MicroVM as a control-plane node.
- The logs of capi-kubeadm-bootstrap-controller-manager-XXXX in capi-kubeadm-bootstrap-system will show the bootstrapping of all subsequent MicroVMs as worker nodes.
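As a convenience, the controller and namespace pairs above can be wrapped in a small shell helper. This is only a sketch: the function name controller_logs and its short aliases are made up for illustration, and it assumes each pod is owned by a Deployment named after the pod prefix (for example capmvm-controller-manager), which you should confirm with kubectl get deployments -A.

```shell
# Print logs from one of the controllers running in the management cluster.
# Usage: controller_logs <capmvm|capi|control-plane|bootstrap> [extra "kubectl logs" flags]
# controller_logs and its aliases are hypothetical names, not part of CAPMVM or CAPI.
controller_logs() {
  case "$1" in
    capmvm)
      ns=capmvm-system
      name=capmvm-controller-manager ;;
    capi)
      ns=capi-system
      name=capi-controller-manager ;;
    control-plane)
      ns=capi-kubeadm-control-plane-system
      name=capi-kubeadm-control-plane-controller-manager ;;
    bootstrap)
      ns=capi-kubeadm-bootstrap-system
      name=capi-kubeadm-bootstrap-controller-manager ;;
    *)
      echo "unknown controller: $1" >&2
      return 2 ;;
  esac
  shift
  # Assumption: the Deployment name matches the pod name prefix.
  kubectl logs -n "$ns" "deployment/$name" "$@"
}
```

For example, controller_logs capmvm --tail=100 prints the last 100 lines of the CAPMVM controller log, and controller_logs bootstrap -f follows the bootstrap controller.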
Failed to create client for Cluster default/lm-demo
failed to create client for Cluster default/lm-demo: Get \"https://192.168.10.25:6443/api?timeout=10s\
This error is expected for the first ~5mins of cluster creation. We are waiting for:
- CAPMVM to instruct flintlock to create a MicroVM
- The MicroVM to boot
- Images to download and the kubelet process to start
The time to do this will vary depending on numerous factors, including your network speeds, the backing storage of your boards and how much memory is available.
If you still see the error after 5 minutes, check whether the MicroVM booted correctly. See the troubleshooting page for MicroVMs. Common causes for failure are:
- There is not enough RAM allocated to the MicroVM
- The MicroVM cannot access a DHCP service or get an IP
- The MicroVM cannot resolve addresses, so cannot pull required k8s images
- The address chosen for the kube-vip is not free or accessible on that subnet
- Some other kubelet issue has occurred and the service has not been able to start
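Rather than re-reading the controller logs by hand, you could poll the API endpoint from the error message until it answers. The sketch below assumes curl is installed on the machine you run it from and that the machine can route to the workload cluster address; wait_for_apiserver is a hypothetical helper name, and the 300 second default simply mirrors the roughly 5 minute window described above.

```shell
# Poll the workload cluster API endpoint until it answers or the timeout expires.
# Usage: wait_for_apiserver <url> [timeout_seconds] [interval_seconds]
# wait_for_apiserver is a hypothetical helper, not part of CAPMVM.
wait_for_apiserver() {
  url=$1
  timeout=${2:-300}
  interval=${3:-10}
  elapsed=0
  while :; do
    # -k skips certificate verification because this host may not trust the
    # cluster CA. Any HTTP response, even 401 or 403, means the API server is up.
    if curl -k -s -o /dev/null --max-time 10 "$url"; then
      echo "reachable after ${elapsed}s"
      return 0
    fi
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "still unreachable after ${elapsed}s" >&2
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}
```

For example, wait_for_apiserver https://192.168.10.25:6443/api uses the address from the error above. A non-zero exit status suggests it is time to look at the MicroVM itself.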
Error while dialing dial tcp
failed getting microvm: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 192.168.1.216:9090: i/o timeout\
This error means your flintlockd service is no longer accessible at that address.
Check that flintlock is still running on that board with systemctl status flintlockd.service. Refer to the flintlock troubleshooting if it is not.
If your flintlockd service is running fine, check that the address is correct, or that it is accessible on that particular interface.
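One way to test reachability from another machine is a plain TCP probe against the address in the error. This is a minimal sketch that assumes a netcat build supporting the -z and -w flags is installed; check_flintlock is a hypothetical helper name, and the default port 9090 is simply the one shown in the example error above.

```shell
# Probe the flintlockd address with a TCP connect, without sending any data.
# Usage: check_flintlock <host> [port]
# check_flintlock is a hypothetical helper, not part of flintlock.
check_flintlock() {
  host=$1
  port=${2:-9090}   # port taken from the example error message
  # -z: only test that the port accepts connections; -w 5: give up after 5 seconds.
  if nc -z -w 5 "$host" "$port"; then
    echo "flintlockd reachable at ${host}:${port}"
  else
    echo "flintlockd NOT reachable at ${host}:${port}" >&2
    return 1
  fi
}
```

For example, check_flintlock 192.168.1.216 probes the host from the error above. If the probe fails while the service is running on the board, the service may be bound to a different interface, or a firewall may be in the way.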
Deleting a cluster hangs forever
Sometimes trying to delete a cluster at the wrong moment of creation (or when something went wrong with that creation to begin with) can cause a delete to hang forever.
Check the CAPMVM logs on your management cluster to see if something is obviously failing.
The following error is expected and is not an issue for a delete:
failed to create client for Cluster <foo>/<bar>: Get \"https://192.168.10.25:6443/api?timeout=10s\
If CAPMVM is trying but unable to connect to a flintlock service, check that it is running.
To force the CAPI cluster deletion, you can edit various objects in order.
Start with the MicrovmMachine and remove the finalizer:
finalizers:
- microvmmachine.infrastructure.cluster.x-k8s.io
That should do the trick, but if not, proceed to do the same to the Machine:
finalizers:
- machine.cluster.x-k8s.io
And again to the Cluster:
finalizers:
- cluster.cluster.x-k8s.io
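If you would rather not open each object in an editor, the same result can be sketched with kubectl patch. Note the difference from the manual edit: the merge patch below clears every finalizer on the object, not only the one listed above, so it is a blunter tool. remove_finalizers is a hypothetical helper name, and the resource kind, object name and namespace are whatever kubectl get reports in your management cluster.

```shell
# Clear ALL finalizers on a single object so that a pending delete can complete.
# Usage: remove_finalizers <kind> <name> [namespace]
# remove_finalizers is a hypothetical helper, not part of CAPI or CAPMVM.
remove_finalizers() {
  kind=$1
  name=$2
  namespace=${3:-default}
  # A JSON merge patch that sets the finalizers list to null removes the field.
  kubectl patch "$kind" "$name" -n "$namespace" --type merge \
    -p '{"metadata":{"finalizers":null}}'
}
```

Following the order described above, you would call it for the MicrovmMachine first, then the Machine, then the Cluster, for example remove_finalizers cluster lm-demo default, checking after each step whether the delete has gone through.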
If there are any MicroVMs still running on the device, you can clean them up with hammertime, or kill the firecracker processes and instruct containerd to remove the lease and clear the content store.
sudo killall firecracker
ctr -n flintlock -a /var/run/containerd-dev/containerd.sock leases ls
ctr -n flintlock -a /var/run/containerd-dev/containerd.sock leases del <id>