AKA architecture



This is a deep dive into the Replicated Airgapped Kubernetes Appliance (AKA) architecture, and a reference for those installing and supporting these installations.

Installation

The kubernetes-init script brings up a single-node Kubernetes cluster running Replicated in a Deployment. Most of the cluster components are brought up with kubeadm. The kubernetes-init script gathers configuration for kubeadm, runs kubeadm init, adds on Weave, adds on Rook, and then installs Replicated.

Preparation for kubeadm init

These steps are performed by the Replicated AKA installer before kubeadm init is invoked:

  1. Install Docker on the host. For Ubuntu this will be 1.12.3 from the docker-engine repo, and for RHEL it will be 1.13.1 from the yum docker repo. Also check that Docker is not configured to use devicemapper in loopback mode. Overlay2 is the preferred storage driver where supported. If devicemapper has to be used because of the kernel version, it must have a thinpool provisioned on a block device. See the Devicemapper Installation Warning.

  2. Install kubelet, kubectl, and kubeadm on the host. These come from deb/rpm packages, which are bundled into a docker image and loaded on the customer’s machine.

  3. Generate /opt/replicated/kubeadm.conf from the flags and prompts in the kubernetes-init script.

  4. Disable SELinux. It is currently not possible to run the cluster with SELinux enforcing. Kubeadm is expected to be able to bring up a cluster that works with SELinux enforcing in the 1.14 release. See https://github.com/kubernetes/kubeadm/issues/279
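The storage driver check from step 1 can be sketched as a small shell helper that reads docker info output. This is a minimal sketch, not the installer's actual logic: the function name check_storage_driver is hypothetical, and it assumes that docker info prints a "Storage Driver:" line and, for devicemapper in loopback mode, a "Data loop file:" line.

```shell
# Hypothetical helper (not part of the Replicated installer).
# Reads `docker info` text on stdin and prints a one-line verdict.
check_storage_driver() {
    awk '
        /^ *Storage Driver:/ { driver = $3 }
        /^ *Data loop file:/ { loopback = 1 }
        END {
            if (driver == "overlay2")
                print "ok: overlay2"
            else if (driver == "devicemapper" && loopback)
                print "fail: devicemapper in loopback mode"
            else if (driver == "devicemapper")
                print "warn: devicemapper, verify the thinpool uses a block device"
            else
                print "warn: unexpected driver " driver
        }'
}
```

On a host with Docker installed, it could be run as: docker info 2>/dev/null | check_storage_driver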

kubeadm init

The kubeadm config file we created at /opt/replicated/kubeadm.conf is expanded with defaults. The full config can be viewed with kubeadm config view. This is used to configure the following components required for a Kubernetes cluster:

Kubelet

The kubelet section of kubeadm config is used to create the file /var/lib/kubelet/config.yaml and then the kubelet systemd service is started.

Static pods in the control plane (master only)

The kubeadm config is used to customize the command flags passed to four static pods that make up the control plane. The yaml config for these pods is found in /etc/kubernetes/manifests. Kubelet will run anything in this directory as a static pod.

  1. Kube-controller-manager
  2. Kube-apiserver
  3. Kube-scheduler
  4. etcd

These static pods run in the kube-system namespace, so once the cluster is running you can view these pods with kubectl -n kube-system get pods and get logs from them in the normal way.

Non-static System Pods

These components are also deployed to the kube-system namespace, but not as static pods. They can be scheduled on worker nodes and can be edited with kubectl. The Kubernetes cluster would still be able to run pods without these services, but DNS and service networking would not work.

  1. CoreDNS (Deployment)
  2. Kube-proxy (DaemonSet)

Networking

Pod Networking

After kubeadm init has completed, Replicated deploys Weave as the CNI plugin for Kubernetes. Weave is deployed as a DaemonSet in the kube-system namespace. The pod started on each node copies the weave-ipam and weave-net binaries to the /opt/cni/bin directory, where kubelet calls them directly when creating pod sandboxes.

Weave is responsible for implementing the Kubernetes networking model. It assigns an IP address to every Pod, and ensures IP packets can be routed between pods and between nodes and pods.

IPAM: Weave assigns IPs to pods from the subnet 10.32.0.0/12 unless another subnet was passed to the ip-alloc-range flag of the kubernetes-init script. Weave sets up a routing rule on every host so that all traffic addressed to an IP in the 10.32.0.0/12 subnet is routed to the weave interface. The weave interface is a bridge and can be viewed with ip -d link show weave. Every pod on a host has a virtual ethernet interface pair whose host end is attached to the weave bridge. For clustered installs with multiple nodes, a VTEP (VXLAN tunnel endpoint) for each remote node is also attached to the weave bridge. Traffic destined for a Pod IP on a remote node is forwarded to the remote weave bridge through the appropriate VTEP and then delivered locally.
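When debugging routing, it can help to confirm whether an address falls inside the pod subnet. The following is a minimal sketch in POSIX shell; the helper names ip_to_int and in_cidr are hypothetical, and the default range 10.32.0.0/12 should be replaced if ip-alloc-range was set at install time.

```shell
# Hypothetical helpers for checking subnet membership.
# Both run in a subshell "( ... )" so IFS and variables do not leak.

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() (
    IFS=.
    set -- $1
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# in_cidr IP CIDR: exit 0 when IP is inside CIDR, non-zero otherwise.
in_cidr() (
    ip=$(ip_to_int "$1")
    net=$(ip_to_int "${2%/*}")
    bits=${2#*/}
    mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
    [ $(( ip & mask )) -eq $(( net & mask )) ]
)
```

For example, in_cidr 10.40.0.7 10.32.0.0/12 succeeds, while in_cidr 10.96.0.10 10.32.0.0/12 fails because that address lies outside the pod range.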

Troubleshooting Weave

Service Networking

Most cluster traffic is addressed to a service IP rather than a Pod IP. Kube-proxy is responsible for ensuring that traffic addressed to a service IP gets routed to a Pod IP. A service is essentially an in-cluster load balancer routing traffic to multiple upstreams. You can see the backends available for every service by running kubectl get endpoints <service>.

Troubleshooting Services

Cluster DNS

CoreDNS allows in-cluster clients to address services by hostname rather than by IP. Every pod gets a simple /etc/resolv.conf with a single nameserver, 10.96.0.10. This is the service IP of the Kubernetes DNS service, which for legacy reasons is still named kube-dns. It resides in the kube-system namespace along with the CoreDNS deployment. The CoreDNS pods have an /etc/resolv.conf created from the host's. If a request does not match any cluster service, it is forwarded to the same nameservers that serve the host. Note that only the first 2 nameservers and the first 3 search records from the host's /etc/resolv.conf will be used.
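To preview which host entries survive the limit described above, a resolv.conf file can be filtered as follows. This is only an illustration of the stated limit (first 2 nameservers, first 3 search records), not the code path kubelet or CoreDNS uses; the function name effective_resolv_conf is hypothetical, and it assumes that when several search lines exist only the last one counts.

```shell
# Hypothetical helper: print the nameserver and search entries that
# would remain after applying the limits described above.
# Usage: effective_resolv_conf [FILE]   (FILE defaults to /etc/resolv.conf)
effective_resolv_conf() {
    awk '
        # Keep only the first two nameserver lines.
        $1 == "nameserver" && ns < 2 { print "nameserver " $2; ns++ }
        # Keep only the first three domains of the search line.
        $1 == "search" {
            line = "search"
            for (i = 2; i <= NF && i <= 4; i++) line = line " " $i
            search = line
        }
        END { if (search != "") print search }
    ' "${1:-/etc/resolv.conf}"
}
```

Running effective_resolv_conf on the host and comparing the result with /etc/resolv.conf shows which nameservers and search domains would be dropped.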

Troubleshooting DNS

Firewalls

Storage

Storage Checklist

  • At least 1 GB of disk space at /var/lib/etcd
  • At least 40 GB of disk space at /opt/replicated
  • At least 10 GB of disk space at /var/lib/docker
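The checklist above can be verified with a short shell helper. This is a sketch under stated assumptions: check_space is a hypothetical name, it compares the space currently available on the filesystem that holds each path (as reported by df), and it skips paths that do not exist yet.

```shell
# Hypothetical helper: check_space PATH MIN_GB
# Compares the available space on the filesystem holding PATH with MIN_GB.
check_space() {
    avail_kb=$(df -Pk "$1" 2>/dev/null | awk 'NR == 2 { print $4 }')
    if [ -z "$avail_kb" ]; then
        echo "SKIP $1: path not found"
        return 0
    fi
    avail_gb=$(( avail_kb / 1024 / 1024 ))
    if [ "$avail_kb" -ge $(( $2 * 1024 * 1024 )) ]; then
        echo "OK   $1: ${avail_gb} GB available, $2 GB required"
    else
        echo "FAIL $1: ${avail_gb} GB available, $2 GB required"
        return 1
    fi
}

# Example invocations matching the checklist:
#   check_space /var/lib/etcd 1
#   check_space /opt/replicated 40
#   check_space /var/lib/docker 10
```

If several of these paths share one filesystem, each check reports the same available space, so the minimums need to be added together when judging whether the disk is large enough.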

Ceph

On every host, Rook uses the directory /opt/replicated/rook as the backing storage for provisioning PersistentVolumes. An OSD is created for every node to manage this directory. The OSDs can be viewed in the rook-ceph namespace. Additionally, three MONs are created in the same namespace to supervise the cluster, and a single MGR is created there to publish the Ceph dashboard.

You can manually configure the Ceph cluster and pool by using kubectl -n rook-ceph edit cluster rook-ceph and kubectl -n rook-ceph edit pool replicapool.

Rook

The Rook Operator and agents will be created in the rook-ceph-system namespace. The Rook Agent is a DaemonSet. When each pod starts on a new node, it will copy its FlexVolume plugin binary to /usr/libexec/kubernetes/kubelet-plugins/volume/exec/ceph.rook.io~rook-ceph-system. The plugin will be called by kubelet when creating Pod sandboxes and will send a request to the kernel to create a new block device backed by Ceph. These block devices can be viewed with lsblk and will be named rbd0, rbd1, etc.

The Ceph dashboard provides information on cluster health and the status of OSDs and MONs. It can be found in the OnPrem Console under the /ceph path.

Troubleshooting Rook

General Troubleshooting

Check disk space

Kubelet will begin pruning unused images when the system disk usage hits 80%, and will kill running containers at 85%. If either threshold is met, the system may become unrecoverable.

df -h
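To compare the current usage with the thresholds described above, the df output can be reduced to a single verdict. This is a minimal sketch; usage_percent and disk_pressure_level are hypothetical helper names, and the 80% and 85% values are taken from the text above rather than read from the kubelet configuration.

```shell
# Hypothetical helpers for interpreting disk usage.

# Print the usage percentage (without the % sign) of the filesystem
# that holds PATH (defaults to /).
usage_percent() {
    df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Map a usage percentage to the thresholds described above.
disk_pressure_level() {
    if [ "$1" -ge 85 ]; then
        echo "critical: kubelet may kill running containers"
    elif [ "$1" -ge 80 ]; then
        echo "warning: kubelet starts pruning unused images"
    else
        echo "ok"
    fi
}
```

A typical invocation would be: disk_pressure_level "$(usage_percent /var/lib/docker)"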

Check Node Status

kubectl describe node <node-name>

Look for any pressure conditions that report True (e.g. MemoryPressure, DiskPressure). The Ready condition, by contrast, should be True on a healthy node.
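The same check can be scripted by printing each condition as Type=Status and filtering for unhealthy values. This is a sketch: failing_conditions is a hypothetical helper, and it assumes that Ready is the only condition that should be True on a healthy node.

```shell
# Hypothetical helper: reads "Type=Status" lines on stdin, prints the
# unhealthy ones, and exits non-zero if any were found.
failing_conditions() {
    awk -F= '
        NF == 0 { next }
        $1 == "Ready" && $2 != "True"  { print; bad = 1 }
        $1 != "Ready" && $2 != "False" { print; bad = 1 }
        END { exit bad }
    '
}

# Example (requires cluster access; <node-name> is a placeholder):
#   kubectl get node <node-name> \
#     -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' \
#     | failing_conditions
```

No output and a zero exit status mean every condition looked healthy.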

Check whether all pods are running

kubectl get pods --all-namespaces 

Check just the application pods

kubectl get pods --namespace=replicated-<appid>

kubectl describe pod <pod-name> --namespace=replicated-<appid>

Check Docker logs

journalctl -u docker

Check if kubelet is up

sudo systemctl status kubelet

Check kubelet logs

Kubelet runs on every Kubernetes node. If there are errors here, they’re likely to prevent container deployments and/or communication.

journalctl -u kubelet

Get Replicated logs (regular and UI)

kubectl logs -l tier=master -c replicated
kubectl logs -l tier=master -c replicated-ui

Run a command in a weave container to check status

First find the weave pod name running on the node in question:

kubectl -n kube-system get pods -o wide

Then exec into that pod to run status commands:

kubectl -n kube-system exec -it <weave-pod-name> -c weave -- /bin/sh
# ./weave --local status
# ./weave --local status connections