What sort of self-healing / auto-scaling is available to clusters deployed on-premise?
The Ceph cluster automatically rebalances storage when a node fails. It uses a data replication factor computed as min(3, cluster_size), so data is replicated across up to three nodes and never across more nodes than the cluster has.
When worker nodes are lost, Kubernetes will rebalance workloads. As long as sufficient compute resources remain available, the cluster can tolerate the loss of any number of worker nodes without downtime. If so many worker nodes are lost that some pods become unschedulable, those pods will be scheduled once the cluster operator boots additional nodes.
A 3-master cluster can tolerate the loss of a single master node without failure. A 5-master cluster can tolerate the loss of 2 master nodes without failure.
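The master-node figures above are consistent with majority quorum, which is how etcd (the datastore typically run on Kubernetes master nodes) stays available. The following is a minimal sketch of that arithmetic, assuming standard majority-quorum behavior; the function names are illustrative and are not part of any Replicated tooling.

```python
def quorum_size(masters: int) -> int:
    """Smallest number of masters that forms a majority."""
    if masters < 1:
        raise ValueError("a cluster needs at least one master")
    return masters // 2 + 1


def tolerated_master_failures(masters: int) -> int:
    """Number of masters that can be lost while a majority remains."""
    return masters - quorum_size(masters)


if __name__ == "__main__":
    # Print quorum size and failure tolerance for common cluster sizes.
    for n in (1, 3, 4, 5):
        print(f"masters={n} quorum={quorum_size(n)} "
              f"tolerated_failures={tolerated_master_failures(n)}")
```

Under this assumption, a 4-master cluster tolerates no more failures than a 3-master cluster, which is why odd master counts are the usual choice.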
At the moment, Replicated doesn’t communicate with any hypervisor to auto-replace lost nodes. Lost nodes must be replaced by the end customer's IT admin, or by an automated system that the admin configures. Because new nodes can be joined on boot via optional mounted config files, this process can be automated fairly easily.
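As a rough illustration of the detection half of such an automated system, the sketch below lists nodes whose Ready condition is not "True", using the JSON that `kubectl get nodes -o json` returns. It is an assumption about how an admin might build this, not something Replicated ships: the function names are hypothetical, and provisioning the replacement machine is left to whatever tooling the admin's environment provides.

```python
from __future__ import annotations

import json
import subprocess


def not_ready_nodes(node_list: dict) -> list[str]:
    """Return names of nodes whose Ready condition is missing or not "True"."""
    lost = []
    for node in node_list.get("items", []):
        conditions = node.get("status", {}).get("conditions", [])
        ready = next((c for c in conditions if c.get("type") == "Ready"), None)
        if ready is None or ready.get("status") != "True":
            lost.append(node["metadata"]["name"])
    return lost


def fetch_node_list() -> dict:
    """Read the current node list from the cluster via kubectl."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


# Demonstration on a hand-written sample shaped like the kubectl output,
# so the example runs without access to a cluster.
sample = {
    "items": [
        {"metadata": {"name": "worker-1"},
         "status": {"conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "worker-2"},
         "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]}},
    ]
}
print(not_ready_nodes(sample))
```

Against a live cluster, `not_ready_nodes(fetch_node_list())` could be run on a timer, with each returned name handed to the admin's own provisioning step.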