After deploying some of the largest IoT systems in North America over the past few years, Leverege learned some lessons with Kubernetes the hard way. Thanks to the huge community Kubernetes has garnered, you can now easily find lots of tips online. However, we recommend that you start simple. Before messing around with complex namespace configurations, Spinnaker/Istio, and pod policies, start with the default namespace and learn the basics before you secure your cluster for production.
Google offers preemptible VMs at a heavy discount (~5x cheaper); in exchange, they come with no availability guarantees and last for 24 hours at most. Although preemptibles were initially marketed for short-term batch jobs, with multiple node pools running on GKE you can run production loads at scale on a mix of preemptibles and reserved instances. Using node pools, we can leverage the power of Kubernetes to schedule pods based on need.
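As a sketch of the node-pool setup described above (the cluster and pool names here are placeholders, and the machine types and node counts are illustrative, not recommendations):

```shell
# A discounted pool of preemptible VMs for interruptible, stateless workloads:
gcloud container node-pools create preemptible-pool \
  --cluster=prod-cluster \
  --preemptible \
  --machine-type=n1-standard-4 \
  --enable-autoscaling --min-nodes=0 --max-nodes=10

# A smaller pool of standard (reserved) VMs as a stable fallback:
gcloud container node-pools create reserved-pool \
  --cluster=prod-cluster \
  --machine-type=n1-standard-4 \
  --num-nodes=2
```

With both pools in place, the scheduler can be steered toward the cheap pool by default and onto the reserved pool when preemptibles disappear.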
As long as stateless applications exit gracefully when a preemptible VM sends a SIGTERM signal, we can let Kubernetes reschedule their pods onto another available preemptible VM. When no other preemptible VM is available (either during an auto-scale event or after the termination signal was given to a prior node), these workloads can be scheduled onto a reserved instance instead. For critical workloads or StatefulSets, we can define node affinity values so they are always scheduled on reserved instances. Just by setting node affinity configs, you can start cutting the cost of your clusters.
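A minimal sketch of the affinity config described above. GKE labels preemptible nodes with `cloud.google.com/gke-preemptible=true`; the Deployment name and image are hypothetical placeholders:

```yaml
# Prefer preemptible nodes for a stateless Deployment; Kubernetes falls
# back to reserved nodes automatically when no preemptible node fits.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stateless-api          # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: stateless-api
  template:
    metadata:
      labels:
        app: stateless-api
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: cloud.google.com/gke-preemptible
                operator: In
                values: ["true"]
      # Preemptibles get roughly 30 seconds of notice, so keep the
      # grace period inside that budget:
      terminationGracePeriodSeconds: 25
      containers:
      - name: api
        image: gcr.io/my-project/stateless-api:latest   # placeholder image
```

For the critical or stateful workloads mentioned above, flip the logic: use `requiredDuringSchedulingIgnoredDuringExecution` with `operator: DoesNotExist` on the same label key so those pods can never land on a preemptible node.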
Whether you are using Helm or kubectl commands directly, it is easy to forget your current context and accidentally push the wrong environment to the wrong cluster. Safeguard yourself from these accidents by always specifying the context/cluster to which you want to push changes. You can also use command line helpers like this one once you have more than a few clusters.
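For example, the context can be made explicit on every command rather than relying on whatever was last selected (the context and release names below are placeholders):

```shell
# See which contexts kubectl knows about, and which one is active:
kubectl config get-contexts

# Name the target explicitly instead of trusting the default context:
kubectl --context=staging-cluster apply -f deployment.yaml

# The same idea with Helm, via its --kube-context flag:
helm upgrade --install my-release ./chart --kube-context=staging-cluster
```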
One of the downsides to Kubernetes, or perhaps to microservice architectures more generally, is how quickly complexity can grow. You'll need a good plan to manage the complexity each new layer introduces, or monitoring of your cluster, nodes, pods, and applications will suffer.
Start out with consistent logging practices. Either pass context between services or use a tracing system for better observability. This will be extremely helpful when a cluster enters an unknown state, since you'll be able to track down the message flow much more easily.
There are various tools to help tame Kubernetes. At Leverege, we use Helm as a templating tool to package Kubernetes manifests as charts. Until you have a CD system set up, Helm can help with rolling updates and easy rollbacks. If you don’t have a monitoring system set up, become familiar with the tools Google provides by default. Familiarize yourself with cloud shell and kubectl commands to quickly diagnose and modify your manifests.
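The Helm and kubectl workflow above might look like the following (release, chart, and pod names are placeholders):

```shell
# Deploy or update a chart; --install makes the command idempotent,
# so the same line works for first deploys and rolling updates:
helm upgrade --install my-release ./my-chart

# Inspect past releases and roll back if an update misbehaves:
helm history my-release
helm rollback my-release 1

# Quick diagnostics with kubectl:
kubectl get pods -o wide
kubectl describe pod my-pod
kubectl logs my-pod --previous   # logs from the previous (crashed) container
```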
When creating your Deployment or Ingress templates on Kubernetes, it's easy to forget to reserve static IPs for your public endpoints. The best practice is to avoid Services of type LoadBalancer in favor of a small number of Ingress endpoints, but either way, if IPs are not reserved and explicitly specified, Kubernetes may reassign them to a new endpoint you did not expect.
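A sketch of pinning an Ingress to a reserved address on GKE. The address and service names are placeholders, and the address must be reserved first with `gcloud compute addresses create web-static-ip --global`:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
  annotations:
    # GKE-specific annotation that binds this Ingress to the
    # previously reserved global static address:
    kubernetes.io/ingress.global-static-ip-name: web-static-ip
spec:
  defaultBackend:
    service:
      name: web-service        # placeholder Service
      port:
        number: 80
```

Without the annotation, deleting and recreating the Ingress can hand your traffic a brand-new ephemeral IP, breaking any DNS records pointing at the old one.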
One of the nice features of GKE is auto-upgrades. If you're running a regional deployment, you can upgrade your masters without downtime. However, always set a maintenance window so you can monitor changes in production. An upgrade might have hidden, unintended consequences that break your key management, logging, or authentication schemes.
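Setting the window is a one-liner (the cluster name is a placeholder, and the window start time is in UTC):

```shell
# Constrain automated maintenance, including auto-upgrades, to a
# daily 4-hour window starting at 03:00 UTC:
gcloud container clusters update prod-cluster \
  --maintenance-window=03:00
```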