The most rudimentary level of monitoring is administered via native Kubernetes features such as probes, cAdvisor, Heapster, and kube-state-metrics. A readiness probe checks whether a container is ready to accept traffic before its pod is pushed live. A liveness probe periodically checks the container (by running a defined command, making an HTTP request, or opening a TCP connection) to ensure that it is running as intended. A failed liveness probe makes the kubelet restart the container in an attempt to escape the error state, whereas a failed readiness probe only stops traffic from being routed to the pod until the probe passes again.
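To make the two probe types concrete, here is a minimal sketch of the application side, assuming the probes are configured as HTTP checks. The paths `/healthz` and `/ready`, the port, and the `STATE` flag are hypothetical choices for illustration, not something Kubernetes mandates.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical application state; a real service would flip this flag once
# its start-up work (loading config, opening connections) has finished.
STATE = {"ready": False}


def probe_response(path: str, ready: bool) -> tuple[int, str]:
    """Return (HTTP status, body) for a probe request.

    /healthz -> liveness: 200 as long as the process can answer at all.
    /ready   -> readiness: 200 only once the app can take traffic.
    """
    if path == "/healthz":
        return 200, "ok"
    if path == "/ready":
        return (200, "ready") if ready else (503, "not ready")
    return 404, "not found"


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        status, body = probe_response(self.path, STATE["ready"])
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Block forever, answering probe requests on the given port."""
    HTTPServer(("0.0.0.0", port), ProbeHandler).serve_forever()
```

Calling `serve()` starts the server. Until `STATE["ready"]` is set to `True`, `/ready` answers 503, so a readiness probe pointed at it would keep the pod out of rotation while `/healthz` already reports the process as alive.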
cAdvisor and Heapster collect container resource usage. cAdvisor runs on each node, oversees all the containers on that node, and collects overall machine resource utilization (CPU, memory, filesystem, and network usage). Heapster aggregates this data across all the nodes in a cluster to provide an overview of the cluster’s health. Lastly, kube-state-metrics listens to the Kubernetes API and exports the state of the cluster's objects as metrics, such as the number of replicas the cluster has scheduled vs. currently available, the number of pods running vs. stopped, and the number of pod restarts.
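As an illustration of the "scheduled vs. currently available" comparison, the sketch below parses a few lines in the text format that kube-state-metrics exposes and reports how many replicas each deployment is missing. The sample values and the deployment names `web` and `worker` are made up, and the parser only handles the simple labeled lines shown here.

```python
import re

# Made-up sample of kube-state-metrics output for two deployments.
SAMPLE = '''\
kube_deployment_spec_replicas{namespace="default",deployment="web"} 3
kube_deployment_status_replicas_available{namespace="default",deployment="web"} 2
kube_deployment_spec_replicas{namespace="default",deployment="worker"} 2
kube_deployment_status_replicas_available{namespace="default",deployment="worker"} 2
'''

LINE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def unavailable_replicas(text: str) -> dict[str, int]:
    """Map "namespace/deployment" to desired minus available replicas."""
    desired: dict[str, float] = {}
    available: dict[str, float] = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip comments, blank lines, and unlabeled samples
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels))
        key = f'{labels.get("namespace", "")}/{labels.get("deployment", "")}'
        if name == "kube_deployment_spec_replicas":
            desired[key] = float(value)
        elif name == "kube_deployment_status_replicas_available":
            available[key] = float(value)
    # A deployment with no "available" sample counts as fully unavailable.
    return {k: int(desired[k] - available.get(k, 0.0)) for k in desired}
```

With the sample above, `unavailable_replicas(SAMPLE)` reports one missing replica for `default/web` and none for `default/worker`.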
While these native tools provide a basic overview of the cluster's health, they neither store this data nor help visualize it in an easily consumable manner. Additionally, they cannot answer application-level questions (e.g. how fast is my Elasticsearch query, how many times did my API trigger an error, how many messages are being processed). To address this need, you will need more powerful tools.
If you're using Google Kubernetes Engine (GKE), the event exporter for Stackdriver Monitoring is enabled by default if cloud logging is enabled. For instructions on deploying to existing clusters, please see the official documentation.
On Stackdriver, you can monitor several things:
Building a dashboard using the above metrics is simple and straightforward, but making adjustments to the default settings to extract specific information is not well supported. If you're looking for a more robust monitoring solution, we recommend a combination of Prometheus and Grafana. It seems like others agree as well.
If you would like to use a combination of Stackdriver and Prometheus, there are tools to export data from each. We will cover exporting Stackdriver metrics to monitor GCP components with Prometheus below. One reason for doing this may be to monitor GCP-related metrics, such as the number of PubSub failed deliveries, BigQuery usage information, and Firebase status.
The following guide is heavily inspired by Sergey Nuzhdin's post on LWOLFS BLOG. He does a fantastic job of laying out how to deploy Prometheus and Grafana to Kubernetes using Helm charts. However, at the time of writing this, Nuzhdin does not cover deploying the newer version of Prometheus (2.1) and configuring alerts. To fill this knowledge gap, we created a Prometheus + Grafana Setup Guide that goes through the setup process step-by-step.
Prometheus is an open-source, time-series monitoring tool that was originally built at SoundCloud and is now a project of the Cloud Native Computing Foundation (CNCF), the foundation behind Kubernetes. It’s a flexible system that can collect metrics, run complex queries, display graphs, and trigger alerts based on custom rules. A default deployment of Prometheus on Kubernetes scrapes all the aforementioned metrics exposed by probes, cAdvisor, Heapster, and kube-state-metrics.
Additional benefits of Prometheus kick in when annotations are added to individual applications to expose custom metrics. As long as an application serves an endpoint in the Prometheus text exposition format, Prometheus can scrape the metrics and aggregate them with other information. The Prometheus server can also be backed by persistent storage to retain all the monitoring metrics. Lastly, Prometheus works together with Alertmanager: Prometheus evaluates the alerting rules you define, and Alertmanager notifies DevOps engineers via email or Slack.
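For a sense of what a scrapeable endpoint has to return, here is a minimal sketch that renders one metric in the plain-text format Prometheus reads. The function names and the `api_errors_total` metric are hypothetical, and only the basic `HELP`, `TYPE`, and sample lines are covered.

```python
def escape_label_value(value: str) -> str:
    """Escape backslashes, double quotes, and newlines in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name, help_text, metric_type, samples):
    """Render one metric family as exposition-format text.

    samples is a list of (labels, value) pairs, where labels is a dict
    (possibly empty) and value is a number.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(
                f'{key}="{escape_label_value(str(val))}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_metric(
        "api_errors_total",          # hypothetical metric name
        "Total API errors.",
        "counter",
        [({"route": "/search"}, 4)],
    ), end="")
```

In practice a client library produces this text for you; the sketch only shows that the format is a few lines of plain text per metric.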
The Helm chart for Prometheus will set up the default monitoring metrics using kube-state-metrics (if enabled). But what if you want to monitor your Node.js app? You can use prom-client, a Prometheus client library for Node.js, to format the metrics to be scraped by Prometheus. By default, Prometheus scrapes the '/metrics' endpoint of each target, and the various client libraries format their output so that Prometheus can read it.
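The sketch below shows the same pattern without any client library, using only the Python standard library as a stand-in for what prom-client does in a Node.js app: keep a counter in memory and serve it on `/metrics`. The `Counter` class, the metric name `messages_processed_total`, and the port are assumptions made for illustration.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class Counter:
    """A minimal thread-safe counter, standing in for a client-library metric."""

    def __init__(self, name: str, help_text: str) -> None:
        self.name = name
        self.help_text = help_text
        self._value = 0
        self._lock = threading.Lock()

    def inc(self, amount: int = 1) -> None:
        with self._lock:
            self._value += amount

    def render(self) -> str:
        """Return the counter in the text format Prometheus scrapes."""
        with self._lock:
            value = self._value
        return (
            f"# HELP {self.name} {self.help_text}\n"
            f"# TYPE {self.name} counter\n"
            f"{self.name} {value}\n"
        )


# Hypothetical application metric; call MESSAGES_PROCESSED.inc() per message.
MESSAGES_PROCESSED = Counter(
    "messages_processed_total", "Messages processed by the app."
)


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        payload = MESSAGES_PROCESSED.render().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8000) -> None:
    """Block forever, serving the current counter value on /metrics."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

The application increments the counter wherever the event happens and Prometheus pulls the current value on each scrape, so the app never needs to know where the monitoring server lives.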
Since event exporting is enabled by default with GKE, Stackdriver is a great option for monitoring GCP metrics. These include mature products such as PubSub, Compute, and BigQuery, as well as newer products in Cloud IoT, ML, and Firebase. (For a complete list of available metrics, see this documentation.)
Exporting these events to Prometheus follows a procedure similar to exporting custom metrics. In fact, GitHub user frodenas has published a Docker image of an exporter that is easy to deploy via Helm/Kubernetes.
Grafana is open-source software for time-series analysis and visualization that has native plugins for Prometheus and other popular data sources (Elasticsearch, InfluxDB, Graphite, etc.). While Prometheus provides rudimentary visualization functionality, it is limited to a simple time-series graph format. Grafana makes it easier to visualize all of the metrics collected by Prometheus in various formats: status checks, histograms, pie charts, trends, and complex graphs.