Prometheus Alert Manager on Kubernetes

Wercker is a Docker-Native CI/CD Automation platform for Kubernetes & Microservice Deployments

Garland Kan
December 6, 2016

This blog will show you how to set up the Alert Manager on Kubernetes.

There are a lot of blogs out there about how to set up Prometheus on Kubernetes, but it seems all of them stop before finishing the setup with the Alert Manager, which is one of the big reasons I want Prometheus in the first place. This blog completes the picture and shows you how to set up the Alert Manager on Kubernetes.

The CoreOS Prometheus Operator has a PR open to add the Alert Manager to its setup, which will be great once it lands. For now, though, we still want to be alerted on things.

Alert Manager Parts

There are a few pieces to the Alert Manager setup. The Prometheus documentation has a good overview, but I'll recap them here.

Alert Manager
This is the container that handles sending out the alert notifications; it is where you configure how you want to route alerts and where to send them (PagerDuty, Slack, etc.).

Alert Rules
This is a set of Prometheus queries with thresholds that determine when an alert fires. Alert rules are evaluated on the Prometheus server/container, so this rules config file is mounted into the Prometheus container.

How it all fits together

Prometheus Pod
Here is a full Deployment manifest to launch Prometheus and the Alert Manager with the supporting configmaps.

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: prometheus-monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-monitoring
  template:
    metadata:
      name: prometheus-monitoring
      labels:
        app: prometheus-monitoring
    spec:
      containers:
      # Prometheus server
      - name: prometheus
        image: prom/prometheus:v1.3.1
        args:
        - '-storage.local.retention=72h'
        - '-storage.local.path=/home'
        - '-storage.local.memory-chunks=500000'
        - '-config.file=/etc/prometheus/prometheus.yml'
        - '-alertmanager.url=http://localhost:9093'
        - '-web.external-url=http://prometheus.example.com'
        ports:
        - name: web
          containerPort: 9090
        volumeMounts:
        - name: config-volume-prometheus
          mountPath: /etc/prometheus
        - name: config-volume-alert-rules
          mountPath: /etc/prometheus-rules
        - name: prometheus-data
          mountPath: /home
        resources:
          limits:
            cpu: 12000m
            memory: 10000Mi
          requests:
            cpu: 4000m
            memory: 4000Mi

      # Alert Manager
      - name: alertmanager
        image: quay.io/prometheus/alertmanager:v0.5.0
        args:
        - '-config.file=/etc/prometheus/alertmanager.yml'
        volumeMounts:
        - name: config-volume-alertmanager
          mountPath: /etc/prometheus

      # Volumes and config maps
      volumes:
      - name: config-volume-prometheus
        configMap:
          name: prometheus
      - name: config-volume-alertmanager
        configMap:
          name: prometheus-alertmanager
      - name: config-volume-alert-rules
        configMap:
          name: prometheus-alert-rules
      - name: prometheus-data
        emptyDir: {}
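To roll this out, one possible sequence is below. The filenames are illustrative, not from the post; create the three configmaps (shown in the next sections) before the Deployment so the volume mounts can resolve.

```shell
# Create the configmaps the Deployment mounts (example filenames)
kubectl apply -f prometheus-configmap.yml
kubectl apply -f alertmanager-configmap.yml
kubectl apply -f alert-rules-configmap.yml

# Launch the two-container Deployment
kubectl apply -f prometheus-deployment.yml

# Check that the prometheus and alertmanager containers are up
kubectl get pods -l app=prometheus-monitoring
```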

There are a few configmaps in here:

  • prometheus
    • The main Prometheus configuration file
    • This will be mounted to the Prometheus container
  • prometheus-alertmanager
    • The Alert Manager's main config file
    • This will be mounted to the Alert Manager's container
  • prometheus-alert-rules
    • Alerting rules
    • This will be mounted to the Prometheus container

Prometheus Config
Here is a fairly standard config for Prometheus to scrape metrics from Kubernetes. It is based on this upstream example: https://github.com/prometheus/prometheus/blob/master/documentation/examples/prometheus-kubernetes.yml

I have converted the file into a Kubernetes configmap so we can mount it straight into the pod.

#
# Based on: https://github.com/prometheus/prometheus/blob/master/documentation/examples/prometheus-kubernetes.yml
#
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus
data:
  prometheus.yml: |-
    global:
      scrape_interval: 15s

    # A scrape configuration for running Prometheus on a Kubernetes cluster.
    # This uses separate scrape configs for cluster components (i.e. API server, node)
    # and services to allow each to use different authentication configs.
    #
    # Kubernetes labels will be added as Prometheus labels on metrics via the
    # `labelmap` relabeling action.

    # This file comes from the kubernetes configmap
    rule_files:
    - '/etc/prometheus-rules/alert.rules'

    # Scrape config for cluster components.
    # Source of the kubernetes scrape config:
    # https://github.com/prometheus/prometheus/blob/master/documentation/examples/prometheus-kubernetes.yml
    scrape_configs:

    - job_name: 'kubernetes-apiservers'

      kubernetes_sd_configs:
      - role: endpoints

      # Default to scraping over https. If required, just disable this or change to
      # `http`.
      scheme: https

      # This TLS & bearer token file config is used to connect to the actual scrape
      # endpoints for cluster components. This is separate to discovery auth
      # configuration because discovery & scraping are two separate concerns in
      # Prometheus.
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        # If your node certificates are self-signed or use a different CA to the
        # master CA, then disable certificate verification below. Note that
        # certificate verification is an integral part of a secure infrastructure
        # so this should only be disabled in a controlled environment. You can
        # disable certificate verification by uncommenting the line below.
        #
        # insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

      # Keep only the default/kubernetes service endpoints for the https port. This
      # will add targets for each API server which Kubernetes adds an endpoint to
      # the default/kubernetes service.
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

    - job_name: 'kubernetes-nodes'

      # Default to scraping over https. If required, just disable this or change to
      # `http`.
      scheme: https

      # This TLS & bearer token file config is used to connect to the actual scrape
      # endpoints for cluster components. This is separate to discovery auth
      # configuration because discovery & scraping are two separate concerns in
      # Prometheus.
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        # If your node certificates are self-signed or use a different CA to the
        # master CA, then disable certificate verification below. Note that
        # certificate verification is an integral part of a secure infrastructure
        # so this should only be disabled in a controlled environment. You can
        # disable certificate verification by uncommenting the line below.
        #
        # insecure_skip_verify: true
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

      kubernetes_sd_configs:
      - role: node

      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)

    # Scrape config for service endpoints.
    #
    # The relabeling allows the actual service scrape endpoint to be configured
    # via the following annotations:
    #
    # * `prometheus.io/scrape`: Only scrape services that have a value of `true`
    # * `prometheus.io/scheme`: If the metrics endpoint is secured then you will need
    #   to set this to `https` & most likely set the `tls_config` of the scrape config.
    # * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
    # * `prometheus.io/port`: If the metrics are exposed on a different port to the
    #   service then set this appropriately.
    - job_name: 'kubernetes-service-endpoints'

      kubernetes_sd_configs:
      - role: endpoints

      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: (.+)(?::\d+);(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_service_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name

    # Example scrape config for probing services via the Blackbox Exporter.
    #
    # The relabeling allows the actual service scrape endpoint to be configured
    # via the following annotations:
    #
    # * `prometheus.io/probe`: Only probe services that have a value of `true`
    - job_name: 'kubernetes-services'

      metrics_path: /probe
      params:
        module: [http_2xx]

      kubernetes_sd_configs:
      - role: service

      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
        action: keep
        regex: true
      - source_labels: [__address__]
        target_label: __param_target
      - target_label: __address__
        replacement: blackbox
      - source_labels: [__param_target]
        target_label: instance
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_service_namespace]
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        target_label: kubernetes_name

    # Example scrape config for pods
    #
    # The relabeling allows the actual pod scrape endpoint to be configured via the
    # following annotations:
    #
    # * `prometheus.io/scrape`: Only scrape pods that have a value of `true`
    # * `prometheus.io/path`: If the metrics path is not `/metrics` override this.
    # * `prometheus.io/port`: Scrape the pod on the indicated port instead of the default of `9102`.
    - job_name: 'kubernetes-pods'

      kubernetes_sd_configs:
      - role: pod

      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: (.+):(?:\d+);(\d+)
        replacement: ${1}:${2}
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_pod_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name

    # Monitor etcd
    # The "etcd-cluster" is a kube service pointing to the etcd nodes
    - job_name: 'etcd'
      scrape_interval: 5s
      static_configs:
      - targets: ['etcd-cluster:2379']
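The address-rewriting relabel rule in the service-endpoints job above can be sanity-checked outside Prometheus: Prometheus joins the source label values with `;`, applies the fully anchored regex, and substitutes the replacement. A quick sketch in Python (Prometheus actually uses RE2, but this pattern behaves identically here; the sample address is made up):

```python
import re

# The regex and replacement from the kubernetes-service-endpoints job.
pattern = re.compile(r'(.+)(?::\d+);(\d+)')

# __address__ joined with the prometheus.io/port annotation value by ';'
joined = '10.32.0.7:443;9102'

match = pattern.fullmatch(joined)  # relabel regexes are anchored
new_address = match.expand(r'\1:\2')
print(new_address)  # 10.32.0.7:9102
```

So a service scraped at `10.32.0.7:443` with a `prometheus.io/port: "9102"` annotation ends up being scraped at port 9102 instead.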

The only change was adding this bit to it:

    # This file comes from the kubernetes configmap
    rule_files:
    - '/etc/prometheus-rules/alert.rules'

This tells Prometheus to load the alert rules file when it starts up.

Alert Manager's Configmap
This configmap describes how you want to route the notifications, and holds the authentication tokens needed to send them to destinations like PagerDuty or Slack.

There are a few notable parts to this config. The first is the default receiver, which in this case is the “slack_chatbots” receiver. This is the receiver that notifications are sent to when an alert does not match any other route.

The “routes” section describes the different places you can direct notifications based on the labels on the alerts (we will describe those after this section). You can send certain alerts to one Slack channel and other alerts to another channel.

Finally, there is the “receivers” configuration. This holds the authentication for each service and, for Slack, the message body you want to send over. The text section in the Slack receiver loops through the alert event and prints all of its labels into the Slack message. Seeing all of the labels is often useful for getting more context about what the alert is about.
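Before looking at the YAML, here is a toy model (mine, not Alertmanager's actual code) of how the routing tree below behaves: each sub-route's matchers are compared against the alert's labels, and an alert that matches no sub-route falls back to the root receiver.

```python
# Toy sketch of Alertmanager-style route matching (illustrative only).
DEFAULT_RECEIVER = 'slack_chatbots'
ROUTES = [
    {'match': {'service': 'frontend'}, 'receiver': 'slack_chatbots'},
    {'match': {'service': 'backend'}, 'receiver': 'pager_duty'},
]

def receivers_for(labels):
    """Return the receivers an alert with these labels would go to."""
    matched = [route['receiver'] for route in ROUTES
               if all(labels.get(k) == v for k, v in route['match'].items())]
    # No sub-route matched: fall back to the root receiver.
    return matched or [DEFAULT_RECEIVER]

print(receivers_for({'service': 'backend', 'severity': 'critical'}))  # ['pager_duty']
print(receivers_for({'service': 'database'}))                         # ['slack_chatbots']
```

Because both sub-routes in the config set `continue: true`, Alertmanager would keep evaluating later routes after a match; with only these two mutually exclusive matchers it makes no difference here.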

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-alertmanager
data:
  alertmanager.yml: |-
    global:
      # The smarthost and SMTP sender used for mail notifications.
      smtp_smarthost: 'localhost:25'
      smtp_from: 'alertmanager@example.org'

    # The root route on which each incoming alert enters.
    route:
      # The root route must not have any matchers as it is the entry point for
      # all alerts. It needs to have a receiver configured so alerts that do not
      # match any of the sub-routes are sent to someone.
      receiver: 'slack_chatbots'

      # The labels by which incoming alerts are grouped together. For example,
      # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
      # be batched into a single group.
      group_by: ['alertname', 'cluster']

      # When a new group of alerts is created by an incoming alert, wait at
      # least 'group_wait' to send the initial notification.
      # This ensures that multiple alerts for the same group that start
      # firing shortly after one another are batched together on the first
      # notification.
      group_wait: 30s

      # When the first notification has been sent, wait 'group_interval' to send
      # a batch of new alerts that started firing for that group.
      group_interval: 5m

      # If an alert has successfully been sent, wait 'repeat_interval' to
      # resend it.
      repeat_interval: 3h

      # All the above attributes are inherited by all child routes and can be
      # overwritten on each.

      routes:
      # These routes match on alert labels to catch alerts that are related
      # to a particular service.
      - match:
          service: frontend
        receiver: slack_chatbots
        continue: true

      - match:
          service: backend
        receiver: pager_duty
        continue: true

    # Inhibition rules allow muting a set of alerts given that another alert is
    # firing.
    # We use this to mute any warning-level notifications if the same alert is
    # already critical.
    inhibit_rules:
    - source_match:
        severity: 'critical'
      target_match:
        severity: 'warning'
      # Apply inhibition if the alertname is the same.
      equal: ['alertname']

    receivers:

    - name: 'slack_chatbots'
      slack_configs:
      - send_resolved: true
        api_url: 'https://hooks.slack.com/services/xxxxxxx'
        channel: '#chatbots'
        text: >-
          Summary: {{ .CommonAnnotations.summary }}
          Description: {{ .CommonAnnotations.description }}
          Details:
          {{ range .CommonLabels.SortedPairs }} - {{ .Name }} = {{ .Value }}
          {{ end }}
          Playbook: {{ .CommonAnnotations.playbook }}
          Graph: {{ .CommonAnnotations.graph }}

    - name: 'pager_duty'
      pagerduty_configs:
      - service_key: xxxxxxxxxxxxxxxxxx

Alert Rules
An alert rule is a Prometheus query with some extra context attached: how long the condition has to hold before firing, plus metadata describing what the alert is and which labels it carries for notification routing.

The alert rule config file below contains two rules. The first looks for nodes whose CPU usage stays high for more than 10 minutes. If the query keeps returning results for over 10 minutes, the alert starts firing. Its label is service = "backend", which matches the Alert Manager configuration above, so this alert is sent to PagerDuty.

The second alert looks for DNS lookup failures reported by Prometheus itself. If the failures persist for more than 1 minute, it starts alerting with the label service = "frontend", which per the Alert Manager config above sends the alert to Slack.
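The HighCPU expression in the rule below boils down to (all CPU seconds − idle/iowait seconds) / all CPU seconds × 100 per instance. A quick arithmetic check with made-up counter values (not real node_cpu samples):

```python
# Hypothetical cumulative node_cpu seconds for one instance, by mode.
cpu_seconds = {
    'user': 4000, 'nice': 10, 'system': 800, 'irq': 5,
    'softirq': 15, 'steal': 20, 'idle': 200, 'iowait': 50,
}

total = sum(cpu_seconds.values())                   # all eight modes
idle = cpu_seconds['idle'] + cpu_seconds['iowait']  # the "not busy" modes
usage_pct = (total - idle) / total * 100

print(round(usage_pct, 1))  # 95.1
print(usage_pct > 95)       # True -> the HighCPU condition holds
```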

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-alert-rules
data:
  alert.rules: |-
    ## alert.rules ##

    #
    # CPU Alerts
    #
    ALERT HighCPU
      IF ((sum(node_cpu{mode=~"user|nice|system|irq|softirq|steal|idle|iowait"}) by (instance, job)) - (sum(node_cpu{mode=~"idle|iowait"}) by (instance, job))) / (sum(node_cpu{mode=~"user|nice|system|irq|softirq|steal|idle|iowait"}) by (instance, job)) * 100 > 95
      FOR 10m
      LABELS { service = "backend" }
      ANNOTATIONS {
        summary = "High CPU Usage",
        description = "This machine has had really high CPU usage for over 10m",
      }

    #
    # DNS Lookup failures
    #
    ALERT DNSLookupFailureFromPrometheus
      IF prometheus_dns_sd_lookup_failures_total > 5
      FOR 1m
      LABELS { service = "frontend" }
      ANNOTATIONS {
        summary = "Prometheus reported more than 5 DNS lookup failures",
        description = "The Prometheus instance reported that it failed to query DNS. Look at kube-dns to see if it is having any problems",
      }
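One way to check the notification path without waiting for a real threshold breach is to push a hand-made alert straight to the Alert Manager's v1 API. A sketch of building such a payload (the label values are chosen to hit the pager_duty route above; the instance name is illustrative):

```python
import json

# A hand-made alert whose labels match the "service: backend" route,
# so the Alert Manager should forward it to PagerDuty.
test_alert = [{
    'labels': {
        'alertname': 'HighCPU',
        'service': 'backend',
        'instance': 'test-node',  # illustrative value
    },
    'annotations': {
        'summary': 'Manually fired test alert',
    },
}]

payload = json.dumps(test_alert)
print(payload)

# POST it to the Alert Manager, e.g. with kubectl port-forward running:
#   curl -XPOST http://localhost:9093/api/v1/alerts -d "$PAYLOAD"
```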


Summary

There are a lot of parts to a Prometheus system, alerting included, but once you know what the pieces are and how they interact, it all becomes much more manageable.
