How to Build a Simple ChatOps Bot with Kafka, Grafana, Prometheus, and Slack

If you operate any kind of software system, you know how important it is to have visibility into the health and performance of that system at all times. Traditionally, this meant keeping an eye on dashboards and responding to alerts. But what if you could monitor your system and diagnose issues all from the comfort of your team chat?

Enter ChatOps – the practice of managing technical and business operations through a chat interface. With a well-designed ChatOps bot, you can do things like:

  • Query the current status of services
  • Get on-demand graphs of key metrics
  • Perform basic operational tasks
  • Get smart notifications about issues

All without leaving your chat window. This can be incredibly valuable when you're on the go and need to quickly check in on systems from your phone.

In this tutorial, we'll walk through building a simple ChatOps bot that integrates with Kafka, Prometheus, and Grafana to enable easy monitoring over Slack. We'll cover:

  1. Setting up the monitoring infrastructure with Kafka, Prometheus, and Grafana
  2. Building a Slack bot in Python to query status and metrics
  3. Extending and customizing the bot

By the end, you'll have a working bot that can respond to questions and post relevant graphs right in your Slack channels. Let's get started!

The Monitoring Stack

Before we can build our helpful chat bot, we need a system for it to monitor. We'll use Kafka as an example system, with Prometheus and Grafana to collect and visualize the metrics.

Component Overview

Here's a quick primer on the components in our monitoring stack:

Kafka: A distributed streaming platform that lets you publish and subscribe to streams of records. Kafka is run as a cluster of one or more servers that can span multiple datacenters.

Prometheus: An open-source monitoring system that collects metrics from configured targets at given intervals, evaluates rule expressions, displays results, and can trigger alerts if some condition is observed to be true.

Prometheus JMX Exporter: An exporter that scrapes the JMX MBeans of a JVM process and exposes them over HTTP for Prometheus to consume. This allows us to collect metrics from Kafka.

Grafana: An open-source platform for beautiful analytics and monitoring. It allows you to query, visualize, alert on and understand your metrics no matter where they are stored.

Together, these tools allow us to run Kafka, collect detailed metrics on its performance, and visualize those metrics in nice dashboards.

Setting Up the Stack

We'll use Docker Compose to spin up our monitoring stack all at once. But first, we need to configure a few things to let the pieces talk to each other.

To expose metrics from Kafka to Prometheus, update the Kafka launch config in docker-compose.yml:

kafka:
  image: wurstmeister/kafka:1.0.0
  ...
  environment:
    ...  
    KAFKA_JMX_OPTS: "-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Djava.rmi.server.hostname=kafka -Dcom.sun.management.jmxremote.rmi.port=1099"      
    JMX_PORT: 1099

This enables the JMX metrics and exposes them on port 1099. The kafka-jmx-exporter service can then connect to port 1099 to scrape the metrics and serve them up for Prometheus:

kafka-jmx-exporter:
  ...
  environment:
    JMX_PORT: 1099
    JMX_HOST: kafka
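
Before wiring up Prometheus, it can help to confirm that the exporter is actually serving Kafka metrics. The following is a minimal sketch, not part of the original setup: it assumes the exporter's port 8080 is published to localhost and that it serves the usual Prometheus text format at /metrics. The names parse_metrics and fetch_metrics are made up for this example.

```python
from urllib.request import urlopen

# Assumption: the exporter's port is published to the machine running this script.
EXPORTER_URL = "http://localhost:8080/metrics"


def parse_metrics(text):
    """Parse Prometheus text exposition lines into {series: float}.

    Comment lines (# HELP / # TYPE) and blank lines are skipped. This sketch
    assumes samples carry no trailing timestamp.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated token; the rest is the series.
        series, _, value = line.rpartition(" ")
        try:
            samples[series.strip()] = float(value)
        except ValueError:
            continue  # ignore lines this simple parser does not understand
    return samples


def fetch_metrics(url=EXPORTER_URL, timeout=5):
    """Download the exporter's metrics page and parse it."""
    with urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

Calling fetch_metrics() and filtering the keys for those starting with "kafka" gives a quick sanity check that the JMX connection works.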

We also need to make sure Prometheus knows to collect metrics from the JMX exporter by adding it as a target in prometheus.yml:

- job_name: kafka
  static_configs:
    - targets: 
      - kafka-jmx-exporter:8080
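
Once Prometheus is running, its HTTP API can tell us whether the kafka job is being scraped. The sketch below sends an instant query for up{job="kafka"} and maps each instance to a boolean. It assumes Prometheus is published on localhost:9090; the function names are illustrative only.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: Prometheus is published on its default port on the local machine.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def targets_up(response):
    """Map each scraped instance to True/False from an `up` query response."""
    if response.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {response.get('error')}")
    return {
        # Each sample's value is a [timestamp, "string value"] pair.
        sample["metric"].get("instance", "unknown"): sample["value"][1] == "1"
        for sample in response["data"]["result"]
    }


def kafka_targets_up(base_url=PROMETHEUS_URL, timeout=5):
    """Ask Prometheus whether the `kafka` job's targets are being scraped."""
    url = build_query_url(base_url, 'up{job="kafka"}')
    with urlopen(url, timeout=timeout) as resp:
        return targets_up(json.load(resp))
```

An empty result from kafka_targets_up() would suggest the job name or target address in prometheus.yml does not match what Prometheus loaded.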

Finally, in grafana.ini, let's point Grafana at a directory of pre-made dashboard definitions tailored for Kafka monitoring:

[dashboards.json]
enabled = true
path = /etc/grafana/dashboards

With those pieces configured, we can bring up the whole stack with:

docker-compose up -d

Once everything spins up, you should be able to browse to Grafana and see the Kafka dashboard populated with metrics.

Building the Slack Bot

Now for the fun part: building a bot that can talk to our monitoring stack! We'll write the bot in Python and use the Slack API to post messages.

Creating a Slack App

First, we need to create a new Slack App and Bot User that our script can use to interact with Slack:

  1. Go to https://api.slack.com/apps and click "Create New App"
  2. Give your app a name and select a workspace
  3. Under "Add features and functionality", click "Bots"
  4. Add a bot user and give it a display name and default username
  5. Install the app to your workspace and note the bot's OAuth token

Connecting to Slack

In our Python script, we first need to create a Slack Client object and connect to the RTM API to listen for messages:

import slack  # slackclient 2.x

@slack.RTMClient.run_on(event='message')
def handle_message(**payload):
    # Here's where we'll handle incoming messages
    pass

The RTMClient from the Slack SDK gives us an easy way to listen for and respond to messages. We'll fill in the handle_message function with the core logic of our bot.
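
As a rough sketch of what that function might start with, the code below pulls the channel and text out of the RTM payload and ignores anything posted by a bot, so the bot does not answer its own replies. It relies on the payload carrying the event under "data" and a ready-made WebClient under "web_client", as slackclient 2.x does. The helper extract_command and the echo reply are made up for illustration.

```python
def extract_command(payload):
    """Return (channel, text) for a user message, or None if it should be ignored."""
    data = payload.get("data", {})
    # Skip bot posts and message subtypes (edits, joins, ...) so the bot
    # never reacts to its own replies.
    if data.get("bot_id") or data.get("subtype"):
        return None
    text = (data.get("text") or "").strip().lower()
    channel = data.get("channel")
    if not text or not channel:
        return None
    return channel, text


def handle_message(**payload):
    """Echo the message back; a placeholder for the real command logic."""
    parsed = extract_command(payload)
    if parsed is None:
        return
    channel, text = parsed
    # The RTM client passes a ready-to-use WebClient in the payload.
    payload["web_client"].chat_postMessage(channel=channel, text=f"You said: {text}")
```

In the real bot this handler would carry the @slack.RTMClient.run_on(event='message') decorator, and the client would be started with slack.RTMClient(token=...).start() using the bot token noted earlier.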

Responding to Queries

Let's teach our bot to respond to a few basic queries:

# `message` is the text of the incoming Slack message
response = None

if 'help' in message:
    response = "Here's what I can do:\n" + \
                "• health - Get current Kafka health\n" + \
                "• metrics - Get a Kafka metrics graph\n" + \
                "• config - Show Kafka configuration"

elif 'health' in message:
    response = "Kafka cluster is healthy!"

elif 'metrics' in message:
    # Generate a Grafana graph and post it
    pass

elif 'config' in message:
    config = get_kafka_config()
    response = f"Here is the Kafka config:\n```\n{config}\n```"
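
Substring checks like these also fire on words that merely contain a command, such as "unhealthy". One possible alternative, sketched here under the assumption that commands are single words, is a small lookup table keyed on whole words. The COMMANDS registry and the function names are hypothetical, and only two commands are wired in to keep the example short.

```python
def cmd_help():
    return ("Here's what I can do:\n"
            "• health - Get current Kafka health\n"
            "• metrics - Get a Kafka metrics graph\n"
            "• config - Show Kafka configuration")


def cmd_health():
    return "Kafka cluster is healthy!"


# Hypothetical registry: command word -> function returning the reply text.
COMMANDS = {"help": cmd_help, "health": cmd_health}


def build_response(message):
    """Return the reply for the first recognized word, or None if there is none."""
    for word in message.lower().split():
        # Drop trailing punctuation so "health?" still matches "health".
        handler = COMMANDS.get(word.strip("?!.,:"))
        if handler:
            return handler()
    return None
```

Adding a command then means adding one function and one dictionary entry, rather than another branch in the handler.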

The most interesting command here is metrics, which will generate a Grafana graph and post it to the channel.

Generating Graphs

Grafana has a nifty feature that lets you render any dashboard panel as an image. Usually this is powered by PhantomJS, but there's a bug in recent versions that prevents it from working reliably.

Instead, we can use Puppeteer, a Node.js API for driving headless Chrome, to take screenshots. Here's how we set it up:

  1. Run a container with Puppeteer and bind-mount the current directory
  2. Use the Docker API from our script to run Puppeteer and take a screenshot of the Grafana panel
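  3. Poll the directory for new PNG files, then POST them to Slack and clean up

import os
import time

import docker
import slack

# Assumption: the bot token is supplied via the SLACK_BOT_TOKEN environment variable
slack_client = slack.WebClient(token=os.environ['SLACK_BOT_TOKEN'])
# Docker needs an absolute host path for the bind mount
SCREENSHOT_DIR = os.path.dirname(os.path.abspath(__file__))

def generate_graph(url, channel, timeout=60):
    client = docker.from_env()

    # Run Puppeteer in a one-off container; it writes the PNG to /screenshots
    client.containers.run('alekzonder/puppeteer:1.0.0',
                          command=f'screenshot {url} 1366x768',
                          volumes={SCREENSHOT_DIR: {
                            'bind': '/screenshots',
                            'mode': 'rw'}},
                          detach=True)

    # Wait for new PNG files, giving up after `timeout` seconds
    deadline = time.time() + timeout
    while time.time() < deadline:
        time.sleep(1)
        for filename in os.listdir(SCREENSHOT_DIR):
            if filename.endswith('.png'):
                path = os.path.join(SCREENSHOT_DIR, filename)
                upload_graph(path, channel)
                os.remove(path)
                return

def upload_graph(path, channel):
    with open(path, 'rb') as image:
        slack_client.files_upload(file=image, filename=os.path.basename(path), channels=channel)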

We dynamically create a container with the script's directory bind-mounted, let it screenshot the Grafana dashboard so the PNG ends up on our local filesystem, then upload that file to Slack.
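
One thing to watch with directory polling is that a PNG left over from an earlier run would be picked up immediately. A possible refinement, shown here as a standalone sketch with a made-up function name, is to snapshot the directory before starting the container and wait only for files that were not there before, with a timeout so the bot never hangs.

```python
import os
import time


def wait_for_new_png(directory, known, timeout=60.0, interval=1.0):
    """Return the path of the first PNG in `directory` not listed in `known`.

    `known` is the set of file names that existed before the screenshot was
    requested. Returns None if nothing new shows up within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        for name in sorted(os.listdir(directory)):
            if name.endswith(".png") and name not in known:
                return os.path.join(directory, name)
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

The caller would take known = set(os.listdir(directory)) just before launching the container. This sketch does not check that the file has been fully written, so a short pause before uploading may still be wise.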

This uses the Docker SDK for Python, which talks to the lower-level Docker Engine API, rather than docker-compose, giving us more flexibility to do one-off tasks with containers.
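
The url handed to generate_graph should point at a single panel rather than a whole dashboard. The helper below is a sketch of how such a URL might be assembled. It assumes a Grafana release that serves standalone panels under /d-solo/<uid>/<slug> (older releases use a different path), that Grafana is reachable as grafana:3000 from inside the Puppeteer container, and that the panel can be viewed without logging in. The dashboard UID, slug, and panel ID are placeholders you would copy from your own dashboard.

```python
from urllib.parse import urlencode

# Assumption: Grafana's hostname and port as seen from the Puppeteer container.
GRAFANA_URL = "http://grafana:3000"


def panel_url(dashboard_uid, slug, panel_id, time_from="now-1h", time_to="now",
              base_url=GRAFANA_URL, org_id=1):
    """Build a URL that shows a single panel without the surrounding dashboard."""
    query = urlencode({
        "orgId": org_id,
        "panelId": panel_id,
        "from": time_from,
        "to": time_to,
    })
    # Assumed path layout; adjust to match your Grafana version.
    return f"{base_url.rstrip('/')}/d-solo/{dashboard_uid}/{slug}?{query}"
```

Keeping the time range as parameters makes it easy to support commands like "metrics last 6 hours" later on.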

Extending the Bot

We've built a simple bot that can check Kafka health, pull up metrics graphs, and show the current config. But there's a lot more you could teach it!

Some ideas to extend its functionality:

  • Show consumer group status and lag
  • Trigger leader election
  • Modify broker configs
  • Create/delete topics
  • Integrate with an incident management system

The sky's the limit in terms of operational tasks you can delegate to a well-designed ChatOps bot. By bringing your infrastructure controls into the place where your team is already communicating, you can level up your efficiency and incident response.
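
To make the first idea a little more concrete: consumer lag for a partition is the log end offset minus the offset the group has committed. The sketch below covers only that arithmetic and the formatting for Slack. How the offsets are fetched from Kafka is left out, and both function names are invented for this example.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Return per-partition lag: log end offset minus the group's committed offset.

    Both arguments map (topic, partition) -> offset. As a simplification, a
    partition with no committed offset is treated as lagging by its full end offset.
    """
    return {
        tp: max(end - committed_offsets.get(tp, 0), 0)
        for tp, end in end_offsets.items()
    }


def format_lag(lag):
    """Render lag as plain text for a Slack message, largest lag first."""
    rows = sorted(lag.items(), key=lambda item: item[1], reverse=True)
    lines = [f"{topic}[{partition}]: {value}" for (topic, partition), value in rows]
    return "\n".join(lines) or "No partitions found"
```

A "lag" command could then wrap the output of format_lag in a code block, the same way the config command does.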

Conclusion

In this post, we took a whirlwind tour through setting up Kafka monitoring with Prometheus and Grafana, then built a Slack bot to query metrics and config.

The basic principles we covered can be applied to monitoring all kinds of distributed systems:

  • Exposing Kafka metrics to Prometheus with JMX
  • Visualizing metrics in Grafana dashboards
  • Using the Slack RTM API to listen for and respond to messages
  • Generating dashboard snapshots with Puppeteer
  • Using the Docker Engine API to run one-off containers

If you're feeling inspired, try applying this setup to something you operate. And don't be afraid to get creative with the types of ChatOps commands that could help you move faster!
