Saturday, January 17, 2015

Monitoring Web Service Performance with Elasticsearch, Logstash and Kibana

Elasticsearch, Logstash and Kibana (www.elasticsearch.org) are fantastic open source products for collecting, storing, monitoring and analyzing events.

Here is one way to configure Logstash and Kibana to monitor system load and web application response times.  (Elasticsearch, while powerful, does its basic job so well that I did not need to configure anything for it in this example, beyond letting Kibana connect to it, which Kibana walked me through the first time I tried.)

Configuring Logstash

Note: This configuration works on Ubuntu servers. Your environment may need slight adjustments.

input {
    # System Load
    exec {
        # We are generating our own events here, so we 
        # get to assign a type of "system-loadavg" so 
        # Logstash can apply a specific filter to them.
        type => "system-loadavg"

        # Produce the system loads on a single line.
        # Example:
        #     0.23  0.18  1.03
        command => "cat /proc/loadavg | awk '{print $1,$2,$3}'"
     
        # Do this every 60s while Logstash is running.
        interval => 60
    }

    # Web Application Response Time
    exec {
        # Again, generating our own events, with type 
        # "ws-ping"
        type => "ws-ping"


        # (Optional) Add a field to our custom event.  If 
        # you monitor more than one service this way, you 
        # will be able to distinguish them using this field.
        add_field => { "service" => "jira" }


        # Execute a couple of commands to get a web 
        # page from JIRA
        #    /usr/bin/time    Uses the Linux time command, 
        #                     not the Bash built-in function
        #    -f '%e'          Specify the output should just 
        #                     be the real time, in seconds.
        #    2>&1             Redirect time's output from 
        #                     stderr to stdout for Logstash
        #
        #    curl             Just google this for an idea 
        #                     of its sheer awesomeness.
        #    -s               Silent mode, don't want the 
        #                     progress bar in the output.
        #    -k               Don't check SSL certificate 
        #                     (this is acceptable for our 
        #                     self-signed internal cert)
        #    -o /dev/null     Send the output to /dev/null. 
        #                     We don't care what the page
        #                     contains, just how long it 
        #                     takes.
        #    -u someuser:somepass 
        #                     We do not want to measure the 
        #                     static login page, so we'll 
        #                     need curl to authenticate with 
        #                     JIRA so it can get to a dashboard.
        #    https://our.jira.host/jira/ 
        #                     The page to load.
        #
        command => "/usr/bin/time -f '%e' curl -sk -o /dev/null -u someuser:somepass https://our.jira.host/jira/ 2>&1"

        # Do this every 60s while Logstash is running
        interval => 60
    }
}


filter {
    # Transform the system-loadavg event by adding names 
    # to the numbers
    if [type] == "system-loadavg" {
        grok {
            # Parse/match the event using the pattern 
            # of three numbers separated by spaces. Map 
            # the numbers to float numbers in Elasticsearch 
            # (otherwise, we won't be able to plot them.)
            match => { "message" => "%{NUMBER:load_avg_1m:float} %{NUMBER:load_avg_5m:float} %{NUMBER:load_avg_15m:float}" }
         
            # Only store matched values that have names 
            # provided for them (this is the default)
            named_captures_only => true
        }
    }
 
    # Transform the ws-ping event by adding a name to 
    # the number created by the input event.
    if [type] == "ws-ping" {
        grok {
            match => { "message" => "%{NUMBER:responsetime:float}" }
        }
    }
}


output {
    # Send all events to Elasticsearch.  Be sure to configure 
    # the correct host.
    elasticsearch { host => "your.elasticsearch.host" }
}
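
Before starting Logstash, it is worth running the two commands by hand to confirm that each produces a single line the grok filters above can parse.  The numbers here are only illustrative, and the JIRA host and credentials are the same placeholders used in the config.

    # Load averages: /proc/loadavg has five fields; awk keeps the first three
    $ cat /proc/loadavg
    0.23 0.18 1.03 2/512 12345
    $ cat /proc/loadavg | awk '{print $1,$2,$3}'
    0.23 0.18 1.03

    # Response time: GNU time prints just the elapsed seconds for the curl request
    $ /usr/bin/time -f '%e' curl -sk -o /dev/null -u someuser:somepass https://our.jira.host/jira/ 2>&1
    1.42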


Logstash will now generate performance-monitoring events every minute and send them to Elasticsearch.
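
To get there, save the three sections above into one config file and point Logstash at it, then give it a minute or two and query Elasticsearch to confirm the events are arriving.  The paths below are only examples (on an Ubuntu package install, /etc/logstash/conf.d/ is the usual home for config files), and the search assumes Logstash's default daily logstash-* indices.

    # Start Logstash with the config (install location and file name are examples)
    /opt/logstash/bin/logstash -f /etc/logstash/conf.d/monitoring.conf

    # Confirm the events are reaching Elasticsearch
    curl 'http://your.elasticsearch.host:9200/logstash-*/_search?q=type:ws-ping&size=1&pretty'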


Visualizing in Kibana

Here's how to configure your Kibana dashboard to watch and visualize these events.



Add the following two queries to your dashboard as separate queries.  You can optionally set a color and alias for each query so they look better on screen; this is done by clicking the color dot in the query field in Kibana.

  host:our.jira.host AND type:system-loadavg

  type:ws-ping AND service:jira

Notice that Logstash automatically adds a "host" field to the event.  You can see that our custom event types and fields have made it, too.
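
If you want to see those fields for yourself, temporarily add stdout { codec => rubydebug } to the output section of the Logstash config and it will print each event as it ships.  A system-loadavg event comes out looking roughly like this (the timestamp and values are made up, and the exact field set varies a little between Logstash versions):

    {
           "message" => "0.23 0.18 1.03",
        "@timestamp" => "2015-01-17T14:32:00.123Z",
              "type" => "system-loadavg",
              "host" => "our.jira.host",
       "load_avg_1m" => 0.23,
       "load_avg_5m" => 0.18,
      "load_avg_15m" => 1.03
    }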

To graph the system load average, add a new panel to your dashboard with the following parameters.

Panel type: Histogram
Title: 5 Minute Load Average
Chart Value: mean
Value field: load_avg_5m  (We named this in the Logstash configuration)
Chart option: Line (not bars), not stacked
Queries: set to "selected", then pick the query that corresponds to the system-loadavg search you defined above from the list Kibana provides.

To graph the response time, do the same thing as above, but select the responsetime field for the Value field, and the query that corresponds to the ws-ping event type as defined above.

That's it!

Here are a couple of things I learned along the way.

  • Kibana didn't update my charts at first.  The updating icon on each panel just sat there, spinning.  The problem was that I hadn't told Logstash to send the numbers into Elasticsearch as numbers, so it sent them as strings.  Kibana was getting errors back from Elasticsearch, but wasn't telling me anything about it.  Using the Inspect tool (the "i" icon on the panel) gives you a curl command so you can see the result of the query for yourself; checking the field mappings, as shown after this list, is another quick way to spot the problem.
  • Kibana didn't give us the option of plotting the actual value, so we use the mean to provide the closest representation of the value over the time between plot points.  It's a good compromise, since it smooths the plot somewhat.
  • We collect more events from our application logs and Linux system logs. Applying a grok pattern to those was more difficult, and the docs could be better.  But there is a Grok Debugger that allows you to paste in a sample of your event data and play with the grok patterns to see what it is able to match.  Once I found this, my work for the week was done in no time at all.
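
On that first point: if your charts sit and spin, it is worth checking the field mappings directly.  The request below is a plain Elasticsearch call against the default Logstash indices; look for load_avg_5m and responsetime and make sure they show a numeric type rather than string.  Keep in mind that an existing index keeps its old mapping even after you fix the grok filter, so the numbers only straighten out in indices created after the fix (by default Logstash starts a new index each day).

    curl 'http://your.elasticsearch.host:9200/logstash-*/_mapping?pretty'
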
This is a short intro to a very powerful event management platform.  I hope it helps you get up and running a little faster.

5 comments:

  1. Excellent post! I'll be spending my afternoon trying this on for size. I've not had much luck with the ELK stack displaying exactly what I want but this gives me hope. Keep up the good work.

    Replies
    1. Good luck! I hope you can get your system working just like you hope.

  2. So many possibilities :) Love the ELK stack!

  3. I am currently using ELK for logging and monitoring a relatively small number of cloud instances.

    There is no doubt that for log aggregation and visualization it is a wonderful set of tools; however, in my opinion ELK isn't suited for graphing ad-hoc metrics.

    Tools like Graphite are good for dynamic, day-to-day usage, while Kibana is more appropriate as a dashboard you build once and just stick somewhere.

    Replies
    1. Thanks for the comment. It's a good point. These charts are part of a dashboard that I use to check on things every morning when I get into the office. I agree that ad-hoc charting is painful. But ad-hoc querying of aggregated logs is where ELK really shines.

