Using SLATE to Deploy Telegraf Monitoring

The SLATE platform provides a powerful, simple way to deploy a large variety of applications. In this blog post, we will demonstrate how SLATE can be leveraged to quickly deploy a monitoring solution for ScienceDMZ network infrastructure. Our monitoring solution will use Telegraf to monitor a group of hosts with the Simple Network Management Protocol, usually referred to as SNMP. More information about SNMP can be found here. We will assume collected metrics will be sent to a database at Indiana University’s Global Research Network Operations Center (GlobalNOC). However, metrics can also be sent to a separate InfluxDB database.

GlobalNOC Time-Series Data Services

Metrics marked for export using the tsds Telegraf output plugin are sent to the GlobalNOC Time-Series Data Services (TSDS) endpoint, provided that correct credentials are supplied. Metrics exported to the database join metrics for devices at many research institutions on the Internet2 network. This wealth of data opens up possibilities for many interesting applications. One such application is the Parallel Traceroute Visualization Project being developed at the University of Utah Center for High Performance Computing. The project aims to help researchers better understand data transfer nodes and the links between them by running traceroutes between pairs of specified hosts and displaying a visualization of the results. Once a traceroute is visualized, a user can hover over a host to display more information about network conditions at that host. Additionally, the GlobalNOC TSDS enables researchers to view graphs of historical network data for that specific host. This can be useful for reasoning about data transfer node performance and for other troubleshooting. The Parallel Traceroute Visualization Project and others like it can use the data stored in the TSDS to create more informative tools for researchers and research institutions.

Prerequisites

It is assumed that you already have access to a SLATE-registered Kubernetes cluster, and that you have already installed and set up the SLATE command line interface. If not, instructions can be found at SLATE Quickstart. Additionally, it is assumed that an SNMP daemon (a service that responds to SNMP requests) is running on one or more target hosts.

On CentOS 7, a simple SNMP setup can be installed by running the following commands:

yum install net-snmp net-snmp-utils
systemctl enable snmpd
systemctl restart snmpd

More details can be found here.
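
Before pointing Telegraf at a host, it can help to confirm that the SNMP daemon actually answers. The sketch below wraps snmpget (shipped in net-snmp-utils) in a small helper. The function name check_snmp is hypothetical, and the sketch assumes the daemon accepts SNMP v2c with the given community string. It queries sysUpTime by its numeric OID so that no MIB files need to be loaded.

```shell
# Query sysUpTime (numeric OID 1.3.6.1.2.1.1.3.0) from an SNMP agent.
# Usage: check_snmp <host> [community]   (community defaults to "public")
# check_snmp is an illustrative helper, not part of SLATE or net-snmp.
check_snmp() {
  if [ -z "${1:-}" ]; then
    echo "usage: check_snmp <host> [community]" >&2
    return 2
  fi
  if ! command -v snmpget >/dev/null 2>&1; then
    echo "snmpget not found; install net-snmp-utils first" >&2
    return 127
  fi
  # -v2c selects SNMP version 2c, -c sets the community string.
  snmpget -v2c -c "${2:-public}" "$1" 1.3.6.1.2.1.1.3.0
}
```

For example, check_snmp 127.0.0.1 public run on the target host itself should print the agent's uptime if the daemon is reachable with that community string.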

Configuration

To begin, a configuration file for the application must be fetched. The SLATE client provides a simple way to do this with the command below:

slate app get-conf telegraf > telegraf.yaml

This will save a local copy of the Telegraf configuration as a YAML file. We will modify this configuration and then deploy the application with it. Open the configuration file with your preferred text editor, and follow the instructions below to configure each piece of the application.

GlobalNOC Database Configuration

Navigate to the grnocOutput section. First, make sure that the enabled flag is set to true. Next, to push to GlobalNOC’s databases, credentials must be obtained.

Process for getting credentials here

Next, configure the database endpoint by filling out the grnocOutput section with the hostname, username, and password obtained earlier. This section will look like this:

grnocOutput:
  enabled: true
  hostname: "tsds.hostname.net"
  username: "tsds username"
  password: "tsds password"

Note that if GRNOC output is enabled, you will not be able to specify a custom set of OIDs. An OID is an identifier specifying a metric to monitor. More information can be found here.

Target Configuration

Locate the hostGroup section, under targets. It will look something like this:

targets:
  - hostGroup:
      community: "public"
      hosts:
        - "127.0.0.1:161"
      counter64Bit: false
      oids: |-
        [[inputs.snmp.field]]
          oid = "DISMAN-EVENT-MIB::sysUpTimeInstance"
          name = "uptime"

Hosts

Under hosts, replace the placeholder IP address with the IP address or full DNS name of the host you want to monitor. Specify the SNMP port by appending a colon followed by the port number (161 is the default SNMP port). Add as many additional hosts underneath as needed. As per YAML syntax, preface each entry with a hyphen and surround it with quotes to reduce ambiguity.
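
When the list of hosts is long, the entries can be generated rather than typed by hand. The following is a minimal sketch: format_hosts is a hypothetical helper, the addresses are placeholders, and it assumes IPv4 addresses or DNS names only (any argument that already contains a colon is taken to include a port).

```shell
# Print each host as a quoted YAML list item, adding the default SNMP
# port (161) when the argument does not already contain a port.
# Assumes IPv4 addresses or DNS names; IPv6 literals are not handled.
format_hosts() {
  for host in "$@"; do
    case "$host" in
      *:*) printf '%s\n' "- \"$host\"" ;;
      *)   printf '%s\n' "- \"$host:161\"" ;;
    esac
  done
}

# Placeholder hosts for illustration only.
format_hosts 192.0.2.10 dtn.example.org:1161
```

The printed lines can be pasted under hosts, indented to match the surrounding entries.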

Community String

Next, change the community parameter to the appropriate SNMP community string.

Counter Type

The counter64Bit parameter is only relevant if GRNOC output has been enabled. The counter64Bit flag switches between two different sets of OIDs, one for hosts with 64-bit SNMP counters, and one for hosts with 32-bit SNMP counters.

OIDs

Following this, configure the appropriate OIDs to monitor. The oids parameter is formatted a little differently from the rest of the file. To provide fine-grained control over these parameters, this section uses the Telegraf service's own configuration syntax. Specify the OIDs you want to monitor by following the documentation here.

Host Groups

Note that the hosts configured earlier must all share the same group of settings. To monitor additional hosts with different configurations (e.g. different community strings or desired OIDs), simply duplicate the entire hostGroup section, and populate it with the alternate configuration. This can be done as many times as needed. The default configuration file includes two hostGroup sections to illustrate this. If only one host group is required, delete the second hostGroup section.
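
To make the duplication concrete, the sketch below prints an illustrative targets fragment with two host groups, following the structure shown under Target Configuration. The helper name print_targets_example, all host addresses, and the second community string are placeholders chosen for illustration. The layout of the default file shipped with the application may differ in detail.

```shell
# Print an illustrative two-group "targets" fragment. All hosts and the
# second community string are placeholders, not real deployment values.
print_targets_example() {
  cat <<'EOF'
targets:
  - hostGroup:
      community: "public"
      hosts:
        - "192.0.2.10:161"
        - "192.0.2.11:161"
      counter64Bit: false
      oids: |-
        [[inputs.snmp.field]]
          oid = "DISMAN-EVENT-MIB::sysUpTimeInstance"
          name = "uptime"
  - hostGroup:
      community: "example-community"
      hosts:
        - "switch.example.org:161"
      counter64Bit: true
      oids: |-
        [[inputs.snmp.field]]
          oid = "DISMAN-EVENT-MIB::sysUpTimeInstance"
          name = "uptime"
EOF
}

print_targets_example
```

Each hostGroup entry carries its own community, hosts, counter64Bit, and oids settings, so hosts that differ in any of these belong in separate groups.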

Additional Parameters

There are three other parameters that can be configured. The first of these is writeToStdout. When set to true, Telegraf will additionally write its metrics to stdout inside its container. This can be useful for debugging, but is not necessary. Set this as needed.

The second parameter is interval. This controls the frequency at which Telegraf collects metrics. Specify your desired value here by combining an integer with a time unit. Valid time units include “ns”, “us”, “ms”, “s”, “m” and “h”. For example, to collect metrics every five seconds, enter the following:

interval: 5s

The third parameter is flushInterval. This controls the frequency at which Telegraf flushes its output plugins, that is, how often it writes to the specified databases. Set this in the same fashion as the interval parameter.
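
A quick check of interval and flushInterval values before deploying can catch typos such as "5sec". The sketch below is deliberately stricter than Telegraf itself: it accepts only an integer followed by one of the units listed above. The function name valid_interval is hypothetical.

```shell
# Return success if the argument is an integer followed by one of the
# units ns, us, ms, s, m, or h (for example "5s" or "300s").
# This is an illustrative check, stricter than Telegraf's own parser.
valid_interval() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]+(ns|us|ms|s|m|h)$'
}

for value in 5s 300s 5sec; do
  if valid_interval "$value"; then
    echo "$value: ok"
  else
    echo "$value: invalid"
  fi
done
```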

InfluxDB Configuration

To enable InfluxDB output, navigate to the influxOutput section of the configuration file. It will look like this:

influxOutput:
  enabled: false
  endpoint: "http://127.0.0.1:9999"
  database: "telegraf"
  httpBasicAuth:
    enabled: false
    username: "telegraf"
    password: "metrics"

First, set enabled to true. Next, set endpoint to the endpoint of the database you wish to push to. Make sure to specify the proper protocol and port. Then, specify the database name as the database parameter. If HTTP basic authentication has been enabled on your database, set the enabled flag under httpBasicAuth to true and supply the proper username and password. If not, this section can be left disabled.
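
Since a missing protocol or port is an easy mistake to make in the endpoint value, a simple format check can be useful. This is a minimal sketch under stated assumptions: valid_endpoint is a hypothetical helper, and it only recognizes the form http(s)://host:port with an optional trailing slash, so bracketed IPv6 addresses and URL paths are rejected.

```shell
# Return success if the argument looks like http://host:port or
# https://host:port (optionally followed by a single trailing slash).
# Illustrative only: it checks the format, not whether a database answers.
valid_endpoint() {
  printf '%s\n' "$1" | grep -Eq '^https?://[^/:[:space:]]+:[0-9]+/?$'
}

if valid_endpoint "http://127.0.0.1:9999"; then
  echo "endpoint format looks fine"
fi
```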

Installation

Once the application has been properly configured, we must deploy it. To install the application onto a SLATE cluster, simply run the command below:

slate app install telegraf --group <group_name> --cluster <cluster_name> --conf telegraf.yaml

This installs the Telegraf application onto the specified cluster, using the configuration prepared above.

Testing

Testing with Netcat

If you want to quickly test the application without setting up a database, netcat can be used as an improvised database endpoint. On the machine you want to receive metrics on, enter the command:

nc -lk 9999

This will listen for any incoming data on port 9999. Then, configure the InfluxDB endpoint to point to port 9999 on this machine. Enter something like this:

influxOutput:
  enabled: true
  endpoint: "http://<machine_ip_here>:9999"
  database: "telegraf"

Testing with InfluxDB

Alternatively, an InfluxDB endpoint can be set up easily using Docker. Configure Docker on the machine you want to receive metrics on, and pull the InfluxDB image with:

docker pull influxdb

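Next, run the image in a container with the appropriate ports and volumes. The official image listens on port 8086 inside the container by default, so the command below maps port 9999 on the host to it:

docker run -p 9999:8086 -v "$PWD":/var/lib/influxdb influxdb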

Configure Telegraf’s InfluxDB output to push over http to this endpoint on port 9999.

Troubleshooting

In the event that something is not working properly, logs from the container running Telegraf can be printed with the following command:

slate instance logs <instance_id>

An instance_id is a unique, randomly generated string prefixed with “instance” that SLATE assigns to each running experiment. This ID is printed when the app is installed. Additionally, a list of running applications and their IDs can be printed with the command:

slate instance list

For additional help, or to report a bug, please contact the SLATE team.

Configuration Notes

The following table lists the configurable parameters of the Telegraf monitoring application and their default values.

| Parameter | Description | Default |
| --- | --- | --- |
| Instance | Optional string to differentiate SLATE experiment instances | "" |
| writeToStdout | Optionally write to stdout in container | true |
| interval | Data collection interval | 5s |
| flushInterval | Output flush interval | 300s |
| grnocOutput.enabled | Whether to write to GlobalNOC database | true |
| grnocOutput.hostname | Database endpoint | tsds.hostname.net |
| grnocOutput.username | Database username | tsds username |
| grnocOutput.password | Database password | tsds password |
| targets.hostGroup.community | Community string of hostGroup | public |
| targets.hostGroup.hosts | Hosts to monitor | 127.0.0.1:161 |
| targets.hostGroup.counter64Bit | Type of SNMP counter on host machine | false |
| targets.hostGroup.oids | SNMP OIDs to poll | Telegraf configuration monitoring system uptime |
| influxOutput.enabled | Whether to write to InfluxDB | true |
| influxOutput.endpoint | Database endpoint | http://127.0.0.1:9999 |
| influxOutput.database | Database name | telegraf |
| influxOutput.httpBasicAuth.enabled | Whether HTTP basic authentication is enabled | false |
| influxOutput.httpBasicAuth.username | Database username | telegraf |
| influxOutput.httpBasicAuth.password | Database password | metrics |

The SLATE Team