InfluxDB, Grafana and Telegraf: tests with parallelized inputs.exec

An InfluxDB container (the official image):

docker run -p 8086:8086 -v /home/usuario/trabajo/influxdb_20181006/:/var/lib/influxdb influxdb

Inside the container we can open the influx command-line shell and work as we would with any SQL database:

influx
Connected to http://localhost:8086 version 1.6.3
InfluxDB shell version: 1.6.3
> show databases
name: databases
name
----
_internal
> create database prueba
> show databases
name: databases
name
----
_internal
prueba
> create user usuario with password 'xxxxxxx'
> grant all on prueba to usuario
> show retention policy
ERR: error parsing query: found POLICY, expected POLICIES at line 1, char 16
Warning: It is possible this error is due to not setting a database.
Please set a database with the command "use <database>".
> show retention policies
ERR: database name required
Warning: It is possible this error is due to not setting a database.
Please set a database with the command "use <database>".
> show retention policies on _internal
name duration shardGroupDuration replicaN default
---- -------- ------------------ -------- -------
monitor 168h0m0s 24h0m0s 1 true
> create retention policy dia on prueba duration 1d replication 1 default
> show retention policies on prueba
name duration shardGroupDuration replicaN default
---- -------- ------------------ -------- -------
autogen 0s 168h0m0s 1 false
dia 24h0m0s 1h0m0s 1 true

We now have a working InfluxDB. Next comes a CentOS 7 container to act as the node to monitor (with Telegraf):

docker run -ti centos

We follow the steps in this guide to install Telegraf.

Finally, we launch the Grafana image:

docker run -d --name=grafana -p 3000:3000 grafana/grafana

Then we import the predefined Linux system metrics dashboard into Grafana, and that's it!

The amount of visual information we get is ENORMOUS, and it is well organized.

Any of the SELECTs behind the graphs can also be run from the InfluxDB command line.
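As a side note, the same SELECTs can also be sent over the InfluxDB 1.x HTTP /query endpoint, which is what the published port 8086 exposes. A minimal Python 3 sketch, assuming the database name used in this post; the helper name build_query_url and the sample query on the cpu measurement are made up for illustration, and nothing is sent unless the commented-out lines are enabled:

```python
from urllib.parse import urlencode

def build_query_url(base_url, query, db, username=None, password=None):
    """Build a GET URL for the InfluxDB 1.x /query endpoint."""
    params = {"db": db, "q": query}
    if username is not None:
        # InfluxDB 1.x accepts credentials as the u and p query parameters.
        params["u"] = username
        params["p"] = password or ""
    return base_url.rstrip("/") + "/query?" + urlencode(params)

# Hypothetical query; any SELECT copied from a Grafana panel would go here.
url = build_query_url("http://localhost:8086", 'SELECT * FROM "cpu" LIMIT 5', "prueba")
print(url)

# To actually run it against a live server:
# import json, urllib.request
# print(json.load(urllib.request.urlopen(url)))
```

Statements that modify data, such as CREATE DATABASE, need a POST instead of a GET, so this sketch only covers read queries.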

…

Now it is a matter of moving the problem forward:

The input we are using to run the curl calls is probably inputs.exec. According to the documentation, it accepts several different scripts to launch…

….

The scenario:

A machine running Telegraf with the following telegraf.conf:

[global_tags]
[agent]
    interval = "30s"
    debug = false
    hostname = "server-hostname"
    round_interval = false
    flush_interval = "10s"
    flush_jitter = "0s"
    collection_jitter = "0s"
    metric_batch_size = 1000
    metric_buffer_limit = 10000
    quiet = false
    logfile = ""
    omit_hostname = false
    precision = "1ns"
[[outputs.influxdb]]
    urls = ["http://192.168.1.133:8086"]
    database = "prueba"
    timeout = "5s"
    username = "usuario"
    password = "xxxxxxxx"
    retention_policy = "dia"
    precision = "1ns"
[[inputs.cpu]]
    percpu = true
    totalcpu = true
    collect_cpu_time = false
    report_active = false
[[inputs.disk]]
    ignore_fs = ["tmpfs", "devtmpfs", "devfs"]
[[inputs.diskio]]
[[inputs.mem]]
[[inputs.net]]
[[inputs.system]]
[[inputs.swap]]
[[inputs.netstat]]
[[inputs.processes]]
[[inputs.kernel]]
[[inputs.exec]]
  commands = [
    "/usr/local/bin/test.sh uno",
    "/usr/local/bin/test.sh dos",
    "/usr/local/bin/test.py tres",
    "/usr/local/bin/test.py cuatro"
  ]
  timeout = "20s"
  name_suffix = "_mycollector"
  data_format = "influx"

Two scripts in /usr/local/bin:

[root@e227e457f593 bin]# cat test.sh 
#!/bin/sh
#sleep 5
TIEMPO=$(date +"%s%N")
echo "example,tag1=$1,tag2=$TIEMPO i=42i,j=43i,k=44i $TIEMPO"

[root@e227e457f593 bin]# cat test.py 
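#!/usr/bin/python
import sys
import time

# Simulate a slow collector: hold the process for 5 seconds.
time.sleep(5)
# The first argument becomes the value of tag1.
val = sys.argv[1]
# Emit one point in InfluxDB line protocol; with no timestamp, Telegraf assigns one.
print('example,tag1={0},tag2=b i=42i,j=43i,k=44i'.format(val))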
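Both scripts print InfluxDB line protocol by hand, which is fine for fixed values but gets fragile once tags or fields contain spaces or commas. A small Python 3 sketch of a formatter that takes care of the escaping; the function names are hypothetical, and it only covers the common cases (integer, float, boolean and string fields):

```python
def _escape(text, chars):
    """Backslash-escape every character of `chars` found in `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text

def format_field(value):
    """Render a field value: integers get the i suffix, strings are quoted."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return "%di" % value
    if isinstance(value, float):
        return repr(value)
    # Inside a quoted string, escape backslashes first, then double quotes.
    return '"' + _escape(str(value), '\\"') + '"'

def to_line(measurement, tags, fields, timestamp_ns=None):
    """Build one line: measurement,tags fields [timestamp]."""
    line = _escape(measurement, ", ")
    for key in sorted(tags):
        line += ",%s=%s" % (_escape(key, ",= "), _escape(str(tags[key]), ",= "))
    line += " " + ",".join(
        "%s=%s" % (_escape(key, ",= "), format_field(fields[key]))
        for key in sorted(fields)
    )
    if timestamp_ns is not None:
        line += " %d" % timestamp_ns
    return line

print(to_line("example", {"tag1": "uno", "tag2": "b"}, {"i": 42, "j": 43, "k": 44}))
# prints: example,tag1=uno,tag2=b i=42i,j=43i,k=44i
```

Leaving timestamp_ns out reproduces what test.py does: the line carries no timestamp and the collector assigns one.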

When the interval comes around, Telegraf makes the four calls simultaneously, so two shell scripts and two Python scripts are started, each with its corresponding parameter. This way, the “parallelization” is handled by Telegraf, which waits for the output of each and every script before sending the data to InfluxDB.

telegraf 417 0 0 17:49 ? 00:00:00 /usr/bin/telegraf -pidfile /var/run/telegraf/telegraf.pid -config /etc/telegraf/telegraf.conf -config-directory /etc/telegraf/telegraf.d
telegraf 693 417 2 17:56 ? 00:00:00 /usr/bin/python /usr/local/bin/test.py cuatro
telegraf 694 417 0 17:56 ? 00:00:00 /bin/sh /usr/local/bin/test.sh dos
telegraf 695 417 2 17:56 ? 00:00:00 /usr/bin/python /usr/local/bin/test.py tres
telegraf 696 417 0 17:56 ? 00:00:00 /bin/sh /usr/local/bin/test.sh uno

This results in four entries in InfluxDB. The ones from test.py, which prints no timestamp of its own, get the TIMESTAMP of the insertion from Telegraf; the ones from test.sh carry the timestamp the script printed:

select * from example_mycollector 
1538858554409569900 server-hostname 42 43 44 uno  1538858554409569900
1538858554410877352 server-hostname 42 43 44 dos  1538858554410877352
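The behaviour described above, where all the commands start at once and the data is sent only after every one of them has finished, can be imitated in a few lines of Python 3. This is only an illustration of the idea, not how Telegraf implements it; the function names are invented, the commands are stand-ins for test.sh and test.py, and a command that fails or exceeds the timeout is simply dropped here:

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def run_command(argv, timeout):
    """Run one command and return its stdout lines, or [] on timeout or failure."""
    try:
        done = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return []
    if done.returncode != 0:
        return []
    return [line for line in done.stdout.splitlines() if line.strip()]

def gather(commands, timeout=20.0):
    """Start every command at once and wait for all of them before returning."""
    with ThreadPoolExecutor(max_workers=max(1, len(commands))) as pool:
        results = pool.map(lambda argv: run_command(argv, timeout), commands)
        # map() keeps the order of `commands`, whatever order they finish in.
        return [line for lines in results for line in lines]

# Stand-ins for the four collector scripts so the sketch runs anywhere.
commands = [
    [sys.executable, "-c", "print('example,tag1=%s i=42i')" % name]
    for name in ("uno", "dos", "tres", "cuatro")
]
for line in gather(commands):
    print(line)
```

With this shape the total wall time is roughly that of the slowest command rather than the sum of all of them, which is the point of letting the collector do the parallelization.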

…

Meanwhile, the debate about the precision of data collection goes on: nanoseconds or what.
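To make that debate concrete, here is a small Python 3 sketch of what a coarser precision does to the two timestamps shown in the query output. It simply truncates to the chosen unit, which is an assumption made for illustration (real tooling may round instead), and the function name is hypothetical:

```python
# Nanoseconds per unit for each supported precision.
UNITS_NS = {"ns": 1, "us": 1_000, "ms": 1_000_000, "s": 1_000_000_000}

def truncate_timestamp(ts_ns, precision):
    """Drop everything below the given precision from a nanosecond timestamp."""
    unit = UNITS_NS[precision]
    return ts_ns - ts_ns % unit

# The two timestamps from the example_mycollector rows above.
uno = 1538858554409569900
dos = 1538858554410877352
for precision in ("ns", "ms", "s"):
    print(precision, truncate_timestamp(uno, precision), truncate_timestamp(dos, precision))
```

At millisecond precision the two timestamps still differ, while at second precision both collapse to 1538858554000000000. That matters because InfluxDB treats two points with the same measurement, tag set and timestamp as the same point, so the coarser the precision, the easier it is for points in one series to overwrite each other.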