Graphing Oracle performance metrics with Telegraf, InfluxDB and Grafana

Introduction

As a picture is worth a thousand words, we may want to visualise the oracle performance metrics. There is already several tools to do so, but what if you could do it the way you want? build your own graph? Let’s try to achieve this with 3 layers:

telegraf: to collect the oracle metrics
InfluxDB: to store the time-series oracle metrics
grafana: to visualise the oracle time-series metrics

Installation

Let’s install InfluxDB and grafana that way (for example on the same host, namely influxgraf):

Install InfluxDB (detail here):

[root@influxgraf ~]# wget https://s3.amazonaws.com/influxdb/influxdb-0.10.2-1.x86_64.rpm

[root@influxgraf ~]# yum localinstall influxdb-0.10.2-1.x86_64.rpm

Install grafana (detail here):

[root@influxgraf ~]# yum install https://grafanarel.s3.amazonaws.com/builds/grafana-2.6.0-1.x86_64.rpm

Let’s install telegraf on the oracle host (namely Dprima) that way (detail here):

[root@Dprima ~]# wget http://get.influxdb.org/telegraf/telegraf-0.10.4.1-1.x86_64.rpm

[root@Dprima ~]# yum localinstall telegraf-0.10.4.1-1.x86_64.rpm

Setup

Create a telegraf database into InfluxDB (using the web interface):

Screen Shot 2016-03-05 at 09.06.10

Create a root user into InfluxDB (using the web interface):

Screen Shot 2016-03-05 at 09.08.38

Write a script to collect the oracle metrics. I am using python but this is not mandatory at all. Only the output of the script does matter, it has to be InfluxDB line-protocol. The script query the v$sysmetric and v$eventmetric views to get the wait class and the wait event metrics during the last minute. I am not reinventing the wheel, I am using Kyle Hailey’s queries. The python code is:

[oracle@Dprima scripts]$ cat oracle_metrics.py

import os  
import sys  
import cx_Oracle  
import argparse  
import re

class OraMetrics():  
def __init__(self, user, passwd, sid):  
import cx_Oracle  
self.user = user  
self.passwd = passwd  
self.sid = sid  
self.connection = cx_Oracle.connect( self.user , self.passwd , self.sid )  
cursor = self.connection.cursor()  
cursor.execute("select HOST_NAME from v$instance")  
for hostname in cursor:  
self.hostname = hostname[0]

def waitclassstats(self, user, passwd, sid):  
cursor = self.connection.cursor()  
cursor.execute("""  
select n.wait_class, round(m.time_waited/m.INTSIZE_CSEC,3) AAS  
from v$waitclassmetric m, v$system_wait_class n  
where m.wait_class_id=n.wait_class_id and n.wait_class != 'Idle'  
union  
select 'CPU', round(value/100,3) AAS  
from v$sysmetric where metric_name='CPU Usage Per Sec' and group_id=2  
union select 'CPU_OS', round((prcnt.busy*parameter.cpu_count)/100,3) - aas.cpu  
from  
( select value busy  
from v$sysmetric  
where metric_name='Host CPU Utilization (%)'  
and group_id=2 ) prcnt,  
( select value cpu_count from v$parameter where name='cpu_count' ) parameter,  
( select 'CPU', round(value/100,3) cpu from v$sysmetric where metric_name='CPU Usage Per Sec' and group_id=2) aas  
""")  
for wait in cursor:  
wait_name = wait[0]  
wait_value = wait[1]  
print "oracle_wait_class,host=%s,db=%s,wait_class=%s wait_value=%s" % (self.hostname, sid,re.sub(' ', '_', wait_name), wait_value )

def waitstats(self, user, passwd, sid):  
cursor = self.connection.cursor()  
cursor.execute("""  
select  
n.wait_class wait_class,  
n.name wait_name,  
m.wait_count cnt,  
round(10*m.time_waited/nullif(m.wait_count,0),3) avgms  
from v$eventmetric m,  
v$event_name n  
where m.event_id=n.event_id  
and n.wait_class &lt;> 'Idle' and m.wait_count > 0 order by 1 """)  
for wait in cursor:  
wait_class = wait[0]  
wait_name = wait[1]  
wait_cnt = wait[2]  
wait_avgms = wait[3]  
print "oracle_wait_event,host=%s,db=%s,wait_class=%s,wait_event=%s count=%s,latency=%s" % (self.hostname, sid,re.sub(' ', '_', wait_class),re.sub(' ', '_', wait_name)  
, wait_cnt,wait_avgms)

if __name__ == "__main__":  
parser = argparse.ArgumentParser()  
parser.add_argument('-u', '--user', help="Username", required=True)  
parser.add_argument('-p', '--passwd', required=True)  
parser.add_argument('-s', '--sid', help="SID", required=True)

args = parser.parse_args()

stats = OraMetrics(args.user, args.passwd, args.sid)  
stats.waitclassstats(args.user, args.passwd, args.sid)  
stats.waitstats(args.user, args.passwd, args.sid)  

The output looks like (the output format is the InfluxDB line-protocol):

[oracle@Dprima scripts]$ python "/home/oracle/scripts/oracle_metrics.py" "-u" "system" "-p" "bdtbdt" "-s" PBDT
oracle_wait_class,host=Dprima,db=PBDT,wait_class=Administrative wait_value=0
oracle_wait_class,host=Dprima,db=PBDT,wait_class=CPU wait_value=0
oracle_wait_class,host=Dprima,db=PBDT,wait_class=CPU_OS wait_value=0.035
oracle_wait_class,host=Dprima,db=PBDT,wait_class=Commit wait_value=0
oracle_wait_class,host=Dprima,db=PBDT,wait_class=Concurrency wait_value=0
oracle_wait_class,host=Dprima,db=PBDT,wait_class=Configuration wait_value=0
oracle_wait_class,host=Dprima,db=PBDT,wait_class=Network wait_value=0
oracle_wait_class,host=Dprima,db=PBDT,wait_class=Other wait_value=0
oracle_wait_class,host=Dprima,db=PBDT,wait_class=Scheduler wait_value=0
oracle_wait_class,host=Dprima,db=PBDT,wait_class=System_I/O wait_value=0.005
oracle_wait_class,host=Dprima,db=PBDT,wait_class=User_I/O wait_value=0
oracle_wait_event,host=Dprima,db=PBDT,wait_class=System_I/O,wait_event=control_file_sequential_read count=163,latency=0.009
oracle_wait_event,host=Dprima,db=PBDT,wait_class=System_I/O,wait_event=control_file_parallel_write count=60,latency=3.933
oracle_wait_event,host=Dprima,db=PBDT,wait_class=System_I/O,wait_event=log_file_parallel_write count=60,latency=1.35
oracle_wait_event,host=Dprima,db=PBDT,wait_class=User_I/O,wait_event=Disk_file_operations_I/O count=16,latency=0.037
oracle_wait_event,host=Dprima,db=PBDT,wait_class=User_I/O,wait_event=Parameter_File_I/O count=16,latency=0.004

On the oracle host, configure telegraf to execute the python script with 60 seconds interval and send the output to InfluxDB. Edit the /etc/telegraf/telegraf.conf file so that it contains:

###############################################################################
#                                  OUTPUTS                                    #
###############################################################################

# Configuration for influxdb server to send metrics to
[[outputs.influxdb]]
  urls = ["http://influxgraf:8086"] # required
  database = "telegraf" # required
  precision = "s"
  timeout = "5s"


###############################################################################
#                                  INPUTS                                     #
###############################################################################

# Oracle metrics
[[inputs.exec]]
  # Shell/commands array
  commands = ["/home/oracle/scripts/oracle_metrics.sh"]
  # Data format to consume. This can be "json", "influx" or "graphite" (line-protocol)
  # NOTE json only reads numerical measurements, strings and booleans are ignored.
  data_format = "influx"
  interval = "60s"

The oracle_metrics.sh script contains the call to the python script:

#!/bin/env bash

export LD_LIBRARY_PATH=/u01/app/oracle/product/11.2.0/dbhome_1//lib
export ORACLE_HOME=/u01/app/oracle/product/11.2.0/dbhome_1/

python "/home/oracle/scripts/oracle_metrics.py" "-u" "system" "-p" "bdtbdt" "-s" PBDT

Launch telegraf:

[root@Dprima ~]# telegraf -config /etc/telegraf/telegraf.conf

Graph the metrics

First check that the metrics are stored the way we want into InfluxDB:

Screen Shot 2016-03-05 at 09.49.09

Configure InfluxDB as datasource in grafana:

Screen Shot 2016-03-05 at 10.03.55

In grafana, create a dashboard and create some variables (hosts, db and wait_class):

Screen Shot 2016-03-05 at 10.59.00

Screen Shot 2016-03-05 at 10.59.57

Now let’s create a graph:

Screen Shot 2016-03-05 at 11.04.09

Get the metrics:

Screen Shot 2016-03-05 at 11.06.18

So that the graph looks like:

Screen Shot 2016-03-05 at 11.20.18

Remarks

Thanks to the variables defined in grafana, we can filter on the host, database and wait class (instead of visualising all of them).
We can also graph the wait events we collected, or whatever you want to collect.

Conclusion

Thanks to:

telegraf and InfluxDB, we are able to collect and store the metrics we want to.
grafana, we are able to visualise the metrics the way we want to.

Update 04/21/2016:

As another example, with more informations collected:

dash1

dash2

dash3

dash4

dash5