Exadata real-time metrics extracted from cumulative metrics

Exadata provides a lot of useful metrics to monitor the Cells.

The Metrics can be of various types:

Cumulative: Cumulative statistics since the metric was created.
Instantaneous: Value at the time that the metric is collected.
Rate: Rates computed by averaging statistics over observation periods.
Transition: Collected when the value of the metric changes; typically captures important transitions in hardware status.

You can find some information on how to exploit those metrics in these posts:

UWE HESSE’s post

TANEL PODER’s post

But I think those types of metrics are not enough to answer all the basic questions.

Let me explain why with 2 examples:

Let’s have a look at the metrics GD_IO_RQ_W_SM and GD_IO_RQ_W_SM_SEC (restricted to one Grid Disk for readability):

dcli -c cell1 cellcli -e "list metriccurrent attributes name,metricType,metricObjectName,metricValue where name like \'.*GD_IO_RQ_W_SM.*\' and metricObjectName ='data_CD_disk01_cell'"

cell1: GD_IO_RQ_W_SM        Cumulative  data_CD_disk01_cell     2,930 IO requests
cell1: GD_IO_RQ_W_SM_SEC    Rate        data_CD_disk01_cell     0.3 IO/sec

So we can observe that this “cumulative” metric shows the number of small write I/O requests while its associated “rate” metric shows the number of small write I/O requests per second.

Let’s have a look at the metrics CD_IO_TM_W_SM and CD_IO_TM_W_SM_RQ (restricted to one Cell Disk for readability):

dcli -c cell1 cellcli -e "list metriccurrent attributes name,metricType,metricObjectName,metricValue where name like \'.*CD_IO_TM_W.*SM.*\' and metricobjectname='CD_disk07_cell'"

cell1: CD_IO_TM_W_SM        Cumulative  CD_disk07_cell      1,512,939 us
cell1: CD_IO_TM_W_SM_RQ     Rate        CD_disk07_cell      168 us/request

So we can observe that this “cumulative” metric shows the cumulative small write I/O latency in microseconds (us) while its associated “rate” metric shows the average small write I/O latency in us per request.

But how can I answer those questions:

How many small write I/O requests have been done during the last 80 seconds? (Unfortunately, 0.3 * 80 will not necessarily give the right answer, as it depends on the “observation period” of the rate metric.)
What was the small write I/O latency during the last 80 seconds?

You could ask the same kind of questions about any cumulative metric.

To answer all those questions I created a Perl script, exadata_metrics.pl (click on the link and then on the view source button to copy/paste the source code), that extracts real-time Exadata metrics from the cumulative metrics.

That is to say, the script works with all the cumulative metrics (the following command lists all of them):

cellcli -e "list metriccurrent attributes name,metricType where metricType='Cumulative'"

To extract real-time information the script takes a snapshot of cumulative metrics each second (default interval) and computes the differences with the previous snapshot.
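The snapshot-and-difference idea can be sketched in a few lines. This is only an illustration in Python, not the Perl script itself: the helper names parse_snapshot and diff_snapshots are made up for this example, the input is assumed to look like the dcli/cellcli output shown earlier, and the second snapshot value (2,950) is invented.

```python
import re

# Assumed line layout, based on the dcli/cellcli output shown earlier:
#   <cell>: <name> <metricType> <metricObjectName> <value> <unit>
LINE_RE = re.compile(
    r"^(?P<cell>\S+):\s+(?P<name>\S+)\s+(?P<type>\S+)\s+(?P<obj>\S+)\s+"
    r"(?P<value>[\d,.]+)\s*(?P<unit>.*)$"
)


def parse_snapshot(text):
    """Return {(cell, name, object): (value, unit)} for cumulative metrics only."""
    snapshot = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match or match.group("type") != "Cumulative":
            continue  # skip rate metrics and anything that does not parse
        key = (match.group("cell"), match.group("name"), match.group("obj"))
        value = float(match.group("value").replace(",", ""))
        snapshot[key] = (value, match.group("unit").strip())
    return snapshot


def diff_snapshots(previous, current):
    """Return the increase of each metric present in both snapshots."""
    deltas = {}
    for key, (value, unit) in current.items():
        if key in previous:
            deltas[key] = (value - previous[key][0], unit)
    return deltas


if __name__ == "__main__":
    # The first line is taken from the output above; the second value is hypothetical.
    before = parse_snapshot(
        "cell1: GD_IO_RQ_W_SM        Cumulative  data_CD_disk01_cell     2,930 IO requests"
    )
    after = parse_snapshot(
        "cell1: GD_IO_RQ_W_SM        Cumulative  data_CD_disk01_cell     2,950 IO requests"
    )
    print(diff_snapshots(before, after))
```

Keying each value on cell, metric name and object name is what lets two snapshots be matched line by line, whatever order cellcli returns them in.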

So, to get the answer to our first question:

./exadata_metrics.pl 80 cell=cell1 name='GD_IO_RQ_W_SM' metricobjectname='data_CD_disk01_cell'

04:30:38 CELL   NAME           OBJECTNAME           VALUE
04:30:38 cell1  GD_IO_RQ_W_SM  data_CD_disk01_cell  0.00 IO requests
--------------------------------------> NEW
04:31:58 CELL   NAME           OBJECTNAME           VALUE
04:31:58 cell1  GD_IO_RQ_W_SM  data_CD_disk01_cell  20.00 IO requests

As you can see, 20 small write I/O requests have been generated during the last 80 seconds (which is different from 0.3 * 80 = 24).

To get the answer to our second question:

./exadata_metrics.pl 80 cell=cell1 name_like='.*CD_IO_TM_W.*SM.*' metricobjectname='CD_disk07_cell'

06:48:33 CELL   NAME           OBJECTNAME      VALUE
06:48:33 cell1  CD_IO_TM_W_SM  CD_disk07_cell  0.00 us
--------------------------------------> NEW
06:49:53 CELL   NAME           OBJECTNAME      VALUE
06:49:53 cell1  CD_IO_TM_W_SM  CD_disk07_cell  3613.00 us

As you can see, the small write I/O latency accumulated during the last 80 seconds has been 3613 us.
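The 3613 us figure is the latency accumulated over the interval, not a per-request value. If the same interval is also sampled for a request-count metric on the same cell disk (assumed here to be CD_IO_RQ_W_SM), dividing the two differences gives an average latency per request for that interval. A minimal Python sketch, where the function name and the request count of 20 are purely illustrative and not taken from the output above:

```python
def average_latency_us(time_delta_us, request_delta):
    """Average latency per request over one interval.

    time_delta_us  -- increase of a cumulative latency metric (e.g. CD_IO_TM_W_SM)
    request_delta  -- increase of the matching request-count metric (assumed)
    Returns None when no request completed during the interval.
    """
    if request_delta <= 0:
        return None
    return time_delta_us / request_delta


if __name__ == "__main__":
    # 3613 us comes from the output above; 20 requests is a hypothetical value.
    print(average_latency_us(3613.0, 20))
```

Returning None rather than 0 for an idle interval keeps "no I/O" distinguishable from "very fast I/O".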

Here is the help output of the script:

./exadata_metrics.pl help

Usage: ./exadata_metrics.pl [Interval [Count]] [cell=] [top=] [name=] [metricobjectname=] [name_like=] [metricobjectname_like=]

Default Interval : 1 second.
Default Count : Unlimited

Parameter               Comment                                  Default
---------               -------                                  -------
CELL=                   comma separated list of cell to display
TOP=                    Number of rows to display                10
NAME=                   ALL - Show all cumulative metrics        ALL
NAME_LIKE=              ALL - Show all cumulative metrics        ALL
METRICOBJECTNAME=       ALL - Show all objects                   ALL
METRICOBJECTNAME_LIKE=  ALL - Show all objects                   ALL

Example: ./exadata_metrics.pl cell=cell1,cell2 name_like='.*FC.*'
Example: ./exadata_metrics.pl cell=cell1,cell2 name='CD_IO_BY_W_LG'
Example: ./exadata_metrics.pl cell=cell1,cell2 name='CD_IO_BY_W_LG' metricobjectname_like='.*disk.*'
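To make the usage line concrete, here is a rough Python sketch of how such a command line could be interpreted: up to two leading numbers for Interval and Count, then key=value parameters. It is not the script's actual Perl parsing; parse_args is a hypothetical helper and the defaults are simply copied from the help text above.

```python
# Defaults copied from the help text; None stands for "not provided".
DEFAULTS = {
    "cell": None,
    "top": "10",
    "name": "ALL",
    "name_like": "ALL",
    "metricobjectname": "ALL",
    "metricobjectname_like": "ALL",
}


def parse_args(argv):
    """Split arguments into (interval, count, params).

    interval defaults to 1 second, count to None (unlimited).
    """
    params = dict(DEFAULTS)
    positional = []
    for arg in argv:
        if "=" in arg:
            # Split on the first "=" only, so regex values stay intact.
            key, value = arg.split("=", 1)
            key = key.lower()
            if key not in params:
                raise ValueError(f"unknown parameter: {key}")
            params[key] = value
        else:
            positional.append(int(arg))
    if len(positional) > 2:
        raise ValueError("expected at most Interval and Count")
    interval = positional[0] if positional else 1
    count = positional[1] if len(positional) > 1 else None
    return interval, count, params


if __name__ == "__main__":
    print(parse_args(["80", "cell=cell1", "name=GD_IO_RQ_W_SM"]))
```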

The script is based on the dcli and cellcli commands and their regular expressions (which are described in Kerry Osborne’s post).

You can choose the number of snapshots to display and the time to wait between snapshots.
You can choose to filter on name and metricobjectname based on like or equal predicates.
You can work on all the cells or a subset thanks to the mandatory CELL parameter.
Any cell OS user allowed to run dcli without a password (celladmin, for example) can launch the script (ORACLE_HOME must be set).
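The equal and like filters map naturally onto the cellcli where clause used in the commands earlier in this post. The sketch below shows one way such a command string could be assembled in Python; build_list_command is a hypothetical name, and how the real script builds its commands is not shown here.

```python
def build_list_command(name="ALL", name_like="ALL",
                       metricobjectname="ALL", metricobjectname_like="ALL"):
    """Build a cellcli LIST METRICCURRENT command restricted to cumulative metrics.

    A filter left at "ALL" adds no predicate, mirroring the defaults in the help.
    """
    predicates = ["metricType='Cumulative'"]
    if name != "ALL":
        predicates.append(f"name='{name}'")
    if name_like != "ALL":
        predicates.append(f"name like '{name_like}'")
    if metricobjectname != "ALL":
        predicates.append(f"metricObjectName='{metricobjectname}'")
    if metricobjectname_like != "ALL":
        predicates.append(f"metricObjectName like '{metricobjectname_like}'")
    return (
        "list metriccurrent attributes name,metricType,metricObjectName,metricValue "
        "where " + " and ".join(predicates)
    )


if __name__ == "__main__":
    print(build_list_command(name="CD_IO_BY_W_LG", metricobjectname_like=".*disk.*"))
```

The resulting string would still need to be quoted and passed to dcli and cellcli -e, as in the commands shown earlier.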

Please don’t hesitate to tell me if this is useful for you and if you find any issues with this script.

Updates:

New features have been added, please see this post.
You should read this post for a better interpretation of the utility: Exadata Cell metrics: collectionTime attribute, something that matters