Introduction

Starting with Oracle Database 12c, you can use the processor_group_name parameter to instruct the database instance to run itself within a specified operating system processor group. You can find more detail about this feature in Nikolay Manchev's blog post.

So basically, once the libcgroup package has been installed:

> yum install libcgroup

you can create an "oracle" group (for example) by editing the /etc/cgconfig.conf file so that it looks like this:

> cat /etc/cgconfig.conf 
mount {
        cpuset  = /cgroup/cpuset;
        cpu     = /cgroup/cpu;
        cpuacct = /cgroup/cpuacct;
        memory  = /cgroup/memory;
        devices = /cgroup/devices;
        freezer = /cgroup/freezer;
        net_cls = /cgroup/net_cls;
        blkio   = /cgroup/blkio;
}

group oracle {
   perm {
     task {
       uid = oracle;
       gid = dba;
     }
     admin {
       uid = oracle;
       gid = dba;
     }
   }
   cpuset {
     cpuset.mems="0";
     cpuset.cpus="1-9";
   }
}

start the cgconfig service:

> service cgconfig start

And then set the processor_group_name instance parameter to the group (“oracle” in our example).
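Once the service is running, you may want to check that the group looks as expected before pointing the instance at it. A minimal sketch, assuming the lscgroup/cgget tools shipped with the libcgroup package and a RHEL/OL 6 style service setup:

> lscgroup cpuset:/                            # list the cgroups created in the cpuset hierarchy
> cgget -r cpuset.cpus -r cpuset.mems oracle   # show the CPUs and memory nodes attached to the group
> chkconfig cgconfig on                        # optional: recreate the group automatically at boot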

The interesting part is that, to set up the group, we have to use two mandatory parameters in the cpuset section:

  • cpuset.mems: specifies the memory nodes that tasks in this cgroup are permitted to access.
  • cpuset.cpus: specifies the CPUs that tasks in this cgroup are permitted to access.

So it means that, with processor_group_name set:

  • The SGA will be allocated according to the cpuset.mems parameter.
  • The processes will be bound to the CPUs according to the cpuset.cpus parameter.

So, thanks to the processor_group_name parameter, we can measure the impact of remote versus local NUMA node access. Let's do it.

Test setup

My NUMA configuration is the following:

> numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 1 2 3 4 5 6 7 8 9 10 41 42 43 44 45 46 47 48 49 50
node 0 size: 132864 MB
node 0 free: 45326 MB
node 1 cpus: 11 12 13 14 15 16 17 18 19 20 51 52 53 54 55 56 57 58 59 60
node 1 size: 132864 MB
node 1 free: 46136 MB
node 2 cpus: 21 22 23 24 25 26 27 28 29 30 61 62 63 64 65 66 67 68 69 70
node 2 size: 132864 MB
node 2 free: 40572 MB
node 3 cpus: 31 32 33 34 35 36 37 38 39 40 71 72 73 74 75 76 77 78 79 80
node 3 size: 132864 MB
node 3 free: 47145 MB
node 4 cpus: 0 81 82 83 84 85 86 87 88 89 120 121 122 123 124 125 126 127 128 129
node 4 size: 132836 MB
node 4 free: 40573 MB
node 5 cpus: 90 91 92 93 94 95 96 97 98 99 130 131 132 133 134 135 136 137 138 139
node 5 size: 132864 MB
node 5 free: 45564 MB
node 6 cpus: 100 101 102 103 104 105 106 107 108 109 140 141 142 143 144 145 146 147 148 149
node 6 size: 132864 MB
node 6 free: 20592 MB
node 7 cpus: 110 111 112 113 114 115 116 117 118 119 150 151 152 153 154 155 156 157 158 159
node 7 size: 132864 MB
node 7 free: 44170 MB
node distances:
node   0   1   2   3   4   5   6   7 
  0:  10  13  12  12  22  22  22  22 
  1:  13  10  12  12  22  22  22  22 
  2:  12  12  10  13  22  22  22  22 
  3:  12  12  13  10  22  22  22  22 
  4:  22  22  22  22  10  13  12  12 
  5:  22  22  22  22  13  10  12  12 
  6:  22  22  22  22  12  12  10  13 
  7:  22  22  22  22  12  12  13  10

As you can see:

  • CPUs 1 to 9 are located in node 0.
  • node 0 is far from node 7 (distance equals 22).

Then I am able to test local NUMA node access by creating the group with:

  • cpuset.cpus="1-9"
  • cpuset.mems="0"

As CPUs 1-9 are located in node 0, the SGA will be created in the same node the processes run on.
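If you want to double check which node a given CPU belongs to without reading the whole numactl output, the standard sysfs layout (or lscpu) can be queried directly, for example:

> cat /sys/devices/system/node/node0/cpulist   # CPUs attached to node 0 (contains 1-9 on this box)
> lscpu -p=CPU,NODE                            # dump the CPU to node mapping for every CPU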

And in the same way I am able to test remote NUMA node access by creating the group with:

  • cpuset.mems="7"
  • cpuset.cpus="1-9"

In that case CPUs 1-9 are still located in node 0, while the SGA will be created in node 7.

To test the effect of remote versus local NUMA node access, I'll use SLOB. From my point of view this is the perfect (and even mandatory) tool to test the effect of NUMA from the Oracle side.

For the tests I created 8 SLOB schemas (with SCALE=10000, that is 80 MB each) and applied some parameters at the instance level for logical I/O testing (as we need fully cached SLOB runs to test NUMA efficiently):

> cat lio_init.sql
-- reset the prefetch related (hidden) parameters and cpu_count to their default values
alter system reset "_db_block_prefetch_limit" scope=spfile sid='*';
alter system reset "_db_block_prefetch_quota"  scope=spfile sid='*';
alter system reset "_db_file_noncontig_mblock_read_count" scope=spfile sid='*';
alter system reset "cpu_count" scope=spfile sid='*';
-- size the buffer cache and the shared pool manually and disable automatic SGA management
alter system set "db_cache_size"=20g scope=spfile sid='*';
alter system set "shared_pool_size"=10g scope=spfile sid='*';
alter system set "sga_target"=0 scope=spfile sid='*';

and set processor_group_name to oracle:

SQL> alter system set processor_group_name='oracle' scope=spfile sid='*';
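Once the instance has been restarted, a quick way to confirm that the setting is in effect is a simple show parameter from sqlplus:

SQL> show parameter processor_group_name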

For an accurate comparison, each test is based on the same SLOB configuration and workload: 8 readers and a fixed run time of 3 minutes. Then I'll compare how many logical reads have been done in both cases.
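For reference, here is a sketch of the slob.conf entries matching that workload (parameter names as in SLOB 2.x; adapt them to your SLOB version if needed):

UPDATE_PCT=0   # readers only, no updates
RUN_TIME=180   # fixed run time of 3 minutes
WORK_LOOP=0    # the run is bounded by RUN_TIME, not by an iteration count
SCALE=10000    # 10000 blocks per SLOB schema, that is about 80 MB

Each run is then launched with 8 sessions (./runit.sh 8 with SLOB 2.x).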

Test 1: Measure local NUMA node access:

As explained previously, I do it by updating the /etc/cgconfig.conf file that way:

  • cpuset.cpus="1-9"
  • cpuset.mems="0"

restart the service:

> service cgconfig restart

and the database instance.
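Restarting the instance itself is nothing fancy (a sketch, assuming it is restarted manually from sqlplus):

> sqlplus / as sysdba
SQL> shutdown immediate
SQL> startup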

Check that the SGA is allocated in node 0:

> grep -i HugePages /sys/devices/system/node/node*/meminfo

/sys/devices/system/node/node0/meminfo:Node 0 HugePages_Total: 39303
/sys/devices/system/node/node0/meminfo:Node 0 HugePages_Free:  23589
/sys/devices/system/node/node0/meminfo:Node 0 HugePages_Surp:      0
/sys/devices/system/node/node1/meminfo:Node 1 HugePages_Total: 39302
/sys/devices/system/node/node1/meminfo:Node 1 HugePages_Free:  39046
/sys/devices/system/node/node1/meminfo:Node 1 HugePages_Surp:      0
/sys/devices/system/node/node2/meminfo:Node 2 HugePages_Total: 39302
/sys/devices/system/node/node2/meminfo:Node 2 HugePages_Free:  39047
/sys/devices/system/node/node2/meminfo:Node 2 HugePages_Surp:      0
/sys/devices/system/node/node3/meminfo:Node 3 HugePages_Total: 39302
/sys/devices/system/node/node3/meminfo:Node 3 HugePages_Free:  39047
/sys/devices/system/node/node3/meminfo:Node 3 HugePages_Surp:      0
/sys/devices/system/node/node4/meminfo:Node 4 HugePages_Total: 39302
/sys/devices/system/node/node4/meminfo:Node 4 HugePages_Free:  39047
/sys/devices/system/node/node4/meminfo:Node 4 HugePages_Surp:      0
/sys/devices/system/node/node5/meminfo:Node 5 HugePages_Total: 39302
/sys/devices/system/node/node5/meminfo:Node 5 HugePages_Free:  39047
/sys/devices/system/node/node5/meminfo:Node 5 HugePages_Surp:      0
/sys/devices/system/node/node6/meminfo:Node 6 HugePages_Total: 39302
/sys/devices/system/node/node6/meminfo:Node 6 HugePages_Free:  39047
/sys/devices/system/node/node6/meminfo:Node 6 HugePages_Surp:      0
/sys/devices/system/node/node7/meminfo:Node 7 HugePages_Total: 39302
/sys/devices/system/node/node7/meminfo:Node 7 HugePages_Free:  39047
/sys/devices/system/node/node7/meminfo:Node 7 HugePages_Surp:      0

As you can see, 39303 - 23589 = 15714 large pages have been allocated from node 0. This is the number of large pages needed for my SGA.
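To avoid doing the subtraction by hand, a small loop over the sysfs node directories can report the number of huge pages currently in use on each node (a quick sketch):

for n in /sys/devices/system/node/node*; do
  total=$(awk '/HugePages_Total/ {print $NF}' $n/meminfo)
  free=$(awk '/HugePages_Free/  {print $NF}' $n/meminfo)
  echo "$(basename $n): $((total - free)) huge pages in use"
done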

Then I launched SLOB 2 times with the following results:

Load Profile                        Per Second   Per Transaction  Per Exec  Per Call
~~~~~~~~~~~~~~~                 ---------------   --------------- --------- ---------
Run 1:  Logical read (blocks):          4,551,737.9      52,878,393.1
Run 2:  Logical read (blocks):          4,685,712.3      18,984,632.1


                                            Total Wait       Wait   % DB Wait
Event                                   Waits   Time (sec)    Avg(ms)   time Class
------------------------------    -----------   ---------- ---------- ------ --------
Run 1: DB CPU                                       1423.2              97.8
Run 2: DB CPU                                       1431.4              99.2

As you can see, during the 3-minute run time SLOB generated about 4,500,000 logical reads per second with local NUMA node access in place. These are also fully cached SLOB runs, as the top event is "DB CPU" with about 100% of the DB time.

Test 2: Measure remote NUMA node access:

As explained previously, I do it by updating the /etc/cgconfig.conf file that way:

  • cpuset.cpus="1-9"
  • cpuset.mems="7"

restart the service:

> service cgconfig restart

and the database instance.

Check that the SGA is allocated in node 7:

> grep -i HugePages /sys/devices/system/node/node*/meminfo
/sys/devices/system/node/node0/meminfo:Node 0 HugePages_Total: 39303
/sys/devices/system/node/node0/meminfo:Node 0 HugePages_Free:  39047
/sys/devices/system/node/node0/meminfo:Node 0 HugePages_Surp:      0
/sys/devices/system/node/node1/meminfo:Node 1 HugePages_Total: 39302
/sys/devices/system/node/node1/meminfo:Node 1 HugePages_Free:  39046
/sys/devices/system/node/node1/meminfo:Node 1 HugePages_Surp:      0
/sys/devices/system/node/node2/meminfo:Node 2 HugePages_Total: 39302
/sys/devices/system/node/node2/meminfo:Node 2 HugePages_Free:  39047
/sys/devices/system/node/node2/meminfo:Node 2 HugePages_Surp:      0
/sys/devices/system/node/node3/meminfo:Node 3 HugePages_Total: 39302
/sys/devices/system/node/node3/meminfo:Node 3 HugePages_Free:  39047
/sys/devices/system/node/node3/meminfo:Node 3 HugePages_Surp:      0
/sys/devices/system/node/node4/meminfo:Node 4 HugePages_Total: 39302
/sys/devices/system/node/node4/meminfo:Node 4 HugePages_Free:  39047
/sys/devices/system/node/node4/meminfo:Node 4 HugePages_Surp:      0
/sys/devices/system/node/node5/meminfo:Node 5 HugePages_Total: 39302
/sys/devices/system/node/node5/meminfo:Node 5 HugePages_Free:  39047
/sys/devices/system/node/node5/meminfo:Node 5 HugePages_Surp:      0
/sys/devices/system/node/node6/meminfo:Node 6 HugePages_Total: 39302
/sys/devices/system/node/node6/meminfo:Node 6 HugePages_Free:  39047
/sys/devices/system/node/node6/meminfo:Node 6 HugePages_Surp:      0
/sys/devices/system/node/node7/meminfo:Node 7 HugePages_Total: 39302
/sys/devices/system/node/node7/meminfo:Node 7 HugePages_Free:  23557
/sys/devices/system/node/node7/meminfo:Node 7 HugePages_Surp:      0

As you can see, 39302 - 23557 = 15745 large pages have been allocated from node 7. Again, this is the number of large pages needed for my SGA.

Then I launched SLOB 2 times with the following results:

Load Profile                        Per Second   Per Transaction  Per Exec  Per Call
~~~~~~~~~~~~~~~                 ---------------   --------------- --------- ---------
Run 1:  Logical read (blocks):          2,133,879.6      24,991,464.4
Run 2:  Logical read (blocks):          2,203,307.5      10,879,098.7


                                            Total Wait       Wait   % DB Wait
Event                                   Waits   Time (sec)    Avg(ms)   time Class
------------------------------    -----------   ---------- ---------- ------ --------
Run 1: DB CPU                                       1420.7              97.7
Run 2: DB CPU                                       1432.6              99.0

As you can see, during the 3-minute run time SLOB generated about 2,100,000 logical reads per second with remote NUMA node access in place. These are also fully cached SLOB runs, as the top event is "DB CPU" with about 100% of the DB time.

This is about 2.1 times fewer logical reads per second than with local NUMA node access (4,551,737.9 / 2,133,879.6 ≈ 2.13), which is close to the ratio of the node distances reported by numactl between node 0 and node 7 versus node 0 and itself (22/10 = 2.2).

Remarks:

  • The SGA is fully made of large pages as I set the use_large_pages parameter to “only”.
  • To check that only CPUs 1 to 9 worked during the SLOB run, you can read the mpstat.out file provided by SLOB at the end of the test.
  • To check that the Oracle sessions running the SLOB workload are bound to CPUs 1 to 9, you can use taskset (a loop over all the instance processes is sketched after this list):
> ps -ef | grep 641387
oracle    641387  641353 87 10:15 ?        00:00:55 oracleBDT12CG_1 (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))
> taskset -c -p 641387
pid 641387's current affinity list: 1-9
  • To check that your SLOB Logical I/O test delivers the maximum, you can read this post.
  • With SLOB schemas of 8 MB, I was not able to see such a huge difference between remote and local NUMA node access. I guess this is due to the 24576 KB L3 cache of the CPUs used during my tests.
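For completeness, the taskset check mentioned in the remarks can be extended to every process of the instance (background processes included) with a quick loop; the pgrep pattern below is just an assumption based on the instance name visible in the ps output above:

> for p in $(pgrep -f BDT12CG_1); do taskset -c -p $p; done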

Conclusion:

  • Thanks to the 12c database parameter processor_group_name, we have been able to measure the impact of remote versus local NUMA node access from the database point of view.
  • If you plan to use the processor_group_name database parameter, then pay attention to configure the group correctly with cpuset.cpus and cpuset.mems to ensure local NUMA node access (my tests show about 2.1 times fewer logical reads per second with my farthest remote access).

Useful links related to NUMA: