Hi all,

I want to share a performance issue I just encountered on a test cluster of mine, specifically related to tuned. I started by setting the "throughput-performance" tuned profile on my OSD nodes and ran some benchmarks. I then applied the same profile to my client node, which intuitively sounds like a reasonable thing to do (I do want to tweak my client to maximize throughput if that's possible).

Long story short, I found out that one of the tweaks made by the "throughput-performance" profile is to set kernel.sched_wakeup_granularity_ns = 15000000 (15 ms), which reduces the maximum throughput I'm able to get from 1080 MB/s to 1060 MB/s (about a 2% drop). The default value of sched_wakeup_granularity_ns depends on the distro; on my system the default is 7.5 ms.

More info about the benchmark:

- The benchmark tool is 'rados bench'.
- The cluster has about 10 nodes with older hardware.
- The client node has only 4 CPUs; the OSD nodes have 16 CPUs and 5 OSDs each.
- The throughput difference is reproducible every time.
- This was a read workload, so there is less volatility in the results.
- All the data was in BlueStore's cache on the OSD nodes, so that accessing the HDDs wouldn't skew the results.
- I compared the throughput once the benchmark reached its steady state, during which the throughput is very stable (not surprising for a sequential read workload served from memory).

I have a theory that explains the reduced throughput. sched_wakeup_granularity_ns controls wakeup preemption: a newly woken task only preempts the currently running task if it is ahead of it by more than this granularity, so raising the value effectively lets the running task keep the CPU longer before a woken-up thread gets to run. With the higher setting, rados bench's threads take longer to get scheduled on-CPU (higher latency from the moment a thread is woken up and put on the runqueue to the moment it is scheduled in and starts running), which results in lower overall throughput.

We can measure that latency using 'perf sched timehist':

           time    cpu  task name                       wait time  sch delay   run time
                        [tid/pid]                          (msec)     (msec)     (msec)
--------------- ------  ------------------------------  ---------  ---------  ---------
 3279952.180957 [0002]  msgr-worker-1[50098/50094]          0.154      0.021      0.135

The metric of interest is in the 5th column (sch delay). If we look at the average of 'sch delay' for a lower-throughput run, we get:

$> perf sched timehist -i perf.data.slow | egrep 'msgr|rados' | awk '{ total += $5; count++ } END { print total/count }'
0.0243015

And for a higher-throughput run:

$> perf sched timehist -i perf.data.fast | egrep 'msgr|rados' | awk '{ total += $5; count++ } END { print total/count }'
0.00401659

So with the throughput-performance profile enabled on the client, the "wakeup-to-sched-in" delay is on average about 0.02 ms (20 µs, roughly 6x) longer, due to the sched_wakeup_granularity_ns setting. The fact that there are few CPUs on that node doesn't help. If I set the number of concurrent IOs to 1, I get the same throughput for both values of sched_wakeup_granularity, because there is (almost) always an available CPU, so rados bench's threads don't have to wait as long to get scheduled in and start consuming data. On the other hand, increasing sched_wakeup_granularity_ns on the OSD nodes doesn't reduce the throughput, because there are more CPUs than there are OSDs, and the wakeup-to-sched delay is "diluted" by the latency of reading/writing/moving data around.

I'm curious to know whether this theory makes sense, and whether other people have encountered similar situations (with tuned or otherwise).
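
For anyone who wants to reproduce the measurement, the read benchmark is roughly along these lines (the pool name, duration and concurrency below are placeholders, not necessarily the exact values I used):

# Populate the pool first and keep the objects around so there is data to read back.
$> rados bench -p testpool 60 write -t 16 --no-cleanup

# Sequential read benchmark; -t controls the number of concurrent IOs
# (dropping to -t 1 is what makes the throughput difference disappear for me).
$> rados bench -p testpool 60 seq -t 16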
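
The scheduling-latency numbers above come from capturing scheduler events with perf while the benchmark is running; the capture step looks roughly like this (the 30-second window and the output file name are arbitrary):

# Record scheduler events system-wide for 30 seconds during the benchmark.
$> perf sched record -o perf.data.slow sleep 30

# Then average 'sch delay' (5th column, in msec) for the rados bench and messenger threads,
# as in the commands above.
$> perf sched timehist -i perf.data.slow | egrep 'msgr|rados' | awk '{ total += $5; count++ } END { print total/count }'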
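
As a possible workaround on the client, tuned lets you keep the rest of the profile and override just that one sysctl with a small custom profile that inherits from throughput-performance. A minimal sketch (the profile name is made up, and 7500000 is simply my distro's 7.5 ms default; adjust for yours):

# /etc/tuned/throughput-performance-client/tuned.conf
[main]
include=throughput-performance

[sysctl]
# Put the wakeup granularity back to this distro's default of 7.5 ms.
kernel.sched_wakeup_granularity_ns=7500000

Then activate it and verify the value:

$> tuned-adm profile throughput-performance-client
$> sysctl kernel.sched_wakeup_granularity_ns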
Mohamad