Hi all,

I want to share a performance issue I just encountered on a test cluster of mine, specifically related to tuned. I started by setting the "throughput-performance" tuned profile on my OSD nodes and ran some benchmarks. I then applied the same profile to my client node, which intuitively sounds like a reasonable thing to do (I do want to tweak my client to maximize throughput if that's possible).

Long story short, I found out that one of the tweaks made by the "throughput-performance" profile is to set kernel.sched_wakeup_granularity_ns = 15000000 (15 ms), which reduces the maximum throughput I'm able to get from 1080 MB/s to 1060 MB/s (about a 2% drop). The default value of sched_wakeup_granularity_ns depends on the distro; on my system the default is 7.5 ms.

More info about the benchmark:

- The benchmark tool is 'rados bench'.
- The cluster has about 10 nodes with older hardware.
- The client node has only 4 CPUs; the OSD nodes have 16 CPUs and 5 OSDs each.
- The throughput difference is reproducible every time.
- This was a read workload, so there is less volatility in the results.
- All the data was in BlueStore's cache on the OSD nodes, so that accessing the HDDs wouldn't skew the results.
- I compared the throughput once the benchmark reached its steady state, during which the throughput is very stable (not surprising for a sequential read workload served from memory).

I have a theory that explains the reduced throughput. sched_wakeup_granularity_ns controls wakeup preemption: a newly woken task only preempts the currently running task if it is ahead of it by more than this granularity, so raising the value effectively lets the running task keep the CPU longer before a woken-up thread gets to run. With the higher setting, rados bench's threads take longer to get scheduled on-CPU (higher latency from the moment a thread is woken up and put on the runqueue to the moment it is scheduled in and starts running), which results in lower overall throughput.

We can measure that latency using 'perf sched timehist':

           time    cpu  task name                       wait time  sch delay   run time
                        [tid/pid]                          (msec)     (msec)     (msec)
--------------- ------  ------------------------------  ---------  ---------  ---------
 3279952.180957 [0002]  msgr-worker-1[50098/50094]          0.154      0.021      0.135

The metric of interest is in the 5th column (sch delay). If we look at the average of 'sch delay' for a lower-throughput run, we get:

$> perf sched timehist -i perf.data.slow | egrep 'msgr|rados' | awk '{ total += $5; count++ } END { print total/count }'
0.0243015

And for a higher-throughput run:

$> perf sched timehist -i perf.data.fast | egrep 'msgr|rados' | awk '{ total += $5; count++ } END { print total/count }'
0.00401659

So with the throughput-performance profile enabled on the client, the "wakeup-to-sched-in" delay is on average about 0.02 ms (20 µs, roughly 6x) longer, due to the sched_wakeup_granularity_ns setting. The fact that there are few CPUs on that node doesn't help. If I set the number of concurrent IOs to 1, I get the same throughput for both values of sched_wakeup_granularity, because there is (almost) always an available CPU, so rados bench's threads don't have to wait as long to get scheduled in and start consuming data. On the other hand, increasing sched_wakeup_granularity_ns on the OSD nodes doesn't reduce the throughput, because there are more CPUs than there are OSDs, and the wakeup-to-sched delay is "diluted" by the latency of reading/writing/moving data around.

I'm curious to know whether this theory makes sense, and whether other people have encountered similar situations (with tuned or otherwise).
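
For anyone who wants to reproduce the measurement, the read benchmark is roughly along these lines (the pool name, duration and concurrency below are placeholders, not necessarily the exact values I used):

# Populate the pool first and keep the objects around so there is data to read back.
$> rados bench -p testpool 60 write -t 16 --no-cleanup

# Sequential read benchmark; -t controls the number of concurrent IOs
# (dropping to -t 1 is what makes the throughput difference disappear for me).
$> rados bench -p testpool 60 seq -t 16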
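
The scheduling-latency numbers above come from capturing scheduler events with perf while the benchmark is running; the capture step looks roughly like this (the 30-second window and the output file name are arbitrary):

# Record scheduler events system-wide for 30 seconds during the benchmark.
$> perf sched record -o perf.data.slow sleep 30

# Then average 'sch delay' (5th column, in msec) for the rados bench and messenger threads,
# as in the commands above.
$> perf sched timehist -i perf.data.slow | egrep 'msgr|rados' | awk '{ total += $5; count++ } END { print total/count }'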
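
As a possible workaround on the client, tuned lets you keep the rest of the profile and override just that one sysctl with a small custom profile that inherits from throughput-performance. A minimal sketch (the profile name is made up, and 7500000 is simply my distro's 7.5 ms default; adjust for yours):

# /etc/tuned/throughput-performance-client/tuned.conf
[main]
include=throughput-performance

[sysctl]
# Put the wakeup granularity back to this distro's default of 7.5 ms.
kernel.sched_wakeup_granularity_ns=7500000

Then activate it and verify the value:

$> tuned-adm profile throughput-performance-client
$> sysctl kernel.sched_wakeup_granularity_ns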
Mohamad