I'm curious to hear if anyone has looked into kernel scheduling tweaks or changes in order to improve qd=1/bs=4k performance (while we patiently wait for Seastar!).

Using this toy cluster:

3x OSD nodes: Atom C3758 (8 cores, 2.2 GHz) with 1x Intel S4500 SSD each
Debian Bullseye with Linux 5.15.14 and Pacific 16.2.7, both compiled from source.

I was able to get ~550 IOPS "out of the box" using the command below to benchmark. By "out of the box", I mean after applying the *proven* optimizations like forcing the processors into C1, disabling P-states, etc. (rough commands for everything I tried are at the end of this mail).

$ rbd bench --io-type write image01 -p testbench --io-threads=1 --io-size 4K --io-pattern rand --rbd_cache=false

After looking at flamegraphs/perf/etc., I suspected some ping-ponging between threads. Maybe this is due to the "low" number of cores on the toy cluster? I also observed that some tunables like "ms async send inline" helped slightly, but they are off by default because they don't work well with high core counts.

As a more drastic step, I partitioned the system into two cgroups: 6 cores for the OSDs and 2 cores for everything else ("junk" cores, effectively). I should also mention that my kernel is compiled with NO_HZ and that I enabled it for the non-junk cores, thinking that some of the Ceph tasks might benefit from the opportunity to go tickless. In doing so, I noticed a small improvement, but at this point I was still only at maybe ~600-625 IOPS.

I next assigned the OSD processes a SCHED_RR policy and wow... an immediate jump to 800+ IOPS. For a cluster that consumes < 100 W (switch included), I am pretty pleased! I'm not really sure of the implications of having the OSD threads in RR, though, even though the cores are dedicated to that purpose...

Anyway, has anyone looked at goofing around with scheduling parameters on something resembling a production-grade Ceph cluster? I'm curious to know if it helps -- it's probably no secret that bluestore's current "low" QD1 performance is hindered by the locks and context switching involved in the lifecycle of an IOP... maybe a few scheduler changes could help unlock a good chunk of performance?

Cheers,
Tyler
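
P.S. For anyone who wants to poke at this, here is roughly what I did, step by step. Treat these as sketches from my particular setup rather than a recipe.

The C-state/P-state pinning was just the usual kernel parameters plus the performance governor:

  intel_idle.max_cstate=1 processor.max_cstate=1    (kernel command line)
$ sudo cpupower frequency-set -g performance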
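
The flamegraphs came from plain perf plus the FlameGraph scripts (assuming stackcollapse-perf.pl and flamegraph.pl are on your PATH), something like:

$ sudo perf record -F 99 -g -p $(pgrep -d, ceph-osd) -- sleep 30
$ sudo perf script | stackcollapse-perf.pl | flamegraph.pl > osd.svg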
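
The messenger tunable I mentioned is spelled ms_async_send_inline in the config; something along these lines flips it at runtime (it can also go in ceph.conf):

$ ceph config set osd ms_async_send_inline true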
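
The cgroup split plus the tickless cores were along these lines. Core numbers are specific to my 8-core Atoms (0-1 are the "junk" cores, 2-7 are for the OSDs), and cset shield is just a convenient stand-in for the two raw cpuset cgroups I described -- on a cgroup-v2-only box you would do the same with systemd's AllowedCPUs= or the cpuset controller directly:

  nohz_full=2-7    (kernel command line; kernel built with NO_HZ_FULL)
$ sudo cset shield --cpu=2-7 --kthread=on                  # everything else gets pushed onto cores 0-1
$ sudo cset shield --shield --pid=$(pgrep -d, ceph-osd)    # move the OSD processes into the shielded set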
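
And the SCHED_RR change was just chrt against the running OSD. chrt only touches one thread at a time, so loop over the TIDs (one OSD per node in my case; the priority of 50 was an arbitrary middle-of-the-road pick):

$ for tid in /proc/$(pgrep ceph-osd)/task/*; do sudo chrt --rr -p 50 ${tid##*/}; done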