Re: Benching ceph for high speed RBD

Hey,

To be precise about the Ceph versions:

15.2.14 from http://eu.ceph.com/debian-15.2.14/
15.2.15 from http://eu.ceph.com/debian-15.2.15/
- both of these versions reach
~75k 4k qd4 writes
~650k 4k qd64 reads
(re-tested on vanilla 15.2.14 yesterday on a fresh cluster, with a 1h fio run per test)

15.2.14 (15.2.14-0ubuntu0.20.04.2) from http://archive.ubuntu.com/ubuntu/dists/focal-updates/universe
- this one, for some reason, is special (build options?)
~110k 4k qd4 writes
~750k 4k qd64 reads
also tested on a fresh cluster with 1h fio runs
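
For reference, a run of this shape can be driven with fio's rbd engine roughly as sketched below. This is only an illustration, not the exact setup behind the numbers above: random I/O is assumed, and the pool, image and client names are placeholders.

#!/usr/bin/env python3
"""Sketch of the two workloads (4k qd4 writes, 4k qd64 reads) via fio's rbd engine.

Pool/image/client names are placeholders -- substitute whatever your cluster uses.
"""
import subprocess

COMMON = [
    "fio",
    "--ioengine=rbd",
    "--clientname=admin",      # placeholder cephx user
    "--pool=rbd",              # placeholder pool
    "--rbdname=bench-img",     # placeholder RBD image
    "--direct=1",
    "--bs=4k",
    "--numjobs=1",             # single job per fio instance; aggregate = sum over instances
    "--time_based",
    "--runtime=3600",          # 1h per test, as in the runs above
    "--group_reporting",
]

def run(name: str, rw: str, iodepth: int) -> None:
    """Run one fio job and stream its output."""
    cmd = COMMON + [f"--name={name}", f"--rw={rw}", f"--iodepth={iodepth}"]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    run("4k-qd4-write", "randwrite", 4)
    run("4k-qd64-read", "randread", 64)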

The PCIe scheduler thing looks very interesting, although I think the issue is limited in my setup, as each container is pinned to the NUMA node where the corresponding NVMe is connected. So only the network card might sit on a different NUMA node.
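
A quick way to sanity-check that layout is to read the numa_node attribute that sysfs exposes for each PCI device; a minimal sketch (device and interface names will of course differ per host):

#!/usr/bin/env python3
"""Print the NUMA node of every NVMe controller and network interface.

Reads the standard sysfs attribute <device>/device/numa_node; -1 means
the platform did not report locality for that device.
"""
import glob
import os

def numa_node(dev_path: str) -> str:
    node_file = os.path.join(dev_path, "device", "numa_node")
    try:
        with open(node_file) as f:
            return f.read().strip()
    except OSError:
        return "n/a"   # e.g. virtual interfaces without a PCI parent

print("NVMe controllers:")
for ctrl in sorted(glob.glob("/sys/class/nvme/nvme*")):
    print(f"  {os.path.basename(ctrl)}: NUMA node {numa_node(ctrl)}")

print("Network interfaces:")
for nic in sorted(glob.glob("/sys/class/net/*")):
    print(f"  {os.path.basename(nic)}: NUMA node {numa_node(nic)}")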

BR


On 2/23/22 22:33, Mark Nelson wrote:
Hi Bartosz,


Yep, my IOPS results are calculated the same way.  Basically just a sum of the averages as reported by fio with numjobs=1.  My numbers are obviously higher, but I'm giving the OSDs a heck of a lot more CPU and aggregate PCIe/Mem bus than you are, so it's not unexpected.  It's interesting that 15.2.14 is showing the best results in your testing, but none of the 15.2.X tests in my setup showed any real advantage.  Perhaps it has something to do with the way you aged/upgraded the cluster.


One issue that may be relevant for you:  At the 2021 Supercomputing Ceph BOF, Andras Pataki from the Flatiron Institute presented findings on their dual-socket AMD Rome nodes, where they were seeing a significant performance impact when running lots of NVMe drives.  They believe the result was due to PCIe scheduler contention/latency, with wide variations in performance depending on which CPU the OSDs landed on relative to the NVMe drives and network.  AMD Rome systems typically have a special BIOS setting called "Preferred I/O" that improves scheduling for a single given PCIe device (which works), but at the expense of other PCIe devices, so it doesn't really help.  I don't know if there is a recording of the talk, but it was extremely good.  I suspect that may be impacting your tests, especially if the container setup is resulting in lots of OSDs landing on the wrong CPU relative to the NVMe drive.

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx



