Re: Benching ceph for high speed RBD

Hey,

To be precise about the Ceph versions:

15.2.14 from http://eu.ceph.com/debian-15.2.14/
15.2.15 from http://eu.ceph.com/debian-15.2.15/
- both of these versions reach
~75k 4k qd4 writes
~650k 4k qd64 reads
(re-tested on vanilla 15.2.14 yesterday on a fresh cluster, with a 1h fio run per test)

15.2.14 (15.2.14-0ubuntu0.20.04.2) from http://archive.ubuntu.com/ubuntu/dists/focal-updates/universe
- for some reason this build is noticeably faster (different build options?)
~110k 4k qd4 writes
~750k 4k qd64 reads
also tested on a fresh cluster with 1h fio runs
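
To put the gap between the two builds in relative terms, here is a small sketch that uses only the approximate figures quoted above. The labels "upstream" (the eu.ceph.com packages) and "ubuntu" (the focal-updates package) are just names picked for this example.

```python
# Approximate IOPS figures quoted above.
# "upstream" = packages from eu.ceph.com (15.2.14 / 15.2.15)
# "ubuntu"   = 15.2.14-0ubuntu0.20.04.2 from focal-updates
# Both labels are illustrative names, not official terms.
results = {
    "4k qd4 writes": {"upstream": 75_000, "ubuntu": 110_000},
    "4k qd64 reads": {"upstream": 650_000, "ubuntu": 750_000},
}


def relative_gain(baseline, candidate):
    """Return the gain of candidate over baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


gains = {
    name: relative_gain(r["upstream"], r["ubuntu"])
    for name, r in results.items()
}

for name, gain in gains.items():
    print(f"{name}: {gain:+.1f}%")
```

So the Ubuntu build is roughly 47% faster on the qd4 writes and roughly 15% faster on the qd64 reads, given the rounded numbers above.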

The PCIe scheduler issue looks very interesting, although I think its impact is limited in my setup: each container is pinned to the NUMA node that the corresponding NVMe drive is attached to, so only the network card might sit on a different NUMA node.
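
One way to double-check that kind of placement is to read the NUMA node the kernel reports for each NVMe controller and network interface. The sketch below assumes a Linux host where PCIe devices expose `device/numa_node` under `/sys/class/nvme` and `/sys/class/net`; the helper name `device_numa_nodes` is made up for this example, and it does not look at container CPU pinning at all.

```python
from pathlib import Path


def device_numa_nodes(sysfs_class_dir):
    """Map device name -> NUMA node for each entry under a sysfs class dir.

    A value of -1 means the kernel reports no NUMA affinity for the device.
    Entries without a device/numa_node file (e.g. virtual NICs) are skipped.
    """
    nodes = {}
    for entry in sorted(Path(sysfs_class_dir).iterdir()):
        numa_file = entry / "device" / "numa_node"
        if numa_file.is_file():
            nodes[entry.name] = int(numa_file.read_text().strip())
    return nodes


if __name__ == "__main__":
    # Assumed locations on a typical Linux host; skipped if absent.
    for class_dir in ("/sys/class/nvme", "/sys/class/net"):
        if Path(class_dir).is_dir():
            for name, node in device_numa_nodes(class_dir).items():
                print(f"{class_dir}/{name}: NUMA node {node}")
```

Comparing that output with the CPU set each container is pinned to would show whether any OSD, or the NIC, ends up on the other socket.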

BR


On 2/23/22 22:33, Mark Nelson wrote:
Hi Bartosz,


Yep, my IOPS results are calculated the same way: basically just a sum of the averages as reported by fio with numjobs=1. My numbers are obviously higher, but I'm giving the OSDs a heck of a lot more CPU and aggregate PCIe/memory bus than you are, so that's not unexpected. It's interesting that 15.2.14 is showing the best results in your testing, while none of the 15.2.X tests in my setup showed any real advantage. Perhaps it has something to do with the way you aged/upgraded the cluster.
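
For reference, the "sum of the averages" calculation could be sketched roughly as below. This assumes each fio client was run with `--output-format=json` and that its report was saved to its own file; the thread does not say how the reports were actually collected, and the function name `total_iops` and the sample values are purely illustrative.

```python
import json


def total_iops(fio_reports, direction):
    """Sum the average IOPS over all jobs in all fio JSON reports.

    fio_reports: iterable of dicts parsed from fio's JSON output.
    direction:   "read" or "write".
    """
    total = 0.0
    for report in fio_reports:
        for job in report["jobs"]:
            total += job[direction]["iops"]
    return total


def load_reports(paths):
    """Parse one fio JSON report per path (paths are supplied by the caller)."""
    reports = []
    for path in paths:
        with open(path) as f:
            reports.append(json.load(f))
    return reports


# Synthetic, made-up reports just to show the shape of the data
# (two clients, numjobs=1 each); these are not measured values.
example = [
    {"jobs": [{"read": {"iops": 100.0}, "write": {"iops": 10.0}}]},
    {"jobs": [{"read": {"iops": 250.0}, "write": {"iops": 20.0}}]},
]
print(total_iops(example, "read"))
print(total_iops(example, "write"))
```

With numjobs=1 each report holds a single job entry, so the total is simply one average per client added together.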


One issue that may be relevant for you: at the 2021 Supercomputing Ceph BoF, Andras Pataki from the Flatiron Institute presented findings on their dual-socket AMD Rome nodes, where they were seeing a significant performance impact when running lots of NVMe drives. They believe this was caused by PCIe scheduler contention/latency, with wide variations in performance depending on which CPU the OSDs landed on relative to the NVMe drives and the network. AMD Rome systems typically have a special BIOS setting called "Preferred I/O" that improves scheduling for a single given PCIe device (which works), but at the expense of the other PCIe devices, so it doesn't really help. I don't know if there is a recording of the talk, but it was extremely good. I suspect this contention may be impacting your tests, especially if the container setup is resulting in lots of OSDs landing on the wrong CPU relative to their NVMe drive.



