Hey,
To be precise about the Ceph versions:
15.2.14 from http://eu.ceph.com/debian-15.2.14/
15.2.15 from http://eu.ceph.com/debian-15.2.15/
- both of these versions reach
~75k 4k qd4 writes
~650k 4k qd64 reads
(re-tested on vanilla 15.2.14 yesterday on a fresh cluster, 1h fio per
test)
15.2.14 (15.2.14-0ubuntu0.20.04.2) from
http://archive.ubuntu.com/ubuntu/dists/focal-updates/universe
- this one is faster for some reason (build options?)
~110k 4k qd4 writes
~750k 4k qd64 reads
(also tested on a fresh cluster with 1h fio runs)
The PCIe scheduler issue looks very interesting, although I think its
impact is limited in my setup, as each container is pinned to the NUMA
node where the corresponding NVMe is connected. So only the network card
might be on a different NUMA node.
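For anyone who wants to check the same thing on their own nodes, here is
a minimal sketch that lists the NUMA node of each NVMe controller and
network interface as reported by Linux sysfs. It assumes the usual
/sys/class/nvme/*/device/numa_node and /sys/class/net/*/device/numa_node
files are present; the function names numa_node and numa_map are made up
for this illustration and are not part of any Ceph tooling.

```python
from pathlib import Path


def numa_node(device_dir):
    """Return the NUMA node stored in <device_dir>/numa_node, or None if unavailable."""
    path = Path(device_dir) / "numa_node"
    try:
        return int(path.read_text().strip())
    except (FileNotFoundError, ValueError):
        # Virtual devices (e.g. loopback) have no backing PCI device.
        return None


def numa_map(sysfs_root="/sys"):
    """Map 'nvme/<name>' and 'net/<name>' to their NUMA node.

    A value of -1 means the kernel reports no NUMA affinity for that device.
    """
    root = Path(sysfs_root)
    result = {}
    for cls in ("nvme", "net"):
        class_dir = root / "class" / cls
        if not class_dir.is_dir():
            continue
        for dev in sorted(class_dir.iterdir()):
            node = numa_node(dev / "device")
            if node is not None:
                result[f"{cls}/{dev.name}"] = node
    return result


if __name__ == "__main__":
    for name, node in numa_map().items():
        print(f"{name}: NUMA node {node}")
```

Comparing this output with the CPU set each container is pinned to shows
which OSDs have to cross sockets to reach the network card.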
BR
On 2/23/22 22:33, Mark Nelson wrote:
Hi Bartosz,
Yep, my IOPS results are calculated the same way. Basically just a sum
of the averages as reported by fio with numjobs=1. My numbers are
obviously higher, but I'm giving the OSDs a heck of a lot more CPU and
aggregate PCIe/Mem bus than you are so it's not unexpected. It's
interesting that 15.2.14 is showing the best results in your testing but
none of the 15.2.X tests in my setup showed any real advantage. Perhaps
it has something to do with the way you aged/upgraded the cluster.
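As an illustration only, the "sum of the averages" calculation mentioned
above could be sketched like this. It assumes each fio client was run
with --output-format=json and that the per-job average is taken from the
"iops" field of the "read" or "write" section of that report; the
function name aggregate_iops and the command-line handling are
hypothetical, not the script actually used for these numbers.

```python
import json
import sys


def aggregate_iops(fio_json_texts, direction="write"):
    """Sum the average IOPS of every job across several fio JSON reports.

    fio_json_texts: iterable of strings, each one a complete fio JSON report.
    direction: "read" or "write", selecting which section of each job to use.
    """
    total = 0.0
    for text in fio_json_texts:
        report = json.loads(text)
        for job in report["jobs"]:
            total += job[direction]["iops"]
    return total


if __name__ == "__main__":
    # Usage: python this_script.py client1.json client2.json ...
    texts = []
    for path in sys.argv[1:]:
        with open(path) as handle:
            texts.append(handle.read())
    print(f"aggregate write IOPS: {aggregate_iops(texts, 'write'):.0f}")
```

With numjobs=1 each report contains a single job, so the total is simply
one average per client added together.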
One issue that may be relevant for you: At the 2021 Supercomputing Ceph
BOF, Andras Pataki from the Flatiron Institute presented findings on
their dual socket AMD Rome nodes where they were seeing significant
performance impact when running lots of NVMe drives. They believe the
result was due to PCIe scheduler contention/latency with wide variations
in performance depending on which CPU OSDs landed on relative to the
NVMe drives and network. AMD Rome systems typically have a special BIOS
setting called "Preferred I/O" that improves scheduling for a single
given PCIe device (which works), but at the expense of other PCIe
devices, so it doesn't really help. I don't know if there is a recording
of the talk, but it was extremely good. I suspect that may be impacting
your tests, especially if the container setup is resulting in lots of
OSDs landing on the wrong CPU relative to the NVMe drive.