> However, I'm starting to think that the problem isn't with the number
> of threads that have work to do... the problem may just be that the
> OSD & PG code has enough thread locking happening that there is no
> possible way to have more than a few things happening on a single OSD
> (or perhaps a single placement group).
>
> Has anyone thought about the problem from this angle? This would help
> explain why multiple-OSDs-per-SSD is so effective (even though the
> thought of doing this in production is utterly terrifying).

When researching this topic a few months back, the notes below are what I found. HTH.

We're planning to break up NVMe drives into multiple OSDs. I don't find this terrifying so much as somewhat awkward: we'll have to update our deployment and troubleshooting/maintenance procedures accordingly.

Back in the day it was conventional Ceph wisdom to never put multiple OSDs on a single device, but my sense was that this was an artifact of bottlenecked spinners. I imagine the resultant seek traffic could be ugly, but would it be worse than what we already suffered with colo journals? (*) With a device that can handle lots of IO depth without seeking, IMHO it's not so bad, especially as Ceph has evolved to cope better with larger numbers of OSDs.

"per-osd session lock", "all AIO completions are fired from a single thread – so even if you are pumping data to the OSDs using 8 threads, you are only getting serialized completions"

(A toy sketch of that serialized-completion effect is at the end of this mail.)

https://apawel.me/ceph-creating-multiple-osds-on-nvme-devices-luminous/
https://www.micron.com/-/media/client/global/documents/products/other-documents/micron_9200_max_ceph_12,-d-,2,-d-,8_luminous_bluestore_reference_architecture.pdf?la=en
https://www.spinics.net/lists/ceph-devel/msg41570.html
https://bugzilla.redhat.com/show_bug.cgi?id=1541415
http://tracker.ceph.com/projects/ceph/wiki/Tuning_for_All_Flash_Deployments#NVMe-SSD-partitioning

> With block sizes 64K and lower the avgqu-sz value never went above 1
> under any workload, and I never saw the iostat util% much above 50%.

I've been told that iostat %util isn't as meaningful with SSDs as it was with HDDs, but I don't recall the rationale. ymmv. (There's a second sketch at the end showing where %util and avgqu-sz come from.)

* And ohhhhh did we suffer from them :-x
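
To make the quoted "serialized completions" observation concrete, here is a toy Python sketch. This is not Ceph code, just the general pattern: eight submitter threads queue I/O as fast as they like, but a single dispatcher thread fires every completion, so the per-completion work (faked here as ~1 ms of sleep, an assumption purely for illustration) becomes the ceiling no matter how many submitters you add.

import queue
import threading
import time

completion_q = queue.Queue()

def submitter(thread_id, n_ios):
    # Pretend to issue n_ios asynchronous writes; the "device" completes
    # them instantly and hands each completion to the single dispatcher.
    for i in range(n_ios):
        completion_q.put((thread_id, i))

def completion_dispatcher(expected):
    # The single thread that fires all completion callbacks, one at a time.
    for _ in range(expected):
        thread_id, i = completion_q.get()
        time.sleep(0.001)  # stand-in for per-completion bookkeeping (~1 ms)

submitters = [threading.Thread(target=submitter, args=(t, 100)) for t in range(8)]
dispatcher = threading.Thread(target=completion_dispatcher, args=(800,))

start = time.time()
dispatcher.start()
for s in submitters:
    s.start()
for s in submitters:
    s.join()
dispatcher.join()

# With 8 submitters and ~1 ms of serialized completion work per I/O,
# elapsed time stays around 0.8 s however many submitter threads you add.
print(f"elapsed: {time.time() - start:.2f}s for 800 completions")

Splitting the device into several OSDs sidesteps this by giving you several independent dispatchers, which is consistent with why multiple-OSDs-per-NVMe helps in the references above.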
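
On the %util question: the iostat man page notes that for devices serving requests in parallel (RAID arrays, modern SSDs) %util does not reflect their performance limits, since it only measures the fraction of wall time with at least one request outstanding. The rough sketch below shows where %util and avgqu-sz fall out of /proc/diskstats; the device name (nvme0n1) and the sampling interval are placeholders, adjust to taste.

import time

DEV = "nvme0n1"   # placeholder; point this at the device you're watching
INTERVAL = 5.0    # seconds between samples

def sample(dev):
    # Returns (io_ticks_ms, weighted_io_ticks_ms) for dev from /proc/diskstats.
    # Field 13 (index 12) is ms spent doing I/O       -> drives %util
    # Field 14 (index 13) is weighted ms in the queue -> drives avgqu-sz
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == dev:
                return int(fields[12]), int(fields[13])
    raise ValueError("%s not found in /proc/diskstats" % dev)

before = sample(DEV)
time.sleep(INTERVAL)
after = sample(DEV)

util = 100.0 * (after[0] - before[0]) / (INTERVAL * 1000.0)
avgqu_sz = (after[1] - before[1]) / (INTERVAL * 1000.0)

# %util only says "the device had at least one request in flight this
# fraction of the time"; a device that serves many requests in parallel
# can sit at 100% with plenty of headroom left, so avgqu-sz and latency
# are the more telling numbers on NVMe.
print("%s: util=%.1f%%  avgqu-sz=%.2f" % (DEV, util, avgqu_sz))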