> However, I'm starting to think that the problem isn't with the number
> of threads that have work to do... the problem may just be that the
> OSD & PG code has enough thread locking happening that there is no
> possible way to have more than a few things happening on a single OSD
> (or perhaps a single placement group).
>
> Has anyone thought about the problem from this angle? This would help
> explain why multiple-OSDs-per-SSD is so effective (even though the
> thought of doing this in production is utterly terrifying).

When researching this topic a few months back, the notes below are what I found. HTH.

We're planning to break up NVMe drives into multiple OSDs. I don't find this terrifying so much as somewhat awkward: we'll have to update our deployment and troubleshooting/maintenance procedures accordingly.

Back in the day it was conventional Ceph wisdom to never put multiple OSDs on a single device, but my sense was that this was an artifact of bottlenecked spinners. I imagine the resultant seek traffic could be ugly, but would it be worse than what we already suffered with colo journals? (*) With a device that can handle lots of IO depth without seeking, IMHO it's not so bad, especially as Ceph has evolved to cope better with larger numbers of OSDs.

"per-osd session lock", "all AIO completions are fired from a single thread – so even if you are pumping data to the OSDs using 8 threads, you are only getting serialized completions"

(A toy sketch of that serialized-completion effect is at the end of this mail.)

https://apawel.me/ceph-creating-multiple-osds-on-nvme-devices-luminous/
https://www.micron.com/-/media/client/global/documents/products/other-documents/micron_9200_max_ceph_12,-d-,2,-d-,8_luminous_bluestore_reference_architecture.pdf?la=en
https://www.spinics.net/lists/ceph-devel/msg41570.html
https://bugzilla.redhat.com/show_bug.cgi?id=1541415
http://tracker.ceph.com/projects/ceph/wiki/Tuning_for_All_Flash_Deployments#NVMe-SSD-partitioning

> With block sizes 64K and lower the avgqu-sz value never went above 1
> under any workload, and I never saw the iostat util% much above 50%.

I've been told that iostat %util isn't as meaningful with SSDs as it was with HDDs, but I don't recall the rationale. ymmv. (There's a second sketch at the end showing where %util and avgqu-sz come from.)

* And ohhhhh did we suffer from them :-x
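
To make the quoted "serialized completions" observation concrete, here is a toy Python sketch. This is not Ceph code, just the general pattern: eight submitter threads queue I/O as fast as they like, but a single dispatcher thread fires every completion, so the per-completion work (faked here as ~1 ms of sleep, an assumption purely for illustration) becomes the ceiling no matter how many submitters you add.

import queue
import threading
import time

completion_q = queue.Queue()

def submitter(thread_id, n_ios):
    # Pretend to issue n_ios asynchronous writes; the "device" completes
    # them instantly and hands each completion to the single dispatcher.
    for i in range(n_ios):
        completion_q.put((thread_id, i))

def completion_dispatcher(expected):
    # The single thread that fires all completion callbacks, one at a time.
    for _ in range(expected):
        thread_id, i = completion_q.get()
        time.sleep(0.001)  # stand-in for per-completion bookkeeping (~1 ms)

submitters = [threading.Thread(target=submitter, args=(t, 100)) for t in range(8)]
dispatcher = threading.Thread(target=completion_dispatcher, args=(800,))

start = time.time()
dispatcher.start()
for s in submitters:
    s.start()
for s in submitters:
    s.join()
dispatcher.join()

# With 8 submitters and ~1 ms of serialized completion work per I/O,
# elapsed time stays around 0.8 s however many submitter threads you add.
print(f"elapsed: {time.time() - start:.2f}s for 800 completions")

Splitting the device into several OSDs sidesteps this by giving you several independent dispatchers, which is consistent with why multiple-OSDs-per-NVMe helps in the references above.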
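
On the %util question: the iostat man page notes that for devices serving requests in parallel (RAID arrays, modern SSDs) %util does not reflect their performance limits, since it only measures the fraction of wall time with at least one request outstanding. The rough sketch below shows where %util and avgqu-sz fall out of /proc/diskstats; the device name (nvme0n1) and the sampling interval are placeholders, adjust to taste.

import time

DEV = "nvme0n1"   # placeholder; point this at the device you're watching
INTERVAL = 5.0    # seconds between samples

def sample(dev):
    # Returns (io_ticks_ms, weighted_io_ticks_ms) for dev from /proc/diskstats.
    # Field 13 (index 12) is ms spent doing I/O       -> drives %util
    # Field 14 (index 13) is weighted ms in the queue -> drives avgqu-sz
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == dev:
                return int(fields[12]), int(fields[13])
    raise ValueError("%s not found in /proc/diskstats" % dev)

before = sample(DEV)
time.sleep(INTERVAL)
after = sample(DEV)

util = 100.0 * (after[0] - before[0]) / (INTERVAL * 1000.0)
avgqu_sz = (after[1] - before[1]) / (INTERVAL * 1000.0)

# %util only says "the device had at least one request in flight this
# fraction of the time"; a device that serves many requests in parallel
# can sit at 100% with plenty of headroom left, so avgqu-sz and latency
# are the more telling numbers on NVMe.
print("%s: util=%.1f%%  avgqu-sz=%.2f" % (DEV, util, avgqu_sz))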