Re: How to maximize the OSD effective queue depth in Ceph?

> However, I'm starting to think that the problem isn't with the number
> of threads that have work to do... the problem may just be that the
> OSD & PG code has enough thread locking happening that there is no
> possible way to have more than a few things happening on a single OSD
> (or perhaps a single placement group).
> 
> Has anyone thought about the problem from this angle?  This would help
> explain why multiple-OSDs-per-SSD is so effective (even though the
> thought of doing this in production is utterly terrifying).


When I researched this topic a few months back, the below is what I found, HTH.  We’re planning to break up NVMe drives into multiple OSDs.  I don’t find this terrifying so much as somewhat awkward: we’ll have to update our deployment and troubleshooting/maintenance procedures accordingly.

Back in the day it was conventional Ceph wisdom never to put multiple OSDs on a single device, but my sense is that was an artifact of bottlenecked spinners.  I imagine the resulting seek traffic could be ugly, but would it be worse than what we already suffered with colocated journals?  (*)  With a device that can handle lots of I/O depth without seeking, IMHO it’s not so bad, especially as Ceph has evolved to cope better with larger numbers of OSDs.
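For what it’s worth, here is a rough sketch of the kind of split we have in mind, assuming ceph-volume’s "lvm batch" subcommand with --osds-per-device (available in recent releases).  The device paths and per-device OSD count are placeholders, not a recommendation:

#!/usr/bin/env python3
# Rough sketch (not our actual tooling): split each NVMe device into N OSDs
# via ceph-volume "lvm batch --osds-per-device".  Device paths and the
# per-device OSD count below are placeholders.
import subprocess

NVME_DEVICES = ["/dev/nvme0n1", "/dev/nvme1n1"]   # placeholder device paths
OSDS_PER_DEVICE = 2                               # placeholder; size to taste

def deploy(devices, osds_per_device, dry_run=True):
    cmd = [
        "ceph-volume", "lvm", "batch",
        "--osds-per-device", str(osds_per_device),
    ] + devices
    if dry_run:
        # --report shows what would be created without touching the devices
        cmd.append("--report")
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    deploy(NVME_DEVICES, OSDS_PER_DEVICE, dry_run=True)

Running with --report first is a cheap sanity check before letting it touch the drives.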


"per-osd session lock", "all AIO completions are fired from a single thread – so even if you are pumping data to the OSDs using 8 threads, you are only getting serialized completions”

https://apawel.me/ceph-creating-multiple-osds-on-nvme-devices-luminous/

https://www.micron.com/-/media/client/global/documents/products/other-documents/micron_9200_max_ceph_12,-d-,2,-d-,8_luminous_bluestore_reference_architecture.pdf?la=en

https://www.spinics.net/lists/ceph-devel/msg41570.html

https://bugzilla.redhat.com/show_bug.cgi?id=1541415

http://tracker.ceph.com/projects/ceph/wiki/Tuning_for_All_Flash_Deployments#NVMe-SSD-partitioning
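
One practical upshot of the serialized-completions point quoted above: a single client connection only gets you so far, so spreading load across several independent RADOS client instances is one way to push the effective queue depth up from the client side.  A toy sketch using python-rados (the pool name, object counts, and number of connections are made up for illustration; this is not a real benchmark):

#!/usr/bin/env python3
# Toy illustration: several independent librados connections, each issuing
# async writes, so completions are not funneled through a single client
# instance.  Pool name, payload size, and counts are placeholders.
import threading
import rados

POOL = "bench-test"           # placeholder pool
CLIENTS = 4                   # independent RADOS connections
OBJECTS_PER_CLIENT = 64
PAYLOAD = b"x" * (64 * 1024)  # 64K writes, matching the sizes discussed

def worker(client_id):
    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx(POOL)
    completions = []
    for i in range(OBJECTS_PER_CLIENT):
        name = f"qd-test-{client_id}-{i}"
        # aio_write_full queues the write and returns immediately, so many
        # writes from this connection are in flight at once.
        completions.append(ioctx.aio_write_full(name, PAYLOAD))
    for c in completions:
        c.wait_for_complete()
    ioctx.close()
    cluster.shutdown()

threads = [threading.Thread(target=worker, args=(i,)) for i in range(CLIENTS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

Several parallel rados bench processes get at the same idea with less typing.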


> With block sizes 64K and lower the avgqu-sz value never went above 1
> under any workload, and I never saw the iostat util% much above 50%.


I’ve been told that iostat %util isn’t as meaningful with SSDs as it was with HDDs; as I understand it, %util only measures the fraction of wall-clock time the device had at least one request outstanding, so a device that services many requests in parallel can report high %util while still having plenty of headroom.  YMMV.
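
A quick sketch of where that number comes from, reading the same counter iostat uses straight from /proc/diskstats (the device name is a placeholder; field layout per the kernel's iostats documentation):

#!/usr/bin/env python3
# Sketch: compute the %util iostat reports, straight from /proc/diskstats.
# It is just "fraction of wall-clock time with >=1 request in flight", which
# a deeply parallel NVMe device can pin near 100% long before it saturates.
import time

DEVICE = "nvme0n1"   # placeholder device name
INTERVAL = 1.0       # sampling interval in seconds

def io_ticks_ms(device):
    """Return milliseconds spent doing I/O (10th stat field after the name)."""
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                return int(fields[12])
    raise ValueError(f"{device} not found in /proc/diskstats")

t0, before = time.time(), io_ticks_ms(DEVICE)
time.sleep(INTERVAL)
t1, after = time.time(), io_ticks_ms(DEVICE)

util = 100.0 * (after - before) / ((t1 - t0) * 1000.0)
print(f"{DEVICE} %util over {t1 - t0:.2f}s: {util:.1f}%")

Note that this counter says nothing about how many requests were in flight at once, only that at least one was.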




*  And ohhhhh did we suffer from them :-x
