Re: Squid: deep scrub issues

Hi all,

Just came back from this year's Cephalocon and managed to get a quick chat with Ronen about this issue. He gave a great presentation [1, 2] on the upcoming changes to scrubbing in Tentacle, as well as on some changes already made in the Squid release.
The primary suspect here is the mclock scheduler and the way replica reservations are made since 19.2.0. Regular scrubs begin with the primary asking all acting-set replicas to allow the scrub to proceed; each replica either grants the request immediately or queues it. As I understand it, previous releases would, instead of queuing, send a simple denial on the spot when resources were tight (this happens when the scrub map is requested from the acting-set members, but I might be wrong). For some reason, with mclock this can lead to acting sets endlessly queuing these scrub requests and never actually completing them.
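For anyone wanting to poke at this directly: if I recall correctly the OSDs expose their current scrub reservation state via the admin socket, so something like the following (osd.0 being just an example daemon) should show whether a replica is holding or queuing reservations - though I haven't re-checked the exact command name on Squid, so treat it as a pointer rather than gospel:

    ceph daemon osd.0 dump_scrub_reservations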
As for the configuration: in Squid the osd_scrub_cost default is 52428800. I'm having a hard time finding previous values, but the Red Hat docs [3] have it set at 50 << 20, which actually works out to the same 52428800, so the raw number may not be new. Either way, unless the whole logic/calculation has changed, such an enormous cost value will simply never allow resources to be granted under mclock.
Another suspect is osd_scrub_event_cost, which has been set to 4096. Once again, I'm having a hard time finding values from previous versions to compare against.
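To check what a cluster is actually running with, the generic config commands are enough (osd.0 below is just an example daemon):

    # what the cluster-wide default/override resolves to
    ceph config get osd osd_scrub_cost
    ceph config get osd osd_scrub_event_cost

    # what a specific daemon is effectively using
    ceph config show osd.0 osd_scrub_cost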

One thing we've found is that there is now a config option osd_scrub_disable_reservation_queuing (default: false): "When set - scrub replica reservations are responded to immediately, with either success or failure (the pre-Squid version behaviour). This configuration option is introduced to support mixed-version clusters and debugging, and will be removed in the next release." My guess is that setting this to true would simply revert scrubbing behaviour to that of Reef and earlier releases.
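If we end up needing that escape hatch, it should be a one-liner (and as far as I can tell it's runtime-changeable, so no OSD restarts needed to test it):

    ceph config set osd osd_scrub_disable_reservation_queuing true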

To keep all the work done on the scrubbing changes in place, we will first try reducing osd_scrub_cost to a much lower value (50 or even less) and check whether that helps our case. If not, we will reduce osd_scrub_event_cost as well, since at this point we're not sure which of the two has the direct impact.
If that doesn't help, we will have to set osd_scrub_disable_reservation_queuing to true, but that would leave us with simply the old way scrubs are done (not cool - we want the fancy new way). And if even that doesn't help, we will have to start thinking about switching to wpq instead of mclock, which is also not that cool looking at where Ceph is heading.
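Concretely, the plan translates to something like the following - the osd_scrub_event_cost value is a placeholder we haven't settled on yet, not a tested recommendation:

    # step 1: drastically lower the scrub cost
    ceph config set osd osd_scrub_cost 50

    # step 2 (if needed): same treatment for the per-event cost
    ceph config set osd osd_scrub_event_cost 1024

    # last resort: fall back to wpq; note that osd_op_queue only
    # takes effect after an OSD restart
    ceph config set osd osd_op_queue wpq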
 
I'll keep the mailing list (and tracker) updated with our findings.

Best,
Laimis J.


1 - https://ceph2024.sched.com/event/1ktWh/the-scrub-type-to-limitations-matrix-ronen-friedman-ibm
2 - https://static.sched.com/hosted_files/ceph2024/08/ceph24_main%20%284%29.pdf
3 - https://docs.redhat.com/en/documentation/red_hat_ceph_storage/2/html/configuration_guide/osd_configuration_reference#scrubbing
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


