Hi Anthony,

On Tue, Jun 21, 2022 at 04:13:01PM -0700, Anthony D'Atri wrote:
> Have you tried deleting and redeploying such an OSD? With RGW I’ve seen
> backfill be much faster than recovery.

Thanks for the suggestion, but no, I haven't tried it. The problem is that it isn't one specific OSD that misbehaves from time to time; it's every OSD that happens to hold these OMAP objects, which in our case is every SSD OSD. So every time we reboot a server (each server has at least one SSD OSD), the cluster takes ages to recover, even if the reboot itself was quite fast.

But the fact that backfill is faster than recovery kind of confirms my suspicion: recovery deletes the existing keys and then recreates them, which produces a lot of tombstones, and that is slow on 16.2.7; backfill, on the other hand, only needs to create the keys, so no tombstones.

The way I understand it, this is what happens to an object A that has a bunch of OMAP keys and three replicas, A1, A2 and A3:

1. The OSD that contains A3 is stopped;
2. As little as one OMAP key is added to A, which modifies A1 and A2;
3. The OSD that contains A3 starts; A3 no longer matches A1 and A2, so the entire replica A3 is marked for recovery;
4. The recovery process deletes A3, which is then recreated from A1 or A2.

That last step is the part that is less than ideal for OMAP. If A had 100k keys, there are 100k key-value pairs in RocksDB associated with A3; deleting A3 creates 100k tombstones, and then 100k identical key-value pairs get written back (plus the few keys that were added to A in the meantime).

Can anyone confirm that this is what's happening? Is it also the case in Quincy or on master?

Cheers,

--
Ben
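
To make the asymmetry I'm describing concrete, here is a minimal, self-contained Python sketch. It does not use the real RocksDB API at all; the ToyLSM class is purely hypothetical and just models an LSM-style store where a delete is recorded as a tombstone rather than removed in place, which is my assumption about why delete-then-recreate costs so much more than a plain backfill:

    class ToyLSM:
        """Toy append-only log: a delete is stored as a tombstone, not removed in place."""
        def __init__(self):
            self.entries = []            # list of (key, value); value None == tombstone

        def put(self, key, value):
            self.entries.append((key, value))

        def delete(self, key):
            self.entries.append((key, None))   # tombstone

        def tombstones(self):
            return sum(1 for _, v in self.entries if v is None)

    keys = ["omap_key_%d" % i for i in range(100_000)]

    # "Recovery" of replica A3: the existing keys are deleted, then recreated.
    recovery = ToyLSM()
    for k in keys:
        recovery.put(k, "old")           # the keys A3 already had
    for k in keys:
        recovery.delete(k)               # deleting A3 -> one tombstone per key
    for k in keys:
        recovery.put(k, "new")           # recreation from A1 or A2
    print("recovery writes:", len(recovery.entries), "tombstones:", recovery.tombstones())

    # "Backfill" onto a freshly redeployed OSD: inserts only, no tombstones.
    backfill = ToyLSM()
    for k in keys:
        backfill.put(k, "new")
    print("backfill writes:", len(backfill.entries), "tombstones:", backfill.tombstones())

In this toy model, recovery of A3 results in 300k log entries, 100k of them tombstones, while backfill writes only the 100k live keys and no tombstones at all, which would be consistent with backfill being faster in practice.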