Hi Anthony,

On Tue, Jun 21, 2022 at 04:13:01PM -0700, Anthony D'Atri wrote:
> Have you tried deleting and redeploying such an OSD? With RGW I’ve seen
> backfill be much faster than recovery.

Thanks for the suggestion, but no, I haven't tried it. The problem is that it isn't one specific OSD that misbehaves from time to time; it's every OSD that happens to hold these OMAP objects, which in our case is every SSD OSD. So every time we reboot a server (each server has at least one SSD OSD), the cluster takes ages to recover, even if the reboot itself was quite fast.

But the fact that backfill is faster than recovery kind of confirms my suspicion: recovery deletes the existing keys and then recreates them, which produces a lot of tombstones, and that is slow on 16.2.7; backfill, on the other hand, only needs to create the keys, so no tombstones.

The way I understand it, this is what happens to an object A that has a bunch of OMAP keys and three replicas, A1, A2 and A3:

1. The OSD that contains A3 is stopped;
2. As little as one OMAP key is added to A, which modifies A1 and A2;
3. The OSD that contains A3 starts; A3 no longer matches A1 and A2, so the entire replica A3 is marked for recovery;
4. The recovery process deletes A3, which is then recreated from A1 or A2.

That last step is the part that is less than ideal for OMAP. If A had 100k keys, there are 100k key-value pairs in RocksDB associated with A3; deleting A3 creates 100k tombstones, and then 100k identical key-value pairs get written back (plus the few keys that were added to A in the meantime).

Can anyone confirm that this is what's happening? Is it also the case in Quincy or on master?

Cheers,

--
Ben
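
To make the asymmetry I'm describing concrete, here is a minimal, self-contained Python sketch. It does not use the real RocksDB API at all; the ToyLSM class is purely hypothetical and just models an LSM-style store where a delete is recorded as a tombstone rather than removed in place, which is my assumption about why delete-then-recreate costs so much more than a plain backfill:

    class ToyLSM:
        """Toy append-only log: a delete is stored as a tombstone, not removed in place."""
        def __init__(self):
            self.entries = []            # list of (key, value); value None == tombstone

        def put(self, key, value):
            self.entries.append((key, value))

        def delete(self, key):
            self.entries.append((key, None))   # tombstone

        def tombstones(self):
            return sum(1 for _, v in self.entries if v is None)

    keys = ["omap_key_%d" % i for i in range(100_000)]

    # "Recovery" of replica A3: the existing keys are deleted, then recreated.
    recovery = ToyLSM()
    for k in keys:
        recovery.put(k, "old")           # the keys A3 already had
    for k in keys:
        recovery.delete(k)               # deleting A3 -> one tombstone per key
    for k in keys:
        recovery.put(k, "new")           # recreation from A1 or A2
    print("recovery writes:", len(recovery.entries), "tombstones:", recovery.tombstones())

    # "Backfill" onto a freshly redeployed OSD: inserts only, no tombstones.
    backfill = ToyLSM()
    for k in keys:
        backfill.put(k, "new")
    print("backfill writes:", len(backfill.entries), "tombstones:", backfill.tombstones())

In this toy model, recovery of A3 results in 300k log entries, 100k of them tombstones, while backfill writes only the 100k live keys and no tombstones at all, which would be consistent with backfill being faster in practice.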