Hi,

Our cluster has an SSD pool that contains empty objects with about 100k OMAP keys each (similar to the rgw index pool). If we restart one of the associated SSD OSDs while writing just a few OMAP keys to the cluster, PGs take a very long time to recover, and `ceph status` reports 200k+ keys/s being recovered, even though only maybe a couple thousand new keys were created:

```
recovery: 0 B/s, 268.65k keys/s, 2 objects/s
```

What seems to be happening (but I'd love confirmation of that from a developer) is that any PG that was "tainted" while the OSD was restarting gets marked for recovery, and then, instead of just adding the missing keys, the existing keys are deleted and recreated. What makes me think a large number of keys are being deleted is that we're affected by https://tracker.ceph.com/issues/55324 (we're still running 16.2.7): after the recovery finishes we see slow ops caused by tombstones, and the only way to fix them is to compact the OSD (rough commands in the P.S. below).

Can someone confirm that this is really what's happening? Is this the expected behavior, or is there a way to make OMAP recovery more efficient?

Cheers,

-- Ben
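P.S. In case it helps, here's a rough sketch of how to reproduce the kind of setup described above. The pool/object/OSD names are made up, and the loop is only illustrative (it's slow, but it ends up with the same object shape: an empty object with ~100k OMAP keys):

```
# create an "empty" object carrying a large number of omap keys,
# similar in shape to an rgw bucket index shard
POOL=omap-test                       # placeholder pool name
ceph osd pool create $POOL 32 32
rados -p $POOL create bigdir.0
for i in $(seq 1 100000); do
    rados -p $POOL setomapval bigdir.0 "key_$i" "val_$i"
done

# restart one of the OSDs backing the PG (osd.12 is a placeholder) while
# a few more keys are written, then watch the keys/s figure in `ceph status`
systemctl restart ceph-osd@12 &
for i in $(seq 100001 102000); do
    rados -p $POOL setomapval bigdir.0 "key_$i" "val_$i"
done
ceph status
```

And the workaround we end up using for the tombstone-induced slow ops, once recovery has finished, is simply to compact the affected OSD (again, osd.12 is a placeholder), either online or offline:

```
# online compaction
ceph tell osd.12 compact

# or offline, with the OSD stopped
systemctl stop ceph-osd@12
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-12 compact
systemctl start ceph-osd@12
```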