On Fri, 13 May 2022 at 08:56, Stefan Kooman <stefan@xxxxxx> wrote:
>
>> Thanks Janne and all for the insights! The reason why I half-jokingly
>> suggested the cluster 'lost interest' in those last few fixes is that
>> the recovery statistics included in ceph -s reported near-zero
>> activity for so long. After a long while those last few 'were fixed'
>> --- but if the cluster was moving metadata around to fix the 'holdout
>> repairs', that traffic wasn't in the stats. Those last few
>> objects/PGs to be repaired seemingly got fixed 'by magic that didn't
>> include moving data counted in the ceph -s stats'.
>
> It's probably the OMAP data (lots of key-value pairs) that takes a
> long time to replicate (we have PGs with over 4 million objects
> holding just OMAP), and those can take up to 45 minutes to recover,
> all while generating very little network throughput (those are NVMe
> OSDs). You can check this with "watch -n 3 ceph pg ls remapped" and
> see how long each backfill takes, and also whether a PG has a lot of
> OMAP_BYTES and OMAP_KEYS ... but no "BYTES".

Yes, RGW does (or did) place a lot of zero-sized objects in some pools,
with tons of metadata attached to each zero-byte object as a
placeholder for said data. While recovering such PGs on spinning
drives, the amount of metadata one can recover per second is probably
bound by the drive's IOPS limit at some point, and as Stefan says, the
bytes-per-second figure looks abysmal because ceph -s does a very
simple calculation that doesn't take these kinds of objects into
account.

So, for example, if a zero-sized object had 100 metadata "things"
attached to it, and a normal spinning drive can do 100 IOPS, ceph -s
would tell me I am fixing one object per second at the incredible rate
of 0 B/s. (A small sketch of this arithmetic follows at the end of
this mail.)

That would indeed make it look like the cluster "doesn't care anymore",
even while the destination drives are flipping the write head back and
forth 100 times per second, as fast as they physically can, probably
showing near-100% utilization in iostat and similar tools. But the
helicopter-view summary line on recovery speed makes it look like the
cluster doesn't want to finish its repairs...

--
May the most significant bit of your life be positive.
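
As a back-of-the-envelope sketch of the arithmetic above (all numbers
are the made-up ones from the example, and the "one drive IO per OMAP
entry" cost model is a simplifying assumption, not how recovery is
actually implemented):

    # Hypothetical figures from the example above.
    omap_entries_per_object = 100  # metadata "things" per zero-sized object
    drive_iops = 100               # what a normal spinning drive can sustain
    object_payload_bytes = 0       # the placeholder object itself holds no data

    # Assume each OMAP entry costs roughly one IO on the destination drive:
    objects_per_sec = drive_iops / omap_entries_per_object            # -> 1.0

    # The summary counts only object payload bytes, not the OMAP traffic:
    reported_bytes_per_sec = objects_per_sec * object_payload_bytes   # -> 0.0

    print(f"recovering {objects_per_sec:.0f} objects/s "
          f"at a reported {reported_bytes_per_sec:.0f} B/s")

The drive is saturated on IOPS, yet the reported rate is 0 B/s, which
is exactly the "cluster lost interest" impression described above.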