On Fri, 13 May 2022 at 08:56, Stefan Kooman <stefan@xxxxxx> wrote:
>
>> Thanks Janne and all for the insights! The reason why I half-jokingly
>> suggested the cluster 'lost interest' in those last few fixes is that
>> the recovery statistics included in ceph -s reported near-zero
>> activity for so long. After a long while those last few 'were fixed'
>> --- but if the cluster was moving metadata around to fix the 'holdout
>> repairs', that traffic wasn't in the stats. Those last few
>> objects/PGs to be repaired seemingly got fixed 'by magic that didn't
>> include moving data counted in the ceph -s stats'.
>
> It's probably the OMAP data (lots of key-value pairs) that takes a
> long time to replicate (we have PGs with over 4 million objects
> holding just OMAP), and those can take up to 45 minutes to recover,
> all while generating very little network throughput (those are NVMe
> OSDs). You can check this with "watch -n 3 ceph pg ls remapped" and
> see how long each backfill takes, and also whether a PG has a lot of
> OMAP_BYTES and OMAP_KEYS ... but no "BYTES".

Yes, RGW does (or did) place a lot of zero-sized objects in some pools,
with tons of metadata attached to each zero-byte object as a
placeholder for said data. While recovering such PGs on spinning
drives, the amount of metadata one can recover per second is probably
bound by the drive's IOPS limit at some point, and as Stefan says, the
bytes-per-second figure looks abysmal because ceph -s does a very
simple calculation that doesn't take these kinds of objects into
account.

So, for example, if a zero-sized object had 100 metadata "things"
attached to it, and a normal spinning drive can do 100 IOPS, ceph -s
would tell me I am fixing one object per second at the incredible rate
of 0 B/s. (A small sketch of this arithmetic follows at the end of
this mail.)

That would indeed make it look like the cluster "doesn't care anymore",
even while the destination drives are flipping the write head back and
forth 100 times per second, as fast as they physically can, probably
showing near-100% utilization in iostat and similar tools. But the
helicopter-view summary line on recovery speed makes it look like the
cluster doesn't want to finish its repairs...

--
May the most significant bit of your life be positive.
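
As a back-of-the-envelope sketch of the arithmetic above (all numbers
are the made-up ones from the example, and the "one drive IO per OMAP
entry" cost model is a simplifying assumption, not how recovery is
actually implemented):

    # Hypothetical figures from the example above.
    omap_entries_per_object = 100  # metadata "things" per zero-sized object
    drive_iops = 100               # what a normal spinning drive can sustain
    object_payload_bytes = 0       # the placeholder object itself holds no data

    # Assume each OMAP entry costs roughly one IO on the destination drive:
    objects_per_sec = drive_iops / omap_entries_per_object            # -> 1.0

    # The summary counts only object payload bytes, not the OMAP traffic:
    reported_bytes_per_sec = objects_per_sec * object_payload_bytes   # -> 0.0

    print(f"recovering {objects_per_sec:.0f} objects/s "
          f"at a reported {reported_bytes_per_sec:.0f} B/s")

The drive is saturated on IOPS, yet the reported rate is 0 B/s, which
is exactly the "cluster lost interest" impression described above.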