On Thu, 12 May 2022 at 00:03, Harry G. Coin <hgcoin@xxxxxxxxx> wrote:
> Might someone explain why the count of degraded items can drop by
> thousands, sometimes tens of thousands, in the same number of hours it
> takes to go from 10 to 0? For example, when an OSD or a host with a few
> OSDs goes offline for a while or reboots.
>
> Sitting at one complete and entire degraded object out of millions for
> longer than it took to write this post.
>
> Seems the fewer the number of degraded objects, the less interested the
> cluster is in fixing it!

Different PGs most likely take different amounts of time and I/O to
recover, depending on their size, the amount of metadata attached to
them, and so on. If so, the PGs you see early on as part of "35 PGs are
backfilling" include both slow and fast ones, and each fast one is
replaced by another PG as soon as it finishes. Once all the easy work is
done, only the slow PGs remain. That makes it look as if the cluster
waited until the end and then didn't want to work as hard on those as on
the first ones, when in fact the total amount of work was always going
to take a long time. (We had SMR drives on gigabit-Ethernet boxes; when
one of those crashed, it took ages to fix.)

The easy parts simply pass by very quickly thanks to the parallelism in
the repairs, leaving you to watch the hard parts, but the parts were
never equal to begin with.
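
To make the long-tail argument above concrete, here is a toy simulation. It is
only a sketch and not how Ceph actually schedules recovery: it assumes a fixed
number of PGs may recover at once (35, borrowed from the "35 PGs are
backfilling" message), that PGs start in queue order, and that a PG costs
either 1 or 100 time units. The function name and all the numbers are made up
for illustration.

```python
import heapq


def recovery_finish_times(costs, slots):
    """Return the finish time of every PG, in completion order.

    Hypothetical model: at most `slots` PGs recover at once, PGs start in
    list order, and a new PG starts as soon as a slot frees up.
    """
    running = []    # min-heap holding the finish times of in-flight PGs
    finished = []   # finish times, appended in non-decreasing order
    now = 0.0
    for cost in costs:
        if len(running) == slots:
            # All slots busy: wait for the earliest in-flight PG to finish.
            now = heapq.heappop(running)
            finished.append(now)
        heapq.heappush(running, now + cost)
    # Nothing left to start; drain whatever is still in flight.
    finished.extend(sorted(running))
    return finished


# Assumed workload: 1000 PGs, every 50th one is slow (100 units), rest take 1.
costs = [100.0 if i % 50 == 0 else 1.0 for i in range(1000)]
times = recovery_finish_times(costs, slots=35)

t90 = times[int(0.9 * len(times)) - 1]   # moment 90% of the PGs are done
t_all = times[-1]                        # moment the last PG is done
print(f"90% of PGs recovered at t={t90:.0f}, all recovered at t={t_all:.0f}")
```

In this model the slow PGs are worked on the whole time, right from the start.
They are just hidden behind the stream of fast PGs finishing next to them, so
the counter only appears to stall once the fast ones have run out.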