On Mon, Sep 11, 2017 at 8:13 PM, Andreas Herrmann <andreas@xxxxxxxx> wrote:
> Hi,
>
> how could this happen:
>
>     pgs:     197528/1524 objects degraded (12961.155%)
>
> I did some heavy failover tests, but a value higher than 100% looks strange
> (ceph version 12.2.0). Recovery is quite slow.
>
>   cluster:
>     health: HEALTH_WARN
>             3/1524 objects misplaced (0.197%)
>             Degraded data redundancy: 197528/1524 objects degraded
> (12961.155%), 1057 pgs unclean, 1055 pgs degraded, 3 pgs undersized
>
>   data:
>     pools:   1 pools, 2048 pgs
>     objects: 508 objects, 1467 MB
>     usage:   127 GB used, 35639 GB / 35766 GB avail
>     pgs:     197528/1524 objects degraded (12961.155%)
>              3/1524 objects misplaced (0.197%)
>              1042 active+recovery_wait+degraded
>              991  active+clean
>              8    active+recovering+degraded
>              3    active+undersized+degraded+remapped+backfill_wait
>              2    active+recovery_wait+degraded+remapped
>              2    active+remapped+backfill_wait
>
>   io:
>     recovery: 340 kB/s, 80 objects/s

Did you ever get to the bottom of this? I'm seeing something very similar
on a 12.2.1 reference system:

https://gist.github.com/fghaas/f547243b0f7ebb78ce2b8e80b936e42c

I'm also seeing an unusual MISSING_ON_PRIMARY count in "rados df":

https://gist.github.com/fghaas/59cd2c234d529db236c14fb7d46dfc85

The odd thing in there is that the "bench" pool was empty when the recovery
started (that pool had been wiped with "rados cleanup"), so the number of
objects deemed to be missing from the primary really ought to be zero.

It seems like it's considering these deleted objects to still require
replication, but that sounds rather far-fetched, to be honest.

Cheers,
Florian
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
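
For context, the percentage in the HEALTH_WARN line above is simply the
degraded count divided by the total object count, which is why an inflated
degraded counter can push it far past 100%. A minimal sketch in plain Python
(not Ceph code, just the arithmetic implied by the status output):

    # Reproduce the "objects degraded" figure from the status output above.
    # Assumption: the percentage is degraded / total * 100, so a degraded
    # counter much larger than the object total yields a value over 100%.
    degraded = 197528   # numerator from "197528/1524 objects degraded"
    total = 1524        # total object count reported by the cluster
    print(f"{degraded / total * 100:.3f}%")   # -> 12961.155%, as reported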