Re: objects degraded higher than 100%

Gregory Farnum <gfarnum@xxxxxxxxxx> · Thu, 12 Oct 2017 17:22:57 +0000

On Thu, Oct 12, 2017 at 3:50 AM Florian Haas <florian@xxxxxxxxxxx> wrote:
On Mon, Sep 11, 2017 at 8:13 PM, Andreas Herrmann <andreas@xxxxxxxx> wrote:

> Hi,
>
> how could this happen:
>
>         pgs: 197528/1524 objects degraded (12961.155%)
>
> I did some heavy failover tests, but a value higher than 100% looks strange
> (ceph version 12.2.0). Recovery is quite slow.
>
>   cluster:
>     health: HEALTH_WARN
>             3/1524 objects misplaced (0.197%)
>             Degraded data redundancy: 197528/1524 objects degraded (12961.155%), 1057 pgs unclean, 1055 pgs degraded, 3 pgs undersized
>
>   data:
>     pools:   1 pools, 2048 pgs
>     objects: 508 objects, 1467 MB
>     usage:   127 GB used, 35639 GB / 35766 GB avail
>     pgs:     197528/1524 objects degraded (12961.155%)
>              3/1524 objects misplaced (0.197%)
>              1042 active+recovery_wait+degraded
>              991  active+clean
>              8    active+recovering+degraded
>              3    active+undersized+degraded+remapped+backfill_wait
>              2    active+recovery_wait+degraded+remapped
>              2    active+remapped+backfill_wait
>
>   io:
>     recovery: 340 kB/s, 80 objects/s
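
The arithmetic behind the figures in the status output above can be checked with a short sketch. The replica count of 3 is an assumption, inferred only from 508 objects x 3 = 1524; the function name degraded_percent is made up for illustration and is not a Ceph API.

```python
def degraded_percent(degraded: int, total_copies: int) -> float:
    """Percentage as printed in the status output: degraded / total copies."""
    return 100.0 * degraded / total_copies


objects = 508   # "objects: 508 objects" from the status output
replicas = 3    # ASSUMPTION: pool size 3, inferred from 1524 / 508
total_copies = objects * replicas

degraded = 197528   # numerator reported by the cluster
misplaced = 3

print(total_copies)
print(f"{degraded_percent(degraded, total_copies):.3f}%")
print(f"{degraded_percent(misplaced, total_copies):.3f}%")

# Degraded entries per stored object: far more than one per object,
# so the numerator cannot be a count of copies of the 508 live objects.
per_object = degraded / objects
print(f"{per_object:.1f} degraded entries per object")
```

The percentage itself is computed consistently; the surprising part is that the numerator is about 389 times the number of objects in the pool, while the denominator only covers copies of objects that currently exist.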

Did you ever get to the bottom of this? I'm seeing something very
similar on a 12.2.1 reference system:

https://gist.github.com/fghaas/f547243b0f7ebb78ce2b8e80b936e42c

I'm also seeing an unusual MISSING_ON_PRIMARY count in "rados df":

https://gist.github.com/fghaas/59cd2c234d529db236c14fb7d46dfc85

The odd thing in there is that the "bench" pool was empty when the
recovery started (that pool had been wiped with "rados cleanup"), so
the number of objects deemed to be missing from the primary really
ought to be zero.

It seems like it's considering these deleted objects to still require
replication, but that sounds rather far-fetched to be honest.

Actually, that makes some sense. This cluster had an OSD down while (some of) the deletes were happening?

I haven't dug through the code, but I bet it is considering those as degraded objects because the out-of-date OSD knows it doesn't have the latest versions of them! :)
-Greg
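
The guess above can be illustrated with a toy model. This is not Ceph's accounting code, which nobody in this thread has checked; ToyOSD, record and every number below are hypothetical. It only shows how a ratio above 100% can arise if deletes that a down OSD missed are counted in the numerator while the denominator counts only copies of objects that still exist.

```python
from dataclasses import dataclass, field


@dataclass
class ToyOSD:
    """Hypothetical stand-in for an OSD: object name -> last version seen."""
    versions: dict = field(default_factory=dict)


def record(osds, name, version):
    """Record that each listed OSD has seen `version` of object `name`."""
    for osd in osds:
        osd.versions[name] = version


primary, stale = ToyOSD(), ToyOSD()
live = set()

# Both OSDs up: 1000 benchmark objects and 5 real objects are written.
for i in range(1000):
    record([primary, stale], f"bench_{i}", 1)
    live.add(f"bench_{i}")
for i in range(5):
    record([primary, stale], f"data_{i}", 1)
    live.add(f"data_{i}")

# The second OSD goes down; the benchmark objects are deleted meanwhile.
# The delete is a newer version that only the primary knows about.
for i in range(1000):
    record([primary], f"bench_{i}", 2)
    live.discard(f"bench_{i}")

# ASSUMED accounting: every object on which the stale OSD is behind
# counts as degraded, whether or not the object still exists.
behind = sum(
    1 for name, version in primary.versions.items()
    if stale.versions.get(name, 0) < version
)
live_copies = len(live) * 2          # 2 copies per live object in this toy
percent = 100.0 * behind / live_copies

print(f"{behind}/{live_copies} objects degraded ({percent:.3f}%)")
```

In this toy setup the 1000 missed deletes are measured against 10 live copies, so the printed ratio is well over 100%, the same shape as the numbers reported in this thread.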
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com