Re: 9 PGs stay incomplete

On Thu, Sep 10, 2015 at 9:46 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
> Hi,
>
> I'm running into an issue with Ceph 0.94.2/3 where, after doing a recovery
> test, 9 PGs stay incomplete:
>
> osdmap e78770: 2294 osds: 2294 up, 2294 in
> pgmap v1972391: 51840 pgs, 7 pools, 220 TB data, 185 Mobjects
>        755 TB used, 14468 TB / 15224 TB avail
>           51831 active+clean
>               9 incomplete
>
> As you can see, all 2294 OSDs are online and almost all PGs became
> active+clean again, except for 9.
>
> I found out that these PGs are the problem:
>
> 10.3762
> 7.309e
> 7.29a2
> 10.2289
> 7.17dd
> 10.165a
> 7.1050
> 7.c65
> 10.abf
>
> Digging further, all of these PGs map back to an OSD which is running on
> the same host, 'ceph-stg-01' in this case.
>
> $ ceph pg 10.3762 query
>
> Looking at the recovery state, this is shown:
>
>                 {
>                     "first": 65286,
>                     "last": 67355,
>                     "maybe_went_rw": 0,
>                     "up": [
>                         1420,
>                         854,
>                         1105
>                     ],
>                     "acting": [
>                         1420
>                     ],
>                     "primary": 1420,
>                     "up_primary": 1420
>                 },
>
> osd.1420 is online. I tried restarting it, but nothing happens; these 9
> PGs stay incomplete.
>
> Under 'peer_info' I see both osd.854 and osd.1105 reporting the PG with
> identical numbers.
>
> I restarted both osd.854 and osd.1105, but without result.
>
> The output of PG query can be found here: http://pastebin.com/qQL699zC

Hmm. The pg query results from each peer aren't quite the same, but they
look largely consistent to me. I think somebody from the RADOS team will
need to check it out. I do see that the log tail on the primary hasn't
advanced as far as it has on the other peers, but I'm not sure whether
that means this OSD is itself responsible or is just evidence of the root
cause...
-Greg
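
For anyone wanting to compare those log boundaries by hand, the fields can
be pulled straight out of the pg query JSON. This is just a minimal sketch,
assuming jq is available and that the output uses the usual info/peer_info
layout with log_tail, last_update and peer fields (exact names can differ
between releases); the pg-10.3762.json filename is only an example:

$ ceph pg 10.3762 query > pg-10.3762.json

# log boundaries recorded by the acting primary
$ jq '.info | {log_tail, last_update}' pg-10.3762.json

# the same fields as reported by each peer listed under peer_info
$ jq '.peer_info[] | {peer, log_tail, last_update}' pg-10.3762.json

If the primary's log_tail sits well behind the tails reported by the peers,
that at least narrows down which OSD to inspect more closely.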

>
> The cluster is running a mix of 0.94.2 and 0.94.3 on Ubuntu 14.04.2 with
> the 3.13 kernel. XFS is being used as the backing filesystem.
>
> Any suggestions to fix this issue? There is no valuable data in these
> pools, so I can remove them, but I'd rather fix the root cause.
>
> --
> Wido den Hollander
> 42on B.V.
> Ceph trainer and consultant
>
> Phone: +31 (0)20 700 9902
> Skype: contact42on
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


