On 11-09-15 12:22, Gregory Farnum wrote:
> On Thu, Sep 10, 2015 at 9:46 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
>> Hi,
>>
>> I'm running into an issue with Ceph 0.94.2/3 where, after doing a recovery
>> test, 9 PGs stay incomplete:
>>
>>      osdmap e78770: 2294 osds: 2294 up, 2294 in
>>       pgmap v1972391: 51840 pgs, 7 pools, 220 TB data, 185 Mobjects
>>             755 TB used, 14468 TB / 15224 TB avail
>>                  51831 active+clean
>>                      9 incomplete
>>
>> As you can see, all 2294 OSDs are online and almost all PGs became
>> active+clean again, except for 9.
>>
>> I found out that these PGs are the problem:
>>
>> 10.3762
>> 7.309e
>> 7.29a2
>> 10.2289
>> 7.17dd
>> 10.165a
>> 7.1050
>> 7.c65
>> 10.abf
>>
>> Digging further, all these PGs map back to an OSD which is running on the
>> same host, 'ceph-stg-01' in this case.
>>
>> $ ceph pg 10.3762 query
>>
>> Looking at the recovery state, this is shown:
>>
>> {
>>     "first": 65286,
>>     "last": 67355,
>>     "maybe_went_rw": 0,
>>     "up": [
>>         1420,
>>         854,
>>         1105
>>     ],
>>     "acting": [
>>         1420
>>     ],
>>     "primary": 1420,
>>     "up_primary": 1420
>> },
>>
>> osd.1420 is online. I tried restarting it, but nothing happens; these 9
>> PGs stay incomplete.
>>
>> Under 'peer_info' I see both osd.854 and osd.1105 reporting about
>> the PG with identical numbers.
>>
>> I restarted both 854 and 1105, without result.
>>
>> The output of pg query can be found here: http://pastebin.com/qQL699zC
>
> Hmm. The pg query results from each peer aren't quite the same but
> look largely consistent to me. I think somebody from the RADOS team
> will need to check it out. I do see that the log tail on the primary
> hasn't advanced as far as the other peers have, but I'm not sure if
> that's the OSD being responsible or evidence of the root cause...
> -Greg
>

That's what I noticed as well. I ran osd.1420 with debug osd/filestore = 20
and the output is here:

http://ceph.o.auroraobjects.eu/tmp/txc1-osd.1420.log.gz

I can't tell what is going on; I don't see any errors, but that is probably
me not being able to diagnose the logs properly.

>>
>> The cluster is running a mix of 0.94.2 and .3 on Ubuntu 14.04.2 with the
>> 3.13 kernel. XFS is being used as the backing filesystem.
>>
>> Any suggestions to fix this issue? There is no valuable data in these
>> pools, so I could remove them, but I'd rather fix the root cause.
>>
>> --
>> Wido den Hollander
>> 42on B.V.
>> Ceph trainer and consultant
>>
>> Phone: +31 (0)20 700 9902
>> Skype: contact42on
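
For anyone debugging a similar state, a minimal sketch of how one might pull
the recovery state out of every stuck PG in one pass. This is not from the
thread itself; it assumes admin access via the 'ceph' CLI and that the 'jq'
JSON tool is installed:

    # Sketch: dump the peering/recovery history of each incomplete PG above.
    # Assumes a working 'ceph' admin keyring and 'jq' on the PATH.
    for pg in 10.3762 7.309e 7.29a2 10.2289 7.17dd 10.165a 7.1050 7.c65 10.abf; do
        echo "=== $pg ==="
        # 'ceph pg <pgid> query' emits JSON; '.recovery_state' holds the
        # peering history that Wido quotes above.
        ceph pg "$pg" query | jq '.recovery_state'
    done

Comparing the '.recovery_state' output side by side across the nine PGs makes
it easier to spot whether they are all blocked at the same peering interval.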
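Similarly, rather than restarting an OSD to get the verbose logging used above
(debug osd/filestore = 20), debug levels can be injected into a running daemon.
A sketch using the stock CLI, again not something done in the thread:

    # Raise debug levels on osd.1420 at runtime, then put them back afterwards.
    ceph tell osd.1420 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'
    # ... capture logs while the PG attempts to peer ...
    ceph tell osd.1420 injectargs '--debug-osd 0/5 --debug-filestore 0/5 --debug-ms 0/5'

injectargs takes the same option names as ceph.conf; a level like '0/5' sets
the on-disk log level and the in-memory gather level separately, so the daemon
keeps detailed recent history without flooding its log file.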