Hi there,

With the help of a lot of people we were able to repair the PG and restore
service. We will follow up later with a full report for future reference.

Regards,

Mart

On 03/30/2016 08:30 PM, Wido den Hollander wrote:
> Hi,
>
> I have an issue with a Ceph cluster which I can't resolve.
>
> Due to an OSD failure a PG is incomplete, but I can't query the PG to see what I
> can do to fix it.
>
>      health HEALTH_WARN
>             1 pgs incomplete
>             1 pgs stuck inactive
>             1 pgs stuck unclean
>             98 requests are blocked > 32 sec
>
> $ ceph pg 3.117 query
>
> That will hang forever.
>
> $ ceph pg dump_stuck
>
> pg_stat state      up         up_primary acting     acting_primary
> 3.117   incomplete [68,55,74] 68         [68,55,74] 68
>
> The primary OSD for this PG is osd.68. If I stop the OSD the PG query works,
> but it says that bringing osd.68 back online will probably help.
>
> The 98 requests which are blocked are also on osd.68 and they all say:
> - initiated
> - reached_pg
>
> The cluster is running Hammer 0.94.5 in this case.
>
> From what I know an OSD had a failing disk and was restarted a couple of times
> while the disk gave errors. This caused the PG to become incomplete.
>
> I've set debug osd to 20, but I can't really tell what is going wrong on osd.68
> which causes it to stall this long.
>
> Any idea what to do here to get this PG up and running again?
>
> Wido
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com