ceph pg query hangs for ever

Wido den Hollander <wido@xxxxxxxx> · Wed, 30 Mar 2016 20:30:23 +0200 (CEST)

Hi,

I have an issue with a Ceph cluster which I can't resolve.

Due to an OSD failure a PG is incomplete, but I can't query the PG to see what
I can do to fix it.

     health HEALTH_WARN
            1 pgs incomplete
            1 pgs stuck inactive
            1 pgs stuck unclean
            98 requests are blocked > 32 sec

$ ceph pg 3.117 query

That command hangs forever.

$ ceph pg dump_stuck

pg_stat	state	up	up_primary	acting	acting_primary
3.117	incomplete	[68,55,74]	68	[68,55,74]	68
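
For anyone wanting to pull the affected PG and its primary out of that output, here is a minimal sketch. It assumes the tab-separated column layout shown above (other releases may print different columns), and incomplete_primaries is a hypothetical helper name, not a Ceph command.

```shell
# Sample input copied from the dump_stuck output above (tab-separated).
sample=$(printf 'pg_stat\tstate\tup\tup_primary\tacting\tacting_primary\n3.117\tincomplete\t[68,55,74]\t68\t[68,55,74]\t68\n')

# Hypothetical helper: print each incomplete PG with its acting primary.
# Assumes column 2 is the state and column 6 is acting_primary.
incomplete_primaries() {
    awk -F'\t' 'NR > 1 && $2 ~ /incomplete/ { print $1, "osd." $6 }'
}

printf '%s\n' "$sample" | incomplete_primaries
# prints: 3.117 osd.68
```

On a live cluster the same filter could be fed with "ceph pg dump_stuck | incomplete_primaries", provided the columns match.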

The primary OSD for this PG is osd.68. If I stop that OSD the PG query works,
but the output says that bringing osd.68 back online will probably help.

The 98 blocked requests are also on osd.68, and they all say:
- initiated
- reached_pg
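
A sketch of how the blocked requests and their event lists can be inspected through the OSD's admin socket. It assumes the commands are run on the host where osd.68 lives. The run wrapper and the DRY_RUN variable are hypothetical additions so that the snippet only prints the commands by default; set DRY_RUN=0 to actually execute them.

```shell
# Hypothetical wrapper: print the command when DRY_RUN=1 (the default here),
# execute it otherwise.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

osd=68

# Requests currently in flight on the OSD, with the events each one has reached.
run ceph daemon "osd.$osd" dump_ops_in_flight

# Recently completed slow requests, for comparison.
run ceph daemon "osd.$osd" dump_historic_ops
```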

The cluster is running Hammer 0.94.5 in this case.

From what I know, an OSD had a failing disk and was restarted a couple of times
while the disk gave errors. This caused the PG to become incomplete.

I've set debug osd to 20, but I can't really tell what is going wrong on osd.68
which causes it to stall this long.

Any idea what to do here to get this PG up and running again?

Wido
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com