Re: ceph pg query hangs for ever

Mart van Santen <mart@xxxxxxxxxxxx> · Wed, 30 Mar 2016 23:36:42 +0200

Hi there,

With the help of a lot of people we were able to repair the PG and
restore service. We will get back to this later with a full report for
future reference.

Regards,

Mart

On 03/30/2016 08:30 PM, Wido den Hollander wrote:
> Hi,
>
> I have an issue with a Ceph cluster which I can't resolve.
>
> Due to an OSD failure a PG is incomplete, but I can't query the PG to see what I
> can do to fix it.
>
>      health HEALTH_WARN
>             1 pgs incomplete
>             1 pgs stuck inactive
>             1 pgs stuck unclean
>             98 requests are blocked > 32 sec
>
> $ ceph pg 3.117 query
>
> That will hang for ever.
>
> $ ceph pg dump_stuck
>
> pg_stat	state	up	up_primary	acting	acting_primary
> 3.117	incomplete	[68,55,74]	68	[68,55,74]	68
>
> The primary OSD for this PG is osd.68. If I stop that OSD the PG query works,
> but it says that bringing osd.68 back online will probably help.
>
> The 98 requests which are blocked are also on osd.68 and they all say:
>
> - initiated
> - reached_pg
>
> The cluster is running Hammer 0.94.5 in this case.
>
> From what I know an OSD had a failing disk and was restarted a couple of times
> while the disk gave errors. This caused the PG to become incomplete.
>
> I've set debug osd to 20, but I can't really tell what is going wrong on osd.68
> which causes it to stall this long.
>
> Any idea what to do here to get this PG up and running again?
>
> Wido
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
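
For future readers, the `ceph pg dump_stuck` output quoted above can be
filtered to show which OSD is the acting primary for each incomplete PG.
This is only a rough sketch: it runs against a copy of the two lines from
the message (with the "> " quote prefix removed) rather than a live
cluster, it assumes the columns are tab-separated in the order shown, and
the function name incomplete_primaries is made up for illustration.

```shell
#!/bin/sh
# Sample taken from the quoted message: header line plus one stuck PG.
# Assumption: fields are tab-separated, in the column order shown above.
dump_stuck_output=$(printf 'pg_stat\tstate\tup\tup_primary\tacting\tacting_primary\n3.117\tincomplete\t[68,55,74]\t68\t[68,55,74]\t68\n')

# Hypothetical helper: read dump_stuck-style text on stdin and print
# "<pgid> osd.<acting_primary>" for every PG whose state contains "incomplete".
incomplete_primaries() {
    awk -F'\t' 'NR > 1 && $2 ~ /incomplete/ { print $1, "osd." $6 }'
}

result=$(printf '%s\n' "$dump_stuck_output" | incomplete_primaries)
echo "$result"
```

On a real cluster the same function could presumably be fed from
`ceph pg dump_stuck` directly, provided the output there has the same
layout as in the message.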
