Re: pgs stuck in 'incomplete' state, blocked ops, query command hangs

When I had PGs stuck with OSDs listed in down_osds_we_would_probe, there was no way I could convince Ceph to give up on the data while those OSDs were down.

I tried ceph osd lost, ceph pg mark_unfound_lost, and ceph pg force_create_pg. None of them did anything.
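
For reference, the invocations were of the form below; the pg and OSD ids here are just placeholders:

  ceph osd lost <id> --yes-i-really-mean-it
  ceph pg <pgid> mark_unfound_lost revert
  ceph pg force_create_pg <pgid>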

I eventually re-formatted the down OSD and brought it back online. It started backfilling, and down_osds_we_would_probe emptied out. Once that happened, ceph pg force_create_pg finally worked. It didn't work right away, though: if I recall correctly, the PGs went into the creating state and stayed there for many hours. They finally finished creating when another OSD restarted.
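
Roughly, the sequence was something like this (from memory, for a sysvinit-style deployment; the id, paths, and pgid are illustrative, so adapt them to your setup):

  # re-format the down OSD's data directory and generate a fresh key
  ceph-osd -i <id> --mkfs --mkkey
  ceph auth del osd.<id>
  ceph auth add osd.<id> osd 'allow *' mon 'allow rwx' \
      -i /var/lib/ceph/osd/ceph-<id>/keyring
  # bring it back online; backfill begins and down_osds_we_would_probe drains
  service ceph start osd.<id>
  # once down_osds_we_would_probe is empty, force_create_pg finally takes effect
  ceph pg force_create_pg <pgid>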




On Tue, Oct 21, 2014 at 9:59 AM, Lincoln Bryant <lincolnb@xxxxxxxxxxxx> wrote:
A small update on this: I rebooted all of the Ceph nodes and was then able to query one of the misbehaving pgs.

I've attached the query for pg 2.525.




There are some things like this in the peer info:

              "up": [],
              "acting": [],
              "up_primary": -1,
              "acting_primary": -1},


I also see things like:
          "down_osds_we_would_probe": [
                85],

But I don't have an OSD 85:
        85      3.64                    osd.85  DNE

# ceph osd rm osd.85
osd.85 does not exist.
# ceph osd lost 85 --yes-i-really-mean-it
osd.85 is not down or doesn't exist

Any help would be greatly appreciated.

Thanks,
Lincoln

On Oct 21, 2014, at 9:39 AM, Lincoln Bryant wrote:

> Hi cephers,
>
> We have two pgs that are stuck in 'incomplete' state across two different pools:
> pg 2.525 is stuck inactive since forever, current state incomplete, last acting [55,89]
> pg 0.527 is stuck inactive since forever, current state incomplete, last acting [55,89]
> pg 0.527 is stuck unclean since forever, current state incomplete, last acting [55,89]
> pg 2.525 is stuck unclean since forever, current state incomplete, last acting [55,89]
> pg 0.527 is incomplete, acting [55,89]
> pg 2.525 is incomplete, acting [55,89]
>
> Basically, we ran into a problem where, with 2x replication, two disks on different machines died near-simultaneously, and my pgs were stuck in 'down+peering'. I had to do some combination of declaring the OSDs as lost and running 'force_create_pg'. I realize the data on those pgs is now lost, but I'm stuck as to how to get the pgs out of 'incomplete'.
>
> I also see many ops blocked on the primary OSD for these:
> 100 ops are blocked > 67108.9 sec
> 100 ops are blocked > 67108.9 sec on osd.55
>
> However, this is a new disk. If I 'ceph osd out osd.55', the pgs move to another OSD and the new primary gets blocked ops. Restarting osd.55 does nothing. Other pgs on osd.55 seem okay.
>
> I would attach the result of a query, but if I run 'ceph pg 2.525 query', the command hangs completely until I ctrl-c:
>
> ceph pg 2.525 query
> ^CError EINTR: problem getting command descriptions from pg.2.525
>
> I've also tried 'ceph pg repair 2.525', which does nothing.
>
> Any thoughts here? Are my pools totally sunk?
>
> Thanks,
> Lincoln


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
