pgs stuck in 'incomplete' state, blocked ops, query command hangs

Lincoln Bryant <lincolnb@xxxxxxxxxxxx> · Tue, 21 Oct 2014 09:39:33 -0500

Hi cephers,

We have two pgs that are stuck in 'incomplete' state across two different pools: 
pg 2.525 is stuck inactive since forever, current state incomplete, last acting [55,89]
pg 0.527 is stuck inactive since forever, current state incomplete, last acting [55,89]
pg 0.527 is stuck unclean since forever, current state incomplete, last acting [55,89]
pg 2.525 is stuck unclean since forever, current state incomplete, last acting [55,89]
pg 0.527 is incomplete, acting [55,89]
pg 2.525 is incomplete, acting [55,89]
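
For anyone comparing against their own cluster, here is a minimal Python sketch that groups the "stuck" lines above by pg. It works only on the text pasted in this message. The regular expression and the helper name `summarize_stuck` are my own illustration, not part of any Ceph tooling, and the line format is assumed from the six lines shown here.

```python
import re
from collections import defaultdict

# Health lines copied verbatim from the message above.
HEALTH_LINES = """\
pg 2.525 is stuck inactive since forever, current state incomplete, last acting [55,89]
pg 0.527 is stuck inactive since forever, current state incomplete, last acting [55,89]
pg 0.527 is stuck unclean since forever, current state incomplete, last acting [55,89]
pg 2.525 is stuck unclean since forever, current state incomplete, last acting [55,89]
pg 0.527 is incomplete, acting [55,89]
pg 2.525 is incomplete, acting [55,89]
""".splitlines()

# Assumed format of a "stuck" line, inferred only from the sample above.
STUCK_RE = re.compile(
    r"^pg (?P<pgid>\d+\.[0-9a-f]+) is stuck (?P<kind>\w+) .*"
    r"current state (?P<state>[\w+]+), last acting \[(?P<acting>[\d,]+)\]$"
)


def summarize_stuck(lines):
    """Group 'stuck' lines by pg id (hypothetical helper, not a Ceph API)."""
    summary = defaultdict(lambda: {"kinds": set(), "state": None, "acting": None})
    for line in lines:
        match = STUCK_RE.match(line.strip())
        if match is None:
            continue  # skip lines that are not in the 'is stuck' form
        entry = summary[match["pgid"]]
        entry["kinds"].add(match["kind"])
        entry["state"] = match["state"]
        entry["acting"] = [int(osd) for osd in match["acting"].split(",")]
    return dict(summary)


summary = summarize_stuck(HEALTH_LINES)
for pgid, info in sorted(summary.items()):
    pool = pgid.split(".")[0]  # the part before the dot is the pool id
    print(pgid, "pool", pool, sorted(info["kinds"]), info["state"], info["acting"])
```

Both pgs share the same acting set, which is why the blocked ops further down all land on the same primary.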

Basically, we had 2x replication, and two disks on different machines died near-simultaneously, which left these pgs stuck in 'down+peering'. I had to do some combination of declaring the OSDs lost and running 'force_create_pg'. I realize the data on those pgs is now lost, but I'm stuck as to how to get the pgs out of 'incomplete'.

I also see many ops blocked on the primary OSD for these:
100 ops are blocked > 67108.9 sec
100 ops are blocked > 67108.9 sec on osd.55

However, osd.55 is a new disk. If I 'ceph osd out osd.55', the pgs move to another OSD and the new primary gets the blocked ops instead. Restarting osd.55 does nothing. Other pgs on osd.55 seem okay.

I would attach the result of a query, but if I run 'ceph pg 2.525 query', the command hangs until I press ctrl-c:

ceph pg 2.525 query
^CError EINTR: problem getting command descriptions from pg.2.525
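
Since the query never returns by itself, one way to script around the hang is to run it with a time limit. The following is a minimal Python sketch. The helper name `run_with_timeout` and the 30-second limit are illustrative assumptions. It only bounds how long the caller waits and does nothing to fix the pg.

```python
import subprocess


def run_with_timeout(cmd, timeout_s=30):
    """Run cmd and return its stdout, or None if it does not finish in time.

    Hypothetical helper: subprocess.run kills the child process when the
    timeout expires, so a hung command cannot block the caller forever.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return None
    return result.stdout


# Example call for the hanging query from this message (needs a ceph client):
#   out = run_with_timeout(["ceph", "pg", "2.525", "query"], timeout_s=30)
#   print("query timed out" if out is None else out)
```

A None result here just confirms the hang in a scriptable way; the cause still has to be found on the OSD side.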

I've also tried 'ceph pg repair 2.525', which does nothing.

Any thoughts here? Are my pools totally sunk? 

Thanks,
Lincoln
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com