pg incomplete state

Hi folks,

I've been rebuilding drives in my cluster to add space.  This has gone
well so far.

After the last batch of rebuilds, I'm left with one placement group in
an incomplete state.

$ sudo ceph health detail
HEALTH_WARN 1 pgs incomplete; 1 pgs stuck inactive; 1 pgs stuck unclean
pg 3.ea is stuck inactive since forever, current state incomplete, last
acting [30,11]
pg 3.ea is stuck unclean since forever, current state incomplete, last
acting [30,11]
pg 3.ea is incomplete, acting [30,11]
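
In case it's useful, here's roughly how I've been trying to dig
deeper (just a sketch; I'm assuming pg query behaves the same on my
pre-firefly release, and the exact field names may vary):

$ sudo ceph pg map 3.ea     # confirm the up/acting sets
$ sudo ceph pg 3.ea query   # peering state; the recovery_state
                            # section should show what's blocking,
                            # e.g. down_osds_we_would_probe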

I've restarted both OSDs a few times, but it hasn't cleared the error.
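
(For reference, the restarts were just the stock init script --
assuming nothing unusual about this Crowbar deployment:)

$ sudo service ceph restart osd.30   # on da0-36-9f-0e-2b-88
$ sudo service ceph restart osd.11   # on da0-36-9f-0e-2b-40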

On the primary I see warnings in the log about slow requests:

2015-10-20 08:40:36.678569 7f361585c700  0 log [WRN] : 8 slow requests,
3 included below; oldest blocked for > 31.922487 secs
2015-10-20 08:40:36.678580 7f361585c700  0 log [WRN] : slow request
31.531606 seconds old, received at 2015-10-20 08:40:05.146902:
osd_op(client.158903.1:343217143 rb.0.25cf8.238e1f29.00000000a044 [read
1064960~262144] 3.ae9968ea RETRY) v4 currently reached pg
2015-10-20 08:40:36.678592 7f361585c700  0 log [WRN] : slow request
31.531591 seconds old, received at 2015-10-20 08:40:05.146917:
osd_op(client.158903.1:343217144 rb.0.25cf8.238e1f29.00000000a044 [read
2113536~262144] 3.ae9968ea RETRY) v4 currently reached pg
2015-10-20 08:40:36.678599 7f361585c700  0 log [WRN] : slow request
31.531551 seconds old, received at 2015-10-20 08:40:05.146957:
osd_op(client.158903.1:343232634 ekessler-default.rbd [watch 35~0]
3.e4bd50ea) v4 currently reached pg

Notes online suggest this is an issue with the journal and that it may
be possible to export and rebuild the pg, but I don't have firefly.

https://ceph.com/community/incomplete-pgs-oh-my/
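
If I'm reading that post right, the procedure is roughly the sketch
below. To be clear about assumptions: ceph_objectstore_tool only ships
with giant (and late firefly point releases), which is exactly what I
don't have, and the export filename here is made up.

# on the node with the good copy, with the OSD stopped:
$ sudo service ceph stop osd.11
$ sudo ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-11 \
    --journal-path /var/lib/ceph/osd/ceph-11/journal \
    --pgid 3.ea --op export --file /tmp/pg3.ea.export

# then remove the empty copy on the primary and import the export:
$ sudo service ceph stop osd.30
$ sudo ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-30 \
    --journal-path /var/lib/ceph/osd/ceph-30/journal \
    --pgid 3.ea --op remove
$ sudo ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-30 \
    --journal-path /var/lib/ceph/osd/ceph-30/journal \
    --pgid 3.ea --op import --file /tmp/pg3.ea.export
$ sudo service ceph start osd.30
$ sudo service ceph start osd.11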

Interestingly, pg 3.ea appears to be complete on osd.11 (the secondary)
but missing entirely on osd.30 (the primary). 

on osd.30 (primary):

crowbar@da0-36-9f-0e-2b-88:~$ du -sk
/var/lib/ceph/osd/ceph-30/current/3.ea_head/
0       /var/lib/ceph/osd/ceph-30/current/3.ea_head/

on osd.11 (secondary):

crowbar@da0-36-9f-0e-2b-40:~$ du -sh
/var/lib/ceph/osd/ceph-11/current/3.ea_head/                                                            
63G     /var/lib/ceph/osd/ceph-11/current/3.ea_head/

This makes some sense, since my drive-rebuilding activity reformatted
the primary, osd.30.  It also gives me some hope that my data is not
lost.
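
As a quick sanity check that the secondary's copy looks whole, a plain
filesystem count of the objects in the pg directory (nothing
Ceph-specific here):

$ find /var/lib/ceph/osd/ceph-11/current/3.ea_head/ -type f | wc -l
# 63G of rbd data at the default 4M object size should be on the
# order of 16000 files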

I understand that incomplete means a problem with the journal, but is
there a way to dig deeper into this, or a way to get the secondary's
data to take over?

Thanks,

~jpr





