PGs stuck creating

brak@xxxxxxxxxxxxxxx (Brian Rak) · Fri, 08 Aug 2014 16:51:51 -0400

ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)

I recently managed to cause some problems for one of our clusters: 1/3 
of the OSDs failed and lost all their data.

I removed all the failed OSDs from the crush map and ran 'ceph osd rm' 
on them.  Once the cluster finished recovering, I was left with a whole 
bunch of 'stale+active+clean' PGs.  These had been hosted entirely on 
the OSDs that failed.

So, there will be some data loss here.  Luckily the majority of the data 
is easily replaceable.  I couldn't do a whole lot with these PGs, so I 
ended up forcing ceph to recreate them, with:

ceph health detail | grep pg | awk '{ print $2 }' | xargs -n1 ceph pg force_create_pg
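
A more selective variant of that pipeline might look like the sketch 
below.  Note that 'grep pg' matches any line containing "pg", including 
the "HEALTH_WARN ... pgs stuck ..." summary line, and the same PG can be 
listed more than once.  The helper name stale_pgids is hypothetical, and 
the sketch assumes the per-PG lines of 'ceph health detail' look like 
"pg <pgid> is ...", which may differ between versions.

```shell
# Hypothetical helper: read `ceph health detail` text on stdin and print
# the unique IDs of PGs whose line mentions a "stale" state.
# Assumes per-PG lines start with the word "pg" followed by the PG ID.
stale_pgids() {
    awk '$1 == "pg" && /stale/ { print $2 }' | sort -u
}

# Dry run: print the commands that would be executed instead of running them.
# ceph health detail | stale_pgids | xargs -n1 echo ceph pg force_create_pg
```

Dropping the 'echo' from the commented line would run the commands for 
real; reviewing the dry-run output first seems prudent, since recreating 
a PG discards whatever it held.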

This fixed most of them, though I'm now left with one that's hanging on 
'creating'.  Any suggestions for what I can do?  There isn't any data to 
lose in this pg, so I would be okay removing it, but I don't see any way 
to do that.  How can I force the OSD to create it again?

     cluster e312b58c-0391-43d0-98e6-25a41bea6a70
      health HEALTH_WARN 1 pgs stuck inactive; 1 pgs stuck unclean
      monmap e3: 3 mons at {snip}, election epoch 50, quorum 0,1,2 {snip}
      osdmap e3922: 11 osds: 11 up, 11 in
       pgmap v1261502: 4722 pgs, 14 pools, 4344 GB data, 3314 kobjects
             8668 GB used, 11803 GB / 20472 GB avail
                    1 creating
                 4721 active+clean
   client io 449 kB/s rd, 0 B/s wr, 643 op/s

# ceph pg dump | grep creating
dumped all in format plain
3.15c   0       0       0       0       0       0       0       creating        2014-08-08 16:18:38.781245      0'0     0:0     [4,2]   4       [2,4]   2       0'0     0.000000        0'0     0.000000
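
To make that dump row easier to read, a small filter along these lines 
could pull out just the state and the OSD sets.  This is only a sketch: 
pg_summary is a hypothetical name, and the field positions (state in 
field 9, up set in field 14, acting set in field 16) are an assumption 
read off the plain-format row above, so they may not hold for other 
versions or output formats.

```shell
# Hypothetical helper: read plain-format `ceph pg dump` rows on stdin and
# summarize PGs whose state contains "creating".
# Field positions are assumed from the row shown above:
#   $1 = pgid, $9 = state, $14 = up set, $16 = acting set
pg_summary() {
    awk '$9 ~ /creating/ {
        printf "%s state=%s up=%s acting=%s\n", $1, $9, $14, $16
    }'
}

# Usage against a live cluster (not run here):
# ceph pg dump 2>/dev/null | pg_summary
```

Fed the row above, this prints "3.15c state=creating up=[4,2] 
acting=[2,4]", which shows the up and acting sets list the same two 
OSDs in a different order.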

# ceph pg 3.15c query
Error ENOENT: i don't have pgid 3.15c

# ceph pg 3.15c mark_unfound_lost revert
Error ENOENT: i don't have pgid 3.15c

If I try to force a scrub, the OSD log shows:

2014-08-08 16:41:38.016388 7f33270cd700  0 osd.2 3926 do_command r=0
2014-08-08 16:41:39.775253 7f33270cd700  0 osd.2 3926 do_command r=0
2014-08-08 16:41:42.491501 7f33270cd700  0 osd.2 3926 do_command r=0
2014-08-08 16:41:42.497906 7f33270cd700  0 osd.2 3926 do_command r=-2 i don't have pgid 3.15c
2014-08-08 16:41:42.497911 7f33270cd700  0 log [INF] : i don't have pgid 3.15c