PGs stuck creating

Ahh, figured it out.  I hadn't removed the dead OSDs from the CRUSH map, 
which was apparently confusing Ceph.

I just ran 'ceph osd crush rm XXX' for each of them, restarted all the 
online OSDs, and the PG got created!
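For reference, rather than typing each removal by hand, the dead OSDs can be picked out of 'ceph osd tree' and turned into the cleanup commands. A rough sketch with made-up tree output and OSD ids (real output has more columns depending on your Ceph version, so adjust the field number to match):

```shell
# Made-up 'ceph osd tree' excerpt -- two dead OSDs, ids 5 and 9.
osd_tree='ID WEIGHT  TYPE NAME  UP/DOWN REWEIGHT
0  1.000   osd.0      up      1
5  1.000   osd.5      down    0
9  1.000   osd.9      down    0'

# Print (rather than run) a 'ceph osd crush rm' for every down OSD,
# so the list can be reviewed before piping it into sh.
printf '%s\n' "$osd_tree" | awk '$4 == "down" { print "ceph osd crush rm osd." $1 }'
```

On a real cluster you would review the printed list, append '| sh' to actually run it, and then restart the surviving OSDs as described above.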

On 8/8/2014 4:51 PM, Brian Rak wrote:
> ceph version 0.80.4 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)
>
> I recently managed to cause some problems for one of our clusters: we 
> had 1/3 of the OSDs fail and lose all their data.
>
> I removed all the failed OSDs from the CRUSH map and did 'ceph osd 
> rm'.  Once the cluster finished recovering, I was left with a whole 
> bunch of 'stale+active+clean' PGs.  These had been hosted entirely on 
> the OSDs that failed.
>
> So, there will be some data loss here.  Luckily the majority of the 
> data is easily replaceable.  I couldn't do a whole lot with these PGs, 
> so I ended up forcing Ceph to recreate them with:
>
> ceph health detail | grep pg | awk '{ print $2 }'  | xargs -n1 ceph pg 
> force_create_pg
>
> This fixed most of them, though I'm now left with one that's hanging 
> on 'creating'.  Any suggestions for what I can do?  There isn't any 
> data to lose in this pg, so I would be okay removing it, but I don't 
> see any way to do that.  How can I force the OSD to create it again?
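As an aside, the grep/awk extraction feeding force_create_pg can be sanity-checked on canned output first, so you know exactly which pg ids it will touch. A sketch with invented 'ceph health detail' lines (the pg ids and states here are examples; the pattern is anchored at line start so the HEALTH_WARN summary line doesn't slip through):

```shell
# Invented 'ceph health detail' excerpt; real output has one 'pg ...'
# line per stuck placement group.
health='HEALTH_WARN 2 pgs stuck inactive; 2 pgs stuck unclean
pg 3.15c is stuck inactive for 600s, current state creating, last acting [2,4]
pg 4.2a is stuck unclean for 1200s, current state stale+active+clean, last acting [1]'

# Same extraction as in the message: keep the per-pg lines, take the
# second field (the pgid).
printf '%s\n' "$health" | grep '^pg ' | awk '{ print $2 }'
```

Once the printed list looks right, it can be piped into 'xargs -n1 ceph pg force_create_pg' as above.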
>
>     cluster e312b58c-0391-43d0-98e6-25a41bea6a70
>      health HEALTH_WARN 1 pgs stuck inactive; 1 pgs stuck unclean
>      monmap e3: 3 mons at {snip}, election epoch 50, quorum 0,1,2 {snip}
>      osdmap e3922: 11 osds: 11 up, 11 in
>       pgmap v1261502: 4722 pgs, 14 pools, 4344 GB data, 3314 kobjects
>             8668 GB used, 11803 GB / 20472 GB avail
>                    1 creating
>                 4721 active+clean
>   client io 449 kB/s rd, 0 B/s wr, 643 op/s
>
> # ceph pg dump | grep creating
> dumped all in format plain
> 3.15c   0       0       0       0       0       0       0       creating        2014-08-08 16:18:38.781245      0'0     0:0     [4,2]   4       [2,4]   2       0'0     0.000000        0'0     0.000000
>
> # ceph pg 3.15c query
> Error ENOENT: i don't have pgid 3.15c
>
> # ceph pg 3.15c mark_unfound_lost revert
> Error ENOENT: i don't have pgid 3.15c
>
> If I try to force a scrub:
>
> 2014-08-08 16:41:38.016388 7f33270cd700  0 osd.2 3926 do_command r=0
> 2014-08-08 16:41:39.775253 7f33270cd700  0 osd.2 3926 do_command r=0
> 2014-08-08 16:41:42.491501 7f33270cd700  0 osd.2 3926 do_command r=0
> 2014-08-08 16:41:42.497906 7f33270cd700  0 osd.2 3926 do_command r=-2 
> i don't have pgid 3.15c
> 2014-08-08 16:41:42.497911 7f33270cd700  0 log [INF] : i don't have 
> pgid 3.15c
>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


