On Tuesday, September 30, 2014, Robert LeBlanc <robert at leblancnet.us> wrote:

> On our dev cluster, I've got a PG that won't create. We had a host fail
> with 10 OSDs that needed to be rebuilt. A number of other OSDs were down
> for a few days (did I mention this was a dev cluster?). The other OSDs
> eventually came up once the OSD maps caught up on them. I rebuilt the
> OSDs on all the hosts because we were running into XFS lockups with
> bcache. There were a number of PGs that could not be found once all the
> hosts were rebuilt. I tried restarting all the OSDs and the MONs, and
> deep scrubbing the OSDs they were on as well as the PGs. I also ran a
> repair on the OSDs, without any luck. One of the pools had a
> recommendation to increase its PG count, so I increased it thinking it
> might help.
>
> Nothing was helping and I could not find any reference to the missing
> PGs, so I force-created them. That cleared up all but one, which is
> still creating because of the new PG count. Now there is nothing I can
> do to unstick this one PG: I can't force-create it, I can't increase
> pgp_num, nada. At one point while recreating the OSDs, some of the OSD
> numbers got out of order, and to calm my OCD I "fixed" them, which
> required manually modifying the CRUSH map because one OSD appeared under
> both hosts; this was before I increased the PG count.
>
> There is nothing critical on this cluster, but I'm using it as an
> opportunity to understand Ceph in case we run into something similar in
> our future production environment.
>
> HEALTH_WARN 1 pgs stuck inactive; 1 pgs stuck unclean; pool libvirt-pool
> pg_num 256 > pgp_num 128
> pg 4.bf is stuck inactive since forever, current state creating, last
> acting [29,15,32]
> pg 4.bf is stuck unclean since forever, current state creating, last
> acting [29,15,32]
> pool libvirt-pool pg_num 256 > pgp_num 128
> [root@nodea ~]# ceph-osd --version
> ceph version 0.85 (a0c22842db9eaee9840136784e94e50fabe77187)
>
> More output: http://pastebin.com/ajgpU7Zx
>
> Thanks

You should find out which OSD the PG maps to, and see if "ceph pg query"
or the OSD admin socket will expose anything useful about its state.
-Greg

-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com
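
A minimal sketch of the checks Greg suggests, using pg 4.bf and the
acting set [29,15,32] from the status output above. The admin socket
path below is the default /var/run/ceph location and osd.29 is assumed
to be the primary; adjust both for your cluster.

Which OSDs the PG maps to:
    ceph pg map 4.bf

Detailed peering/creating state for the PG:
    ceph pg 4.bf query

Asking the assumed primary, osd.29, over its admin socket (use "help"
to list what the daemon will answer):
    ceph --admin-daemon /var/run/ceph/ceph-osd.29.asok help
    ceph --admin-daemon /var/run/ceph/ceph-osd.29.asok status

The "pg_num 256 > pgp_num 128" warning is normally cleared with

    ceph osd pool set libvirt-pool pgp_num 256

though that is exactly the step reported to be failing here.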