I rebuilt the primary OSD (29) in the hope it would unblock whatever it
was, but no luck. I'll check the admin socket and see if there is
anything I can find there.

On Tue, Sep 30, 2014 at 10:36 AM, Gregory Farnum <greg at inktank.com> wrote:
> On Tuesday, September 30, 2014, Robert LeBlanc <robert at leblancnet.us> wrote:
>
>> On our dev cluster, I've got a PG that won't create. We had a host fail
>> with 10 OSDs that needed to be rebuilt. A number of other OSDs were down
>> for a few days (did I mention this was a dev cluster?). The other OSDs
>> eventually came up once the OSD maps caught up on them. I rebuilt the
>> OSDs on all the hosts because we were running into XFS lockups with
>> bcache. There were a number of PGs that could not be found once all the
>> hosts were rebuilt. I tried restarting all the OSDs, the MONs, and
>> deep-scrubbing the OSDs they were on as well as the PGs. I performed a
>> repair on the OSDs as well, without any luck. One of the pools had a
>> recommendation to increase the PGs, so I increased it, thinking it might
>> help.
>>
>> Nothing was helping and I could not find any reference to them, so I
>> force-created them. That cleared up all but one, which is creating due
>> to the new PG number. Now there is nothing I can do to unstick this one
>> PG: I can't force-create it, I can't increase the pgp_num, nada. At one
>> point when recreating the OSDs, some of the numbers got out of order
>> and, to calm my OCD, I "fixed" it, which required me to manually modify
>> the CRUSH map because the OSD appeared in both hosts; this was before I
>> increased the PGs.
>>
>> There is nothing critical on this cluster, but I'm using this as an
>> opportunity to understand Ceph in case we run into something similar in
>> our future production environment.
>>
>> HEALTH_WARN 1 pgs stuck inactive; 1 pgs stuck unclean; pool libvirt-pool
>> pg_num 256 > pgp_num 128
>> pg 4.bf is stuck inactive since forever, current state creating, last
>> acting [29,15,32]
>> pg 4.bf is stuck unclean since forever, current state creating, last
>> acting [29,15,32]
>> pool libvirt-pool pg_num 256 > pgp_num 128
>> [root at nodea ~]# ceph-osd --version
>> ceph version 0.85 (a0c22842db9eaee9840136784e94e50fabe77187)
>>
>> More output: http://pastebin.com/ajgpU7Zx
>>
>> Thanks
>
> You should find out which OSD the PG maps to, and see if "ceph pg query"
> or the osd admin socket will expose anything useful about its state.
> -Greg
>
> --
> Software Engineer #42 @ http://inktank.com | http://ceph.com
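
For reference, a rough sketch of the checks Greg is suggesting, assuming
the stuck PG is 4.bf and its primary is osd.29 as shown in the health
output above (the admin socket path is the default for this era of Ceph;
adjust it if your cluster uses a non-default run directory):

  # Which OSDs does the PG map to in the current osdmap?
  ceph pg map 4.bf

  # Ask the acting set for the PG's internal state (peering info, etc.)
  ceph pg 4.bf query

  # Poke the primary's admin socket directly
  ceph --admin-daemon /var/run/ceph/ceph-osd.29.asok help
  ceph --admin-daemon /var/run/ceph/ceph-osd.29.asok status
  ceph --admin-daemon /var/run/ceph/ceph-osd.29.asok dump_ops_in_flight

The "force create" mentioned above presumably refers to the mon command
of that era, e.g. "ceph pg force_create_pg 4.bf". The separate
pg_num > pgp_num warning would normally be cleared by bringing pgp_num
up to match ("ceph osd pool set libvirt-pool pgp_num 256"), though that
on its own won't unstick a PG that is wedged in the creating state.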