Yeah, the "last acting" set there is probably from prior to your lost data and forced pg creation, so it might not have any bearing on what's happening now. Software Engineer #42 @ http://inktank.com | http://ceph.com On Tue, Sep 30, 2014 at 10:07 AM, Robert LeBlanc <robert at leblancnet.us> wrote: > I rebuilt the primary OSD (29) in the hopes it would unblock whatever it > was, but no luck. I'll check the admin socket and see if there is anything I > can find there. > > On Tue, Sep 30, 2014 at 10:36 AM, Gregory Farnum <greg at inktank.com> wrote: >> >> On Tuesday, September 30, 2014, Robert LeBlanc <robert at leblancnet.us> >> wrote: >>> >>> On our dev cluster, I've got a PG that won't create. We had a host fail >>> with 10 OSDs that needed to be rebuilt. A number of other OSDs were down for >>> a few days (did I mention this was a dev cluster?). The other OSDs >>> eventually came up once the OSD maps caught up on them. I rebuilt the OSDs >>> on all the hosts because we were running into XFS lockups with bcache. There >>> were a number of PGs that could not be found when all the hosts were >>> rebuilt. I tried restarting all the OSDs, the MONs, and deep scrubbing the >>> OSDs they were on as well as the PGs. I performed a repair on the OSDs as >>> well without any luck. One of pools had a recommendation to increase the >>> PGs, so I increased it thinking it might be able to help. >>> >>> Nothing was helping and I could not find any reference to them so I force >>> created them. That cleared up all but one that is creating due to the new PG >>> number. Now, there is nothing I can do to unstick this one PG, I can't force >>> create it, I can't increase the pgp_num, nada. At one point when recreating >>> the OSDs, some of the number got out of order and to calm my OCD, I "fixed" >>> it requiring me to manually modify the CRUSH map as the OSD appeared in both >>> hosts, this was before I increased the PGs. >>> >>> There is nothing critical on this cluster, but I'm using this as an >>> opportunity to understand Ceph in case we run into something similar in our >>> future production environment. >>> >>> HEALTH_WARN 1 pgs stuck inactive; 1 pgs stuck unclean; pool libvirt-pool >>> pg_num 256 > pgp_num 128 >>> pg 4.bf is stuck inactive since forever, current state creating, last >>> acting [29,15,32] >>> pg 4.bf is stuck unclean since forever, current state creating, last >>> acting [29,15,32] >>> pool libvirt-pool pg_num 256 > pgp_num 128 >>> [root at nodea ~]# ceph-osd --version >>> ceph version 0.85 (a0c22842db9eaee9840136784e94e50fabe77187) >>> >>> More output http://pastebin.com/ajgpU7Zx >>> >>> Thanks >> >> >> You should find out which OSD the PG maps to, and see if "ceph pg query" >> or the osd admin socket will expose anything useful about its state. >> -Greg >> >> >> -- >> Software Engineer #42 @ http://inktank.com | http://ceph.com > >