PG stuck creating

Yeah, the "last acting" set there is probably from prior to your lost
data and forced pg creation, so it might not have any bearing on
what's happening now.
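
For comparison with that stale "last acting" set, the current mapping can be checked with something like this (a minimal sketch, assuming the pg id 4.bf from the health output quoted below):

    ceph pg map 4.bf    # prints the osdmap epoch plus the PG's current up and acting OSD sets
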
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Tue, Sep 30, 2014 at 10:07 AM, Robert LeBlanc <robert at leblancnet.us> wrote:
> I rebuilt the primary OSD (29) in the hope that it would unblock whatever
> was stuck, but no luck. I'll check the admin socket and see if there is
> anything I can find there.
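
For the admin socket check, the usual commands would be roughly these, run on the OSD host (a sketch, assuming the default socket path and osd.29 as the rebuilt primary):

    ceph --admin-daemon /var/run/ceph/ceph-osd.29.asok status              # state and osdmap epoch as the OSD itself sees them
    ceph --admin-daemon /var/run/ceph/ceph-osd.29.asok dump_ops_in_flight  # any requests the OSD is currently blocked on
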
>
> On Tue, Sep 30, 2014 at 10:36 AM, Gregory Farnum <greg at inktank.com> wrote:
>>
>> On Tuesday, September 30, 2014, Robert LeBlanc <robert at leblancnet.us>
>> wrote:
>>>
>>> On our dev cluster, I've got a PG that won't create. We had a host fail
>>> with 10 OSDs that needed to be rebuilt. A number of other OSDs were down for
>>> a few days (did I mention this was a dev cluster?). The other OSDs
>>> eventually came up once the OSD maps caught up on them. I rebuilt the OSDs
>>> on all the hosts because we were running into XFS lockups with bcache. There
>>> were a number of PGs that could not be found when all the hosts were
>>> rebuilt. I tried restarting all the OSDs, the MONs, and deep scrubbing the
>>> OSDs they were on as well as the PGs. I performed a repair on the OSDs as
>>> well, without any luck. One of the pools had a recommendation to increase
>>> its PG count, so I increased it, thinking that might help.
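
For reference, the scrub and repair attempts described above correspond to commands roughly like these (a sketch only, using osd.29 and pg 4.bf as stand-ins):

    ceph osd deep-scrub 29                       # deep scrub every PG on one OSD
    ceph pg deep-scrub 4.bf                      # deep scrub a single PG
    ceph osd repair 29                           # repair pass across an OSD
    ceph pg repair 4.bf                          # repair a single PG
    ceph osd pool set libvirt-pool pg_num 256    # the pg_num increase mentioned above
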
>>>
>>> Nothing was helping and I could not find any reference to them, so I
>>> force-created them. That cleared up all but one, which is stuck creating
>>> because of the new PG count. Now there is nothing I can do to unstick this
>>> one PG: I can't force-create it, I can't increase the pgp_num, nada. At one
>>> point while recreating the OSDs, some of the numbers got out of order, and
>>> to calm my OCD I "fixed" them, which required manually editing the CRUSH map
>>> because one OSD appeared under both hosts; this was before I increased the
>>> PGs.
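
The force-create and pgp_num steps would normally look something like this, with 4.bf and libvirt-pool taken from the health output below (a sketch, including the manual CRUSH map round trip):

    ceph pg force_create_pg 4.bf                  # re-issue the create for the stuck PG
    ceph osd pool set libvirt-pool pgp_num 256    # bring pgp_num up to match pg_num
    ceph osd getcrushmap -o crush.bin             # dump the CRUSH map ...
    crushtool -d crush.bin -o crush.txt           # ... decompile it, edit by hand ...
    crushtool -c crush.txt -o crush.new           # ... recompile ...
    ceph osd setcrushmap -i crush.new             # ... and inject the fixed map
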
>>>
>>> There is nothing critical on this cluster, but I'm using this as an
>>> opportunity to understand Ceph in case we run into something similar in our
>>> future production environment.
>>>
>>> HEALTH_WARN 1 pgs stuck inactive; 1 pgs stuck unclean; pool libvirt-pool
>>> pg_num 256 > pgp_num 128
>>> pg 4.bf is stuck inactive since forever, current state creating, last
>>> acting [29,15,32]
>>> pg 4.bf is stuck unclean since forever, current state creating, last
>>> acting [29,15,32]
>>> pool libvirt-pool pg_num 256 > pgp_num 128
>>> [root@nodea ~]# ceph-osd --version
>>> ceph version 0.85 (a0c22842db9eaee9840136784e94e50fabe77187)
>>>
>>> More output http://pastebin.com/ajgpU7Zx
>>>
>>> Thanks
>>
>>
>> You should find out which OSD the PG maps to, and see if "ceph pg query"
>> or the osd admin socket will expose anything useful about its state.
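
Concretely, that might look like this (a sketch, assuming pg 4.bf and osd.29 as its current primary):

    ceph pg 4.bf query             # the PG's peering/creating state as reported by the primary
    ceph pg dump_stuck inactive    # all stuck-inactive PGs with their current acting sets
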
>> -Greg
>>
>>
>> --
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>

