Yeah, the "last acting" set there is probably from prior to your lost data and forced pg creation, so it might not have any bearing on what's happening now. Software Engineer #42 @ http://inktank.com | http://ceph.com On Tue, Sep 30, 2014 at 10:07 AM, Robert LeBlanc <robert at leblancnet.us> wrote: > I rebuilt the primary OSD (29) in the hopes it would unblock whatever it > was, but no luck. I'll check the admin socket and see if there is anything I > can find there. > > On Tue, Sep 30, 2014 at 10:36 AM, Gregory Farnum <greg at inktank.com> wrote: >> >> On Tuesday, September 30, 2014, Robert LeBlanc <robert at leblancnet.us> >> wrote: >>> >>> On our dev cluster, I've got a PG that won't create. We had a host fail >>> with 10 OSDs that needed to be rebuilt. A number of other OSDs were down for >>> a few days (did I mention this was a dev cluster?). The other OSDs >>> eventually came up once the OSD maps caught up on them. I rebuilt the OSDs >>> on all the hosts because we were running into XFS lockups with bcache. There >>> were a number of PGs that could not be found when all the hosts were >>> rebuilt. I tried restarting all the OSDs, the MONs, and deep scrubbing the >>> OSDs they were on as well as the PGs. I performed a repair on the OSDs as >>> well without any luck. One of pools had a recommendation to increase the >>> PGs, so I increased it thinking it might be able to help. >>> >>> Nothing was helping and I could not find any reference to them so I force >>> created them. That cleared up all but one that is creating due to the new PG >>> number. Now, there is nothing I can do to unstick this one PG, I can't force >>> create it, I can't increase the pgp_num, nada. At one point when recreating >>> the OSDs, some of the number got out of order and to calm my OCD, I "fixed" >>> it requiring me to manually modify the CRUSH map as the OSD appeared in both >>> hosts, this was before I increased the PGs. >>> >>> There is nothing critical on this cluster, but I'm using this as an >>> opportunity to understand Ceph in case we run into something similar in our >>> future production environment. >>> >>> HEALTH_WARN 1 pgs stuck inactive; 1 pgs stuck unclean; pool libvirt-pool >>> pg_num 256 > pgp_num 128 >>> pg 4.bf is stuck inactive since forever, current state creating, last >>> acting [29,15,32] >>> pg 4.bf is stuck unclean since forever, current state creating, last >>> acting [29,15,32] >>> pool libvirt-pool pg_num 256 > pgp_num 128 >>> [root at nodea ~]# ceph-osd --version >>> ceph version 0.85 (a0c22842db9eaee9840136784e94e50fabe77187) >>> >>> More output http://pastebin.com/ajgpU7Zx >>> >>> Thanks >> >> >> You should find out which OSD the PG maps to, and see if "ceph pg query" >> or the osd admin socket will expose anything useful about its state. >> -Greg >> >> >> -- >> Software Engineer #42 @ http://inktank.com | http://ceph.com > >