What's your cluster look like? I wonder if you can just remove the bad
PG from osd.4 and let it recover from the existing osd.1
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Sat, May 3, 2014 at 9:17 AM, Jeff Bachtel
<jbachtel at bericotechnologies.com> wrote:
> This is all on firefly rc1 on CentOS 6
>
> I had an osd getting overfull, and misinterpreting directions I downed it
> then manually removed pg directories from the osd mount. On restart and
> after a good deal of rebalancing (setting osd weights as I should've
> originally), I'm now at
>
>     cluster de10594a-0737-4f34-a926-58dc9254f95f
>      health HEALTH_WARN 2 pgs backfill; 1 pgs incomplete; 1 pgs stuck
>             inactive; 308 pgs stuck unclean; recovery 1/2420563 objects
>             degraded (0.000%); noout flag(s) set
>      monmap e7: 3 mons at
>             {controller1=10.100.2.1:6789/0,controller2=10.100.2.2:6789/0,controller3=10.100.2.3:6789/0},
>             election epoch 556, quorum 0,1,2
>             controller1,controller2,controller3
>      mdsmap e268: 1/1/1 up {0=controller1=up:active}
>      osdmap e3492: 5 osds: 5 up, 5 in
>             flags noout
>       pgmap v4167420: 320 pgs, 15 pools, 4811 GB data, 1181 kobjects
>             9770 GB used, 5884 GB / 15654 GB avail
>             1/2420563 objects degraded (0.000%)
>                    3 active
>                   12 active+clean
>                    2 active+remapped+wait_backfill
>                    1 incomplete
>                  302 active+remapped
>   client io 364 B/s wr, 0 op/s
>
> # ceph pg dump | grep 0.2f
> dumped all in format plain
> 0.2f  0  0  0  0  0  0  0  incomplete  2014-05-03 11:38:01.526832
>     0'0  3492:23  [4]  4  [4]  4  2254'20053
>     2014-04-28 00:24:36.504086  2100'18109  2014-04-26 22:26:23.699330
>
> # ceph pg map 0.2f
> osdmap e3492 pg 0.2f (0.2f) -> up [4] acting [4]
>
> The pg query for the downed pg is at
> https://gist.github.com/jeffb-bt/c8730899ff002070b325
>
> Of course, the osd I manually mucked with is the only one the cluster is
> picking up as up/acting. Now, I can query the pg and find epochs where
> other osds (that I didn't jack up) were acting.
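[The approach Greg suggests could be sketched roughly as below. This is a hedged sketch, not a command sequence from the thread: it assumes a firefly build that ships the PG export/removal tool (named ceph_objectstore_tool in firefly backports, ceph-objectstore-tool in later releases) and a default filestore layout under /var/lib/ceph; the paths and export filename are assumptions.]

```shell
# Sketch: drop osd.4's bad copy of pg 0.2f so it can backfill from osd.1.
# The tool operates on an offline store, so stop the OSD first.
service ceph stop osd.4

# Keep a safety copy of the suspect PG before removing anything.
ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-4 \
    --journal-path /var/lib/ceph/osd/ceph-4/journal \
    --pgid 0.2f --op export --file /root/pg-0.2f.export

# Remove the PG from osd.4's object store (metadata and all, unlike a
# bare rm of the directory).
ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-4 \
    --journal-path /var/lib/ceph/osd/ceph-4/journal \
    --pgid 0.2f --op remove

service ceph start osd.4
# Then watch "ceph -w" and "ceph pg 0.2f query" for backfill from osd.1.
```

[The export step is optional but cheap insurance: if recovery does not proceed as hoped, the PG can be imported back with --op import.]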
> And in fact, the latest of those entries (osd.1) has the pg directory in
> its osd mount, and it's a good healthy 59gb.
>
> I've tried manually rsync'ing (and preserving attributes) that set of
> directories from osd.1 to osd.4 without success. Likewise I've tried
> copying the directories over without attributes set. I've done many,
> many deep scrubs, but the pg query does not show the scrub timestamps
> being affected.
>
> I'm seeking ideas for either fixing metadata on the directory on osd.4
> to cause this pg to be seen/recognized, or ideas on forcing the
> cluster's pg map to point to osd.1 for the incomplete pg (basically
> wiping out the cluster's memory that osd.4 ever had 0.2f). Or any other
> solution :) It's only 59gb, so worst case I'll mark it lost and recreate
> the pg, but I'd prefer to learn enough of the innards to understand what
> is going on, and possible means of fixing it.
>
> Thanks for any help,
>
> Jeff
>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
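[The worst case Jeff mentions at the end of his message, marking the data lost and recreating the pg, might look roughly like the sketch below. This is a hedged, last-resort outline, not advice from the thread: it irreversibly discards whatever the cluster still thinks osd.4 held, so it only makes sense after taking a backup of the pg directory and exhausting recovery options.]

```shell
# Last-resort sketch: give up on the unrecoverable copy of pg 0.2f and
# recreate the pg empty.

# Tell the cluster that osd.4's data is permanently gone, so it stops
# waiting on that copy. Destructive: double-check before running.
ceph osd lost 4 --yes-i-really-mean-it

# Recreate the incomplete pg as an empty pg.
ceph pg force_create_pg 0.2f

# Watch cluster status until the pg goes active+clean.
ceph -s
```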