Manually mucked up pg, need help fixing

jbachtel@xxxxxxxxxxxxxxxxxxxxxx (Jeff Bachtel) · Mon, 05 May 2014 20:20:58 -0400

noout was set while I manhandled osd.4 in and out of the cluster
repeatedly (trying to copy data from other osds and set attrs to make
osd.4 pick up that it had objects in pg 0.2f). It wasn't set before the
problem, and isn't set currently.

I don't really know where you saw pool size = 1:

# for p in $(ceph osd lspools | awk 'BEGIN { RS="," } { print $2 }'); do ceph osd pool get "$p" size; done
size: 2
size: 2
size: 2
size: 2
size: 2
size: 2
size: 2
size: 2
size: 2
size: 2
size: 2
size: 2
size: 2
size: 2

All pools are reporting size 2. The osd that last shared the incomplete 
pg (osd.1) had the pg directory intact and appropriately sized. However, 
it seems the pgmap was preferring osd.4 as the most recent copy of that 
pg, even when the pg directory was deleted. I guess because the pg was 
flagged incomplete, there was no further attempt to mirror the bogus pg 
onto another osd.
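
For anyone repeating the pool size check above: the one-liner relies on `ceph osd lspools` printing comma-separated "id name" pairs on one line (e.g. "0 data,1 metadata,2 rbd,"), which is what this release prints. A slightly more defensive sketch follows. The function names `pool_names` and `report_pool_sizes` are made up for illustration, and the output format is an assumption that may not hold on other releases.

```shell
#!/bin/sh
# Hypothetical helper: read "0 data,1 metadata,2 rbd," on stdin and
# print one pool name per line, skipping the empty trailing record.
pool_names() {
    tr ',' '\n' | awk 'NF >= 2 { print $2 }'
}

# Print "<pool>: size: N" for every pool. Assumes the comma-separated
# lspools format described above; pool names containing spaces are not handled.
report_pool_sizes() {
    ceph osd lspools | pool_names | while read -r pool; do
        printf '%s: ' "$pool"
        ceph osd pool get "$pool" size
    done
}
```

Calling `report_pool_sizes` on a node with an admin keyring labels each size with its pool name, which makes it easier to spot a single pool with a different size.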

Since I sent my original email (this afternoon actually), I've nuked 
osd.4 and created an osd.5 on its old disc. I've still got pg 0.2f 
listed as down/incomplete/inactive despite marking its only home osd as 
lost. I'll follow up tomorrow after object recovery is as complete as 
it's going to get.

At this point though I'm shrugging and accepting the data loss, but 
ideas on how to create a new pg to replace the incomplete 0.2f would be 
deeply useful. I'm supposing ceph pg force_create_pg 0.2f would suffice.
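
A rough sketch of the sequence I have in mind, written as a dry run so it only prints the commands unless DRY_RUN=0 is set. The pg id and osd id are the ones from this thread. `run`, `recreate_pg`, and the DRY_RUN variable are hypothetical names for illustration. This assumes the data in the pg is being given up for good, since the recreated pg comes back empty, and command availability may differ on other releases.

```shell
#!/bin/sh
# Sketch only: prints the commands by default instead of running them.
PGID=${PGID:-0.2f}          # incomplete pg from this thread
LOST_OSD=${LOST_OSD:-4}     # osd that held the only recognized copy
DRY_RUN=${DRY_RUN:-1}       # hypothetical switch; set to 0 to really execute

# Hypothetical wrapper: echo the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

recreate_pg() {
    run ceph pg map "$PGID"                               # where the cluster thinks the pg lives
    run ceph osd lost "$LOST_OSD" --yes-i-really-mean-it  # give up on that osd's copy
    run ceph pg force_create_pg "$PGID"                   # recreate the pg, empty
    run ceph pg "$PGID" query                             # check the resulting state
}
```

Running `recreate_pg` with the defaults just lists the four commands, so the plan can be reviewed before anything destructive happens.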

Jeff

On 05/05/2014 07:46 PM, Gregory Farnum wrote:
> Oh, you've got no-out set. Did you lose an OSD at any point? Are you
> really running the system with pool size 1? I think you've managed to
> erase the up-to-date data, but not the records of that data's
> existence. You'll have to explore the various "lost" commands, but I'm
> not sure what the right approach is here. It's possible you're just
> out of luck after manually adjusting the store improperly.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> On Mon, May 5, 2014 at 4:39 PM, Jeff Bachtel
> <jbachtel at bericotechnologies.com> wrote:
>> Thanks. That is a cool utility; unfortunately, I'm pretty sure the pg in
>> question held a cephfs object rather than rbd images (because mounting
>> cephfs is the only thing noticeably broken).
>>
>> Jeff
>>
>>
>> On 05/05/2014 06:43 PM, Jake Young wrote:
>>
>> I was in a similar situation where I could see the PG's data on an osd, but
>> there was nothing I could do to force the pg to use that osd's copy.
>>
>> I ended up using the rbd_restore tool to create my rbd on disk and then I
>> reimported it into the pool.
>>
>> See this thread for info on rbd_restore:
>> http://www.spinics.net/lists/ceph-devel/msg11552.html
>>
>> Of course, you have to copy all of the pieces of the rbd image onto one file
>> system somewhere (thank goodness for thin provisioning!) for the tool to
>> work.
>>
>> There really should be a better way.
>>
>> Jake
>>
>> On Monday, May 5, 2014, Jeff Bachtel <jbachtel at bericotechnologies.com>
>> wrote:
>>> Well, that'd be the ideal solution. Please check out the github gist I
>>> posted, though. It seems that despite osd.4 having nothing good for pg 0.2f,
>>> the cluster does not acknowledge any other osd has a copy of the pg. I've
>>> tried downing osd.4 and manually deleting the pg directory in question with
>>> the hope that the cluster would roll back epochs for 0.2f, but all it does
>>> is recreate the pg directory (empty) on osd.4.
>>>
>>> Jeff
>>>
>>> On 05/05/2014 04:33 PM, Gregory Farnum wrote:
>>>> What's your cluster look like? I wonder if you can just remove the bad
>>>> PG from osd.4 and let it recover from the existing osd.1
>>>> -Greg
>>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>>
>>>>
>>>> On Sat, May 3, 2014 at 9:17 AM, Jeff Bachtel
>>>> <jbachtel at bericotechnologies.com> wrote:
>>>>> This is all on firefly rc1 on CentOS 6
>>>>>
>>>>> I had an osd getting overfull, and misinterpreting directions I downed it
>>>>> then manually removed pg directories from the osd mount. On restart and
>>>>> after a good deal of rebalancing (setting osd weights as I should've
>>>>> originally), I'm now at
>>>>>
>>>>>       cluster de10594a-0737-4f34-a926-58dc9254f95f
>>>>>        health HEALTH_WARN 2 pgs backfill; 1 pgs incomplete; 1 pgs stuck inactive; 308 pgs stuck unclean; recovery 1/2420563 objects degraded (0.000%); noout flag(s) set
>>>>>        monmap e7: 3 mons at {controller1=10.100.2.1:6789/0,controller2=10.100.2.2:6789/0,controller3=10.100.2.3:6789/0}, election epoch 556, quorum 0,1,2 controller1,controller2,controller3
>>>>>        mdsmap e268: 1/1/1 up {0=controller1=up:active}
>>>>>        osdmap e3492: 5 osds: 5 up, 5 in
>>>>>               flags noout
>>>>>         pgmap v4167420: 320 pgs, 15 pools, 4811 GB data, 1181 kobjects
>>>>>               9770 GB used, 5884 GB / 15654 GB avail
>>>>>               1/2420563 objects degraded (0.000%)
>>>>>                      3 active
>>>>>                     12 active+clean
>>>>>                      2 active+remapped+wait_backfill
>>>>>                      1 incomplete
>>>>>                    302 active+remapped
>>>>>     client io 364 B/s wr, 0 op/s
>>>>>
>>>>> # ceph pg dump | grep 0.2f
>>>>> dumped all in format plain
>>>>> 0.2f    0       0       0       0       0       0       0 incomplete 2014-05-03 11:38:01.526832 0'0      3492:23 [4] 4       [4]     4 2254'20053      2014-04-28 00:24:36.504086      2100'18109 2014-04-26 22:26:23.699330
>>>>>
>>>>> # ceph pg map 0.2f
>>>>> osdmap e3492 pg 0.2f (0.2f) -> up [4] acting [4]
>>>>>
>>>>> The pg query for the downed pg is at
>>>>> https://gist.github.com/jeffb-bt/c8730899ff002070b325
>>>>>
>>>>> Of course, the osd I manually mucked with is the only one the cluster is
>>>>> picking up as up/acting. Now, I can query the pg and find epochs where other
>>>>> osds (that I didn't jack up) were acting. And in fact, the latest of those
>>>>> entries (osd.1) has the pg directory in its osd mount, and it's a good
>>>>> healthy 59 GB.
>>>>>
>>>>> I've tried manually rsync'ing (and preserving attributes) that set of
>>>>> directories from osd.1 to osd.4 without success. Likewise I've tried copying
>>>>> the directories over without attributes set. I've done many, many deep
>>>>> scrubs but the pg query does not show the scrub timestamps being affected.
>>>>>
>>>>> I'm seeking ideas for either fixing metadata on the directory on osd.4 to
>>>>> cause this pg to be seen/recognized, or ideas on forcing the cluster's pg
>>>>> map to point to osd.1 for the incomplete pg (basically wiping out the
>>>>> cluster's memory that osd.4 ever had 0.2f). Or any other solution :) It's
>>>>> only 59 GB, so worst case I'll mark it lost and recreate the pg, but I'd
>>>>> prefer to learn enough of the innards to understand what is going on, and
>>>>> possible means of fixing it.
>>>>>
>>>>> Thanks for any help,
>>>>>
>>>>> Jeff
>>>>>
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> ceph-users at lists.ceph.com
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com