On Tue, Dec 17, 2013 at 3:36 AM, Alexandre Oliva <oliva@xxxxxxx> wrote:
> On Feb 20, 2013, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>
>> On Tue, Feb 19, 2013 at 2:52 PM, Alexandre Oliva <oliva@xxxxxxx> wrote:
>>> It recently occurred to me that I messed up an OSD's storage, and
>>> decided that the easiest way to bring it back was to roll it back to an
>>> earlier snapshot I'd taken (along the lines of clustersnap) and let it
>>> recover from there.
>>>
>>> The problem with that idea was that the cluster had advanced too much
>>> since the snapshot was taken: the latest OSDMap known by that snapshot
>>> was far behind the range still carried by the monitors.
>>>
>>> Determined to let that osd recover from all the data it already had,
>>> rather than restarting from scratch, I hacked up a “solution” that
>>> appears to work: with the patch below, the OSD will use the contents of
>>> an earlier OSDMap (presumably the latest one it has) in place of a newer
>>> OSDMap it can't get any more.
>>>
>>> A single run of the osd with this patch was enough for it to pick up the
>>> newer state and join the cluster; from then on, the patched osd was no
>>> longer necessary, and presumably should not be used except for this sort
>>> of emergency.
>>>
>>> Of course, this can only possibly work reliably if other nodes are up
>>> with the same or newer versions of each of the PGs (but then, rolling
>>> back the OSD to an older snapshot wouldn't be safe otherwise). I don't
>>> know of any other scenarios in which this patch will not recover things
>>> correctly, but unless someone far more familiar with ceph internals than
>>> I am vouches for it, I'd recommend using this only if you're really
>>> desperate to avoid a recovery from scratch. Save snapshots of the other
>>> osds (as you probably already do, or you wouldn't have older snapshots
>>> to roll back to :-) and of the mon *before* you run the patched
>>> ceph-osd, and stop the mds or otherwise avoid changes that you're not
>>> willing to lose, in case the patch doesn't work for you and you have to
>>> go back to the saved state and let the osd recover from scratch. If it
>>> works, lucky us; if it breaks, well, I told you :-)
>
>> Yeah, this ought to basically work, but it's very dangerous --
>> potentially breaking invariants about cluster state changes, etc. I
>> wouldn't use it if the cluster weren't otherwise healthy; other nodes
>> breaking in the middle of this operation could cause serious problems,
>> etc. I'd much prefer that one just recover over the wire using the
>> normal recovery paths... ;)
>
> Here's an updated version of the patch, which makes it much faster than
> the earlier version, particularly when the gap between the latest osdmap
> known by the osd and the earliest osdmap known by the cluster is large.
> There are some #if 0-ed out portions of the code for experiments that
> turned out to be unnecessary, but that I didn't quite want to throw
> away. I've used this patch for quite a while, and I wanted to post a
> working version, rather than some cleaned-up version in which I might
> accidentally introduce errors.

Is this actually still necessary in the latest dumpling and emperor
branches? I thought sufficiently-old OSDs would go through backfill with
the new PG members in order to get up-to-date without copying all the
data.
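For anyone following the thread without the patch in front of them, here is a
minimal, self-contained sketch of the fallback Alexandre describes: when the
requested OSDMap epoch has already been trimmed away, hand back the newest map
the daemon still holds so it can keep booting and then catch up through normal
recovery. This is illustrative only and is not the actual patch; the type and
function names (EpochStore, get_map_or_latest) are made up for the example and
are not Ceph's real OSDMap/ObjectStore APIs.

    // Toy model of the "substitute an older map" emergency fallback.
    // Everything here is hypothetical; it only demonstrates the idea.
    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <optional>
    #include <string>

    using epoch_t = uint32_t;

    struct EpochStore {
        // epoch -> serialized map blob (stand-in for a full OSDMap)
        std::map<epoch_t, std::string> maps;

        // Normal lookup: only succeeds if we actually stored that epoch.
        std::optional<std::string> get_map(epoch_t e) const {
            auto it = maps.find(e);
            if (it == maps.end())
                return std::nullopt;
            return it->second;
        }

        // Emergency fallback: if epoch e is gone (e.g. the OSD was rolled
        // back to an old snapshot and the cluster trimmed past it),
        // substitute the newest map we still have instead of failing.
        std::optional<std::string> get_map_or_latest(epoch_t e) const {
            if (auto m = get_map(e))
                return m;
            if (maps.empty())
                return std::nullopt;
            std::cerr << "warning: epoch " << e << " unavailable, substituting "
                      << maps.rbegin()->first << "\n";
            return maps.rbegin()->second;   // newest epoch we hold
        }
    };

    int main() {
        EpochStore store;
        store.maps[100] = "osdmap@100";
        store.maps[101] = "osdmap@101";

        // The cluster has advanced to epoch 250; this store stops at 101.
        auto m = store.get_map_or_latest(250);
        std::cout << (m ? *m : "<none>") << "\n";   // prints osdmap@101
    }

As the thread already stresses, substituting a stale map like this is only
plausible when other OSDs hold current copies of every PG, which is why it is
framed strictly as a last-resort hack rather than a supported recovery path.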
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com