On Wed, Feb 15, 2017 at 9:34 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Wed, 15 Feb 2017, Gregory Farnum wrote:
>> On Wed, Feb 15, 2017 at 8:59 AM, Wido den Hollander <wido@xxxxxxxx> wrote:
>> > Hi,
>> >
>> > Currently we can supply an OSD UUID to 'ceph-disk prepare', but we
>> > can't provide an OSD ID.
>> >
>> > With BlueStore coming I think the use case for this is becoming very
>> > relevant:
>> >
>> > 1. Stop OSD
>> > 2. Zap disk
>> > 3. Re-create OSD with same ID and UUID (with BlueStore)
>> > 4. Start OSD
>> >
>> > This allows for an in-place rebuild of the OSD without modifying the
>> > CRUSH map. From the cluster's point of view the OSD goes down and
>> > comes back up empty.
>> >
>> > There were some drawbacks and dangers around this, so before I start
>> > working on a PR for this: are there any gotchas which might be a
>> > problem?
>>
>> Yes. Unfortunately they are subtle and I don't remember them. :p
>>
>> I'd recommend going back and finding the historical discussions about
>> this to be sure. I *think* there were two main issues which prompted
>> us to remove that:
>> 1) people creating very large IDs, needlessly exploding OSDMap size
>> because it's all array-based,
>
> Working in terms of uuids should avoid this (i.e., users can't
> force a large osd id without significant effort).
>
>> 2) issues reusing the ID of lost OSDs versus PGs recognizing that the
>> OSD didn't have the data they wanted.
>>
>> 1 is still a bit of a problem; finding a good UX way of handling it is
>> the real issue. 2 has hopefully been fixed over the course of various
>> refactors and improvements, but it's not something I'd count on without
>> checking very carefully.
>
> Oh yeah, this is the one that worries me. I think the scenario we want
> to help users avoid is that osd N exists (but might be down at the
> moment) and a new, empty version of that same OSD is created and
> started. Peering will reasonably conclude that PG instances don't exist
> and may end up concluding that writes didn't happen.
>
> I think we want some sort of safety check so that the user has to say
> "this osd is dead" before they're allowed to create a new one in its
> image. I think the simplest thing is to use the existing 'ceph osd
> lost ...' command for this. I.e., the mon won't let a blank OSD start
> with a given uuid/id unless it is either a new osd rank or the rank is
> marked lost.

Certainly that's a good target, and using "ceph osd lost" is *supposed*
to handle this case. But I remember seeing reports that even after
running "lost" (certainly after, and I think in some cases before, the
ID was reused) the OSDs were still waiting for data from the new
incarnation. That worried me; some of those reports were addressed, but
I don't know whether the underlying issues have all been resolved, as I
think it's an area with little test coverage.

>
> My main lingering doubt here is whether it's a bad idea to reuse a uuid;
> it seems like the whole point is that uuids are unique. Perhaps instead
> the ceph-disk prepare --replace-osd NN command should replace the old
> uuid in the map with the new one as part of this process. Probably
> something like 'ceph osd replace newuuid olduuid' to make the whole
> thing idempotent...
>
> sage
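
For concreteness, here is a rough sketch of what the proposed flow could
look like for a single disk. The --osd-id option is exactly what is being
proposed in this thread and does not exist yet, and osd.42 / /dev/sdb are
just example names; the rest ('ceph osd lost', 'ceph-disk zap/prepare/
activate', --bluestore, --osd-uuid) exists today:

  # 1. stop the OSD (id 42 is only an example)
  $ systemctl stop ceph-osd@42

  # 2. tell the cluster the old data is gone for good; this is the safety
  #    gate discussed above, and it is what is *supposed to* let peering
  #    stop waiting for the old copies (the OSD must already be down)
  $ ceph osd lost 42 --yes-i-really-mean-it

  # 3. zap and re-prepare the disk with the same id and uuid;
  #    --osd-id is the hypothetical new option being proposed here
  $ ceph-disk zap /dev/sdb
  $ ceph-disk prepare --bluestore --osd-uuid <old-uuid> --osd-id 42 /dev/sdb

  # 4. start it again (or let udev/ceph-disk activate it); the OSD comes
  #    back empty under the same CRUSH position and backfills from peers
  $ ceph-disk activate /dev/sdb1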
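
If we went the replace-the-uuid route instead of reusing it, the tail end
of that sketch would change roughly as below. 'ceph osd replace' is only
the command Sage is floating above and does not exist; it is shown purely
to illustrate the idempotent variant:

  # record the old uuid before zapping; it is listed in the osdmap dump
  $ ceph osd dump | grep '^osd.42 '

  # generate a fresh uuid and (hypothetically) swap it into the map for
  # the same id, then prepare the disk with the new uuid
  $ NEW_UUID=$(uuidgen)
  $ ceph osd replace $NEW_UUID <old-uuid>    # proposed command, not real
  $ ceph-disk prepare --bluestore --osd-uuid $NEW_UUID --osd-id 42 /dev/sdb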