Re: OSDMap checksums

On Tue, Aug 19, 2014 at 9:49 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>> Right, so let's talk about how we get into that situation:
>> 1) Our existing OSDMap is "bad."
>>   a) We were never "correct"
>>   b) ...we went bad and didn't notice?
>> 2) The Incremental we got is "bad".
>>   a) It's not the original Incremental generated by the mon cluster
>>   b) ...it got corrupted?
>> 3) We don't understand the Incremental we got and applied it wrong
>> (Any other categories I'm missing?)
>>
>> Let's leave out case 1a, because that's largely a transitional issue.
>> 1b should be protected against thanks to the checksums.
>
> Yeah
>
>> 2a just means that every time anybody sends an Incremental, it better
>> be the original. 2b should again be protected against thanks to its
>> checksum
>
> Yeah, the Incremental will have a crc for the incremental itself and the
> full map it should result in.
>
>> 3 is a little tricky: apparently we don't understand the full map, but
>> we're allowed to stay in the cluster according to feature bits and
>> other protections. So I think we're going to need to expose special
>> feature bits associated with the OSD Map, or perhaps just the encoding
>> version, in order to distinguish between "this isn't working right"
>> and "I don't understand it but I can keep running".
>
> Well, maybe.  If in both cases the end result that we want is for the OSD
> to get a correct/pristine map, it might not matter which one of those it
> is.
>
> For example:
>
>> Given that, we have two big challenges:
>> 1) We need to get any currently-divergent OSDMaps into sync for the
>> initial upgrade to this checksummed system,
>
> If the general response is "uh oh, get a correct map from the mon", then
> this is transparently repaired on upgrade.  Assuming we don't stumble onto
> other bugs.. but they should be less likely if the maps are already
> divergent, I would think!
>
>> 2) We need to prevent any OSD which doesn't completely understand a
>> map format from transmitting any self-generated bufferlists.
>
> Yes
>
>> The only way OSDs transmit self-generated bufferlists is if the
>> peer/client they're sending to isn't contiguous with the set of
>> Incrementals the OSD already has; in this case the OSD will encode and
>> send its oldest OSDMap.
>
> It will pull the encoded buffer off disk.  But, that buffer was generated
> by encoding the OSDMap when it was first received, so same thing.
>
>> (Or at least, this is the obvious and most
>> important case where they do that.) This is a pretty rare case and I
>> think we'd probably be okay with just having the peer go to the
>> monitors instead if this happens?
>
> Yeah.  The problem, though, is that this full OSDMap we stash on disk is
> also the map that the OSD uses for its own purposes, and the most
> important thing we want to do is ensure that the OSDs agree on the
> mapping.

Okay, I think this is why I was confused. If we're not just adding new
fields (with optional data) to the OSDMap, but are actually changing
the semantic meaning of existing structures, I don't see how we do
this without using feature bits. Getting a new OSDMap from elsewhere
isn't going to help if the OSD isn't interpreting the data properly,
and if it's capable of interpreting the data properly then it ought to
be able to understand how to apply the Incremental...

As far as I can tell, checksumming Incrementals is good for two
things besides detecting bit flips:
1) It's easy to extend to signing the Incremental, which is more secure
2) It protects against accidental divergence like we saw when we added
the extra heartbeat IP fields

Trying to get anything more out of it seems like attacking a problem
at very much the wrong layer.
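
To make the divergence case concrete, here is a rough sketch of the check
Sage describes (a crc for the Incremental itself plus a crc for the full map
it should produce). Everything in it is a stand-in chosen for illustration:
the dict-based map, the JSON encoding, the `understood_keys` parameter and
the `hb_back_addr` field name are hypothetical, and zlib's crc32 is used only
to stay within the standard library. It is not Ceph's actual encoding.

```python
import json
import zlib
from dataclasses import dataclass


def encode(osdmap: dict) -> bytes:
    # Stand-in for a deterministic map encoding (not Ceph's real format).
    return json.dumps(osdmap, sort_keys=True).encode()


def crc(data: bytes) -> int:
    # zlib.crc32 is used here purely for illustration.
    return zlib.crc32(data) & 0xFFFFFFFF


@dataclass
class Incremental:
    epoch: int
    changes: dict   # key -> new value (hypothetical payload format)
    inc_crc: int    # crc of the encoded changes
    full_crc: int   # crc of the full map the sender expects as the result


def make_incremental(old_map: dict, changes: dict) -> Incremental:
    # What the mon side would do: compute both checksums up front.
    new_map = {**old_map, **changes, "epoch": old_map["epoch"] + 1}
    return Incremental(new_map["epoch"], changes,
                       crc(encode(changes)), crc(encode(new_map)))


def apply_incremental(osdmap: dict, inc: Incremental, understood_keys: set):
    """Return (map, status); status is 'ok', 'bad_incremental' or 'diverged'."""
    if crc(encode(inc.changes)) != inc.inc_crc:
        return osdmap, "bad_incremental"      # bit flip in transit or on disk
    # An older decoder silently drops fields it does not know about.
    known = {k: v for k, v in inc.changes.items() if k in understood_keys}
    new_map = {**osdmap, **known, "epoch": inc.epoch}
    if crc(encode(new_map)) != inc.full_crc:
        return osdmap, "diverged"             # our re-encoding differs
    return new_map, "ok"
```

In this toy model, an OSD that does not know about `hb_back_addr` gets
"diverged" rather than quietly carrying a different map, which is the
accidental-divergence protection in point 2. It does nothing for an OSD that
decodes every field but interprets one of them differently.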

>  The fact that it's shared with clients is secondary to that.
>
>> So then we just have the upgrade issue to deal with. I think if we
>> prevent the monitors from enabling checksums until all the OSDs
>> support it, and then just have the OSDs query the monitors for any
>> non-conforming maps on upgrade, we should be good - divergent OSDMaps
>> are pretty rare.
>
> So, maybe:
>
>  - In general, the OSDs will fetch full maps from the mon if they find
> they can't generate them correctly from the incremental.
>  - We make that an exceptional case:
>    - When there is an actual bit flip
>    - On upgrade when we discover the maps went divergent ages ago
>    - When the mons are careless and encode an OSDMap that OSDs
> can't generate themselves.
>
> It's the third one I'm worried about.  We can spend a feature bit every
> time we change a structure in the OSDMap, but it will be expensive (in
> terms of feature bits) and a bit fragile (easy for a dev to modify one of
> those structs and not realize they also need to guard its use)
> because the generic struct encoding stuff is so forgiving.
>
> I think the options are:
>
>  1- Whatever, be careful and use feature bits when needed.
>  2- Make the OSDs do something smart about getting full maps from peers.
>  3- Always have users upgrade OSDs before mons
>  4- Completely change the nature of incremental maps so that we patch the
> previous map's encoding.  This will be immune to differences in encode
> behavior, but will probably double the size of the incrementals (assuming
> we keep both the semantic and bitwise diff).

As long as the field isn't changing how the mapping of data works (and
if it is, you *need* the full guards) then I think we can just have
feature bits or some equivalent *within* the OSDMap encoding. If an
Incremental arrives with stuff you don't understand, but you meet the
minimum requirements to even look at it, you just stop worrying about
the generated checksums, and make sure not to send out any full maps
which you encoded yourself from that point on.
Right?
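
Roughly, in sketch form (the class, the feature names and the three return
values below are all made up for illustration; the split into "required" and
"optional" feature sets within the map encoding is my assumption about how
this could be expressed, not an existing mechanism):

```python
from dataclasses import dataclass


@dataclass
class OSDMapState:
    # Features this OSD's code can decode and re-encode (hypothetical names).
    supported: frozenset
    # Both flags are sticky: once cleared they stay cleared.
    can_verify_crc: bool = True
    can_send_self_encoded: bool = True

    def on_incremental(self, required: frozenset, optional: frozenset) -> str:
        # Anything that changes how data maps must be a hard requirement.
        if required - self.supported:
            return "reject"
        # Unknown optional content: keep running, but stop trusting our own
        # encoding from this point on.
        if optional - self.supported:
            self.can_verify_crc = False
            self.can_send_self_encoded = False
        return "apply_verified" if self.can_verify_crc else "apply_unverified"
```

An OSD with `can_send_self_encoded` cleared would point peers at the monitors
for full maps instead of sending its own encoding.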