Re: OSDMap checksums

> Right, so let's talk about how we get into that situation:
> 1) Our existing OSDMap is "bad."
>   a) We were never "correct"
>   b) ...we went bad and didn't notice?
> 2) The Incremental we got is "bad".
>   a) It's not the original Incremental generated by the mon cluster
>   b) ...it got corrupted?
> 3) We don't understand the Incremental we got and applied it wrong
> (Any other categories I'm missing?)
> 
> Let's leave out case 1a, because that's largely a transitional issue.
> 1b should be protected against thanks to the checksums.

Yeah

> 2a just means that every time anybody sends an Incremental, it better
> be the original. 2b should again be protected against thanks to its
> checksum

Yeah, the Incremental will have a crc for the incremental itself and the 
full map it should result in.
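
A rough sketch of that double check, in Python for brevity. Everything here is hypothetical: the `Incremental` fields, the JSON stand-in for the real map encoding, and `zlib.crc32` standing in for whatever checksum the real code uses. It only shows which crc catches which failure case from the list above.

```python
import json
import zlib
from dataclasses import dataclass


def encode_map(osdmap: dict) -> bytes:
    # Stand-in for the real map encoder: any deterministic encoding will do.
    return json.dumps(osdmap, sort_keys=True).encode()


def encode_changes(epoch: int, changes: dict) -> bytes:
    # Stand-in for the encoded Incremental payload.
    return json.dumps({"epoch": epoch, "changes": changes}, sort_keys=True).encode()


@dataclass
class Incremental:
    epoch: int
    changes: dict   # hypothetical payload: key -> new value
    inc_crc: int    # crc of the incremental itself
    full_crc: int   # crc of the full map it should result in


def make_incremental(old_map: dict, changes: dict) -> Incremental:
    # What the mon side would do: record both checksums at creation time.
    epoch = old_map["epoch"] + 1
    new_map = {**old_map, **changes, "epoch": epoch}
    return Incremental(
        epoch,
        changes,
        zlib.crc32(encode_changes(epoch, changes)),
        zlib.crc32(encode_map(new_map)),
    )


def apply_incremental(old_map: dict, inc: Incremental):
    """Return the new map, or None if either checksum does not match."""
    if zlib.crc32(encode_changes(inc.epoch, inc.changes)) != inc.inc_crc:
        return None  # case 2b: the incremental itself is damaged
    new_map = {**old_map, **inc.changes, "epoch": inc.epoch}
    if zlib.crc32(encode_map(new_map)) != inc.full_crc:
        return None  # case 1b or 3: our starting map or our encoding is off
    return new_map
```

The second check is the one that matters for divergence: it fails even when the incremental arrived intact, as long as the map we started from was not the one the mon had.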

> 3 is a little tricky: apparently we don't understand the full map, but
> we're allowed to stay in the cluster according to feature bits and
> other protections. So I think we're going to need to expose special
> feature bits associated with the OSD Map, or perhaps just the encoding
> version, in order to distinguish between "this isn't working right"
> and "I don't understand it but I can keep running".

Well, maybe.  If in both cases the end result that we want is for the OSD 
to get a correct/pristine map, it might not matter which one of those it 
is.

For example:

> Given that, we have two big challenges:
> 1) We need to get any currently-divergent OSDMaps into sync for the
> initial upgrade to this checksummed system,

If the general response is "uh oh, get a correct map from the mon", then 
this is transparently repaired on upgrade.  Assuming we don't stumble onto 
other bugs, that is, but those should be less likely if the maps are 
already divergent, I would think!

> 2) We need to prevent any OSD which doesn't completely understand a
> map format from transmitting any self-generated bufferlists.

Yes

> The only way OSDs transmit self-generated bufferlists is if the
> peer/client they're sending to isn't contiguous with the set of
> Incrementals the OSD already has; in this case the OSD will encode and
> send its oldest OSDMap.

It will pull the encoded buffer off disk.  But, that buffer was generated 
by encoding the OSDMap when it was first received, so same thing.

> (Or at least, this is the obvious and most
> important case where they do that.) This is a pretty rare case and I
> think we'd probably be okay with just having the peer go to the
> monitors instead if this happens?

Yeah.  The problem, though, is that this full OSDMap we stash on disk is 
also the map that the OSD uses for its own purposes, and the most 
important thing we want to do is ensure that the OSDs agree on the 
mapping.  The fact that it's shared with clients is secondary to that.

> So then we just have the upgrade issue to deal with. I think if we
> prevent the monitors from enabling checksums until all the OSDs
> support it, and then just have the OSDs query the monitors for any
> non-conforming maps on upgrade, we should be good; divergent OSDMaps
> are pretty rare.

So, maybe:

 - In general, the OSDs will fetch full maps from the mon if they find 
they can't generate them correctly from the incremental.  
 - We make that an exceptional case:
   - When there is an actual bit flip
   - On upgrade when we discover the maps went divergent ages ago
   - When the mons are careless and encode an OSDMap that OSDs 
can't generate themselves.
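
Roughly, the general rule in that list could look like the sketch below. The function and callback names are made up for illustration, and `zlib.crc32` is only a stand-in for the real checksum; the point is just that the mon fetch is the fallback path, not the normal one.

```python
import zlib
from typing import Callable, Optional, Tuple


def advance_map(
    candidate: Optional[bytes],
    expected_crc: int,
    epoch: int,
    fetch_full_from_mon: Callable[[int], bytes],
) -> Tuple[bytes, str]:
    """Pick the encoded full map to store for `epoch`.

    candidate is the encoding we generated locally by applying the
    incremental, or None if we could not apply it at all.
    """
    if candidate is not None and zlib.crc32(candidate) == expected_crc:
        return candidate, "local"  # normal case: no mon round trip

    # Exceptional case: bit flip, old divergence, or an encoding we
    # cannot reproduce. Ask the mon for the pristine buffer instead.
    full = fetch_full_from_mon(epoch)
    if zlib.crc32(full) != expected_crc:
        raise RuntimeError(f"full map for epoch {epoch} from mon fails crc too")
    return full, "mon"
```

Note that the OSD stores the mon's bytes as-is in the fallback case rather than re-encoding them, since re-encoding is exactly what it could not get right.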

It's the third one I'm worried about.  We can spend a feature bit every 
time we change a structure in the OSDMap, but it will be expensive (in 
terms of feature bits) and a bit fragile (easy for a dev to modify one of 
those structs and not realize they also need to guard it being used) 
because the generic struct encoding stuff is so forgiving.
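
To illustrate what "forgiving" means here, a toy example. The header layout (version, compat version, payload length) and the field names are invented for the sketch and are not the real encoding; the behavior it shows is the general one, where an old decoder happily skips fields it doesn't know about and then re-encodes something different.

```python
import struct

HEADER = "<BBI"  # hypothetical header: struct version, compat version, payload length
HEADER_LEN = struct.calcsize(HEADER)


def encode_v1(weight: int) -> bytes:
    # Old code: knows about one field only.
    payload = struct.pack("<I", weight)
    return struct.pack(HEADER, 1, 1, len(payload)) + payload


def encode_v2(weight: int, flags: int) -> bytes:
    # New code: appends a field but stays decodable by v1 (compat == 1).
    payload = struct.pack("<II", weight, flags)
    return struct.pack(HEADER, 2, 1, len(payload)) + payload


def decode_v1(buf: bytes) -> int:
    # Old decoder: accepts anything whose compat version it supports and
    # silently ignores whatever follows the fields it understands.
    version, compat, length = struct.unpack_from(HEADER, buf, 0)
    if compat > 1:
        raise ValueError("encoding too new to decode")
    (weight,) = struct.unpack_from("<I", buf, HEADER_LEN)
    return weight


mon_bytes = encode_v2(100, 0x1)       # what a newer mon would encode
weight = decode_v1(mon_bytes)         # old OSD decodes without complaint
osd_bytes = encode_v1(weight)         # ...and re-encodes without `flags`
```

Decoding succeeded, so nothing tells the old OSD it lost anything, yet `osd_bytes` differs from `mon_bytes` and so will any checksum over it.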

I think the options are:

 1- Whatever, be careful and use feature bits when needed.
 2- Make the OSDs do something smart about getting full maps from peers.
 3- Always have users upgrade OSDs before mons
 4- Completely change the nature of incremental maps so that we patch the 
previous map's encoding.  This will be immune to differences in encode 
behavior, but will probably double the size of the incrementals (assuming 
we keep both the semantic and bitwise diff).

?
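
For option 4, a minimal sketch of what a bitwise patch against the previous map's encoding might look like. The copy/data op format is invented here and `difflib` is just a convenient way to find matching byte ranges, not a proposal for the actual diff algorithm.

```python
from difflib import SequenceMatcher


def make_patch(old: bytes, new: bytes) -> list:
    """Describe `new` as copies from `old` plus literal byte runs."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))       # reuse old[i1:i2]
        elif tag in ("replace", "insert"):
            ops.append(("data", new[j1:j2]))   # ship the new bytes verbatim
        # "delete": bytes only in old, nothing to emit
    return ops


def apply_patch(old: bytes, ops: list) -> bytes:
    """Rebuild the new encoding without ever decoding the map."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)
```

Because the receiver never decodes or re-encodes anything, the result is byte-identical to what the mon produced as long as the previous encoding was, regardless of which encoder version the receiver runs.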

sage