Re: EC API to expose locality

On 14/01/2014 16:39, Sage Weil wrote:
> Hi Andreas,
> 
> On Tue, 14 Jan 2014, Andreas Joachim Peters wrote:
>> After some exchange with Loic and the recent list discussion, 
>> the API of the EC plugin might need some clarification/extension in the ::encode method:
>>
>> Currently ::encode returns a map of bufferlists where the key is the index of [ 0 .. (m+k) ] 
>> and the value is the encoded buffer belonging to that stripe index:
>>
>> map<int, bufferlist> *encoded
>>
>> If a pyramid code is used the index would be [ 0 .. (m+k+(l*l_k)) ] where l is the number of local 
>> parity subgroups and l_k is the number of parity stripes per subgroup. The pyramid code would just
>> chunk the input into the requested number of subgroups and compute local parity for them according 
>> to the configuration.
>>
>> With this API the caller has no way of knowing how to group stripes together for intelligent
>> placement, i.e. how to keep subgroups together with their local parities to minimize traffic 
>> during remapping and reconstruction.
> 
> This is a bit awkward, it's true.  I'm not sure there is a 'magic' way to 
> accomplish this.  In the end, the CRUSH rule needs to have the required 
> width *and* should group nodes accordingly, but this mapping happens at a 
> very different layer in Ceph than the low-level plugin, so even if callers 
> had this information they wouldn't be able to do anything about it.
> 
> Currently, what we need to do is make sure the EC plugin maps onto a 
> linear array of devices the same way that CRUSH does.  For a pyramid code, 
> the CRUSH rule will be something like 
> 
>  step take root
>  step choose 3 rack
>  step choose 5 osd
>  emit
> 
> to get 3 groups of 5 devices as an array of size 15.  That means the EC
> plugin needs to map onto ranks that go something like
> 
>  0-3 data
>  4 local parity
>  5-8 data
>  9 local parity
>  10-11 data
>  12-13 global parity
>  14 local parity
> 
> (or whatever).
> 
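
A minimal sketch of the rank layout quoted above, to make the mapping concrete. The names GroupLayout, RankInfo, flatten and example_layout are hypothetical and are not part of the EC plugin API; the only assumption is that the CRUSH rule returns the devices group by group, in order.

```cpp
#include <cassert>
#include <vector>

// Hypothetical description of one CRUSH subgroup (e.g. one rack).
// Not part of the Ceph plugin API.
struct GroupLayout {
  int data;           // data chunks in this group
  int global_parity;  // global parity chunks in this group
  int local_parity;   // local parity chunks in this group
};

enum class Role { Data, GlobalParity, LocalParity };

struct RankInfo {
  int group;  // index of the subgroup this rank belongs to
  Role role;  // what the chunk at this rank holds
};

// Flatten the per-group layout into the linear array of ranks that the
// CRUSH rule is assumed to produce: group 0 first, then group 1, and so on.
std::vector<RankInfo> flatten(const std::vector<GroupLayout>& groups) {
  std::vector<RankInfo> ranks;
  for (int g = 0; g < static_cast<int>(groups.size()); ++g) {
    const GroupLayout& gl = groups[g];
    assert(gl.data >= 0 && gl.global_parity >= 0 && gl.local_parity >= 0);
    for (int i = 0; i < gl.data; ++i) ranks.push_back({g, Role::Data});
    for (int i = 0; i < gl.global_parity; ++i) ranks.push_back({g, Role::GlobalParity});
    for (int i = 0; i < gl.local_parity; ++i) ranks.push_back({g, Role::LocalParity});
  }
  return ranks;
}

// The example from the mail: 3 groups of 5 ranks each, 15 ranks in total.
std::vector<RankInfo> example_layout() {
  return flatten({{4, 0, 1}, {4, 0, 1}, {2, 2, 1}});
}
```

With this layout, ranks 4, 9 and 14 carry the local parity of groups 0, 1 and 2, and ranks 12-13 carry the global parity.
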
> Getting this to line up is a bit fragile, unfortunately.  We could make
> a plugin method that describes the subgrouping, but even then I'm not
> sure how easy it is to programmatically validate that an arbitrary CRUSH
> rule will behave well.  Maybe it is enough to
> 
> - have some way to query the layout of the EC plugin (e.g., 3 groups of 5).
> - add a new 'osd crush rule create-pyramid ...' command to supplement 
>   'create-simple'.
> 
> and document it well... 
> 
> sage

I created http://tracker.ceph.com/issues/7146 to keep track of this feature.
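
As a rough illustration of what "query the layout of the EC plugin" could mean, here is a sketch that represents the layout as a list of group sizes and checks it against a two-level rule such as "choose 3 racks, then 5 OSDs in each". GroupSizes, matches_rule and group_of_rank are hypothetical names used only for this example; nothing like them exists in the plugin interface today.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical answer to a layout query: the number of chunks in each
// locality group, listed in rank order (e.g. {5, 5, 5}).
using GroupSizes = std::vector<int>;

// Check that the layout lines up with a two-level rule that picks
// num_groups buckets and per_group OSDs inside each bucket.
bool matches_rule(const GroupSizes& layout, int num_groups, int per_group) {
  if (static_cast<int>(layout.size()) != num_groups) return false;
  for (int size : layout) {
    if (size != per_group) return false;
  }
  return true;
}

// Map every chunk rank to the index of the locality group it falls in,
// assuming groups occupy consecutive ranks.
std::map<int, int> group_of_rank(const GroupSizes& layout) {
  std::map<int, int> out;
  int rank = 0;
  for (int g = 0; g < static_cast<int>(layout.size()); ++g) {
    assert(layout[g] >= 0);
    for (int i = 0; i < layout[g]; ++i) out[rank++] = g;
  }
  return out;
}
```

A check like matches_rule only covers the simple "N groups of M" case; validating an arbitrary CRUSH rule is the harder problem Sage mentions.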

Cheers
> 
>>
>> Either there is an additional function returning the location sub-group [ 0 .. l ] for each created 
>> chunk or the ::encode function returns the chunks already grouped like:
>>
>> vector<map<int, bufferlist> > *encoded
>>
>> Probably it would be good to have both.
>>
>> However, it is not clear whether you can actually remap/recover an OSD without destroying the locality 
>> of pyramid encoding, or whether you can define CRUSH rules at all that honor chunk locality and
>> preserve it when pools shrink or grow.
>>
>> The last question is whether a remapping/recovery action is only possible with the traffic going through the primary OSD.
>>
>> If locality cannot be supported sufficiently now or in the future, should the API stay as it is?
>>
>> The ::decode function is fine, since the plugin knows about the locality of the available chunks and will
>> select the cheapest decoding possible.
>>
>> Cheers Andreas.
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>

-- 
Loïc Dachary, Artisan Logiciel Libre
