Re: EC API to expose locality


 



Hi Andreas,

On Tue, 14 Jan 2014, Andreas Joachim Peters wrote:
> After some exchange with Loic and the recent list discussion, 
> the API of the EC plugin might need some clarification/extension in the ::encode method:
> 
> Currently ::encode returns a map of bufferlists where the key is the index of [ 0 .. (m+k) ] 
> and the value is the encoded buffer belonging to that stripe index:
> 
> map<int, bufferlist> *encoded
> 
> If a pyramid code is used the index would be [ 0 .. (m+k+(l*l_k)) ] where l is the number of local 
> parity subgroups and l_k is the number of parity stripes per subgroup. The pyramid code would just
> chunk the input into the requested number of subgroups and compute local parity for them according 
> to the configuration.
> 
> With this API the caller has actually no clue how to group stripes together for intelligent
> placement allowing to keep subgroups with local parities together to minimize traffic 
> during remapping and reconstruction.

This is a bit awkward, it's true.  I'm not sure there is a 'magic' way to 
accomplish this.  In the end, the CRUSH rule needs to have the required 
width *and* should group nodes accordingly, but this mapping happens at a 
very different layer in Ceph than the low-level plugin, so even if callers 
had this information they wouldn't be able to do anything about it.

Currently, what we need to do is make sure the EC plugin maps onto a 
linear array of devices the same way that CRUSH does.  For a pyramid code, 
the CRUSH rule will be something like 

 step take root
 step choose 3 rack
 step choose 5 osd
 emit

to get 3 groups of 5 devices as an array of size 15.  That means the EC
plugin needs to map onto ranks that go something like

 0-3 data
 4 local parity
 5-8 data
 9 local parity
 10-11 data
 12-13 global parity
 14 local parity

(or whatever).

Getting this to line up is a bit fragile, unfortunately.  We could make
a plugin method that describes the subgrouping, but even then I'm not
sure how easy it is to programmatically validate that an arbitrary CRUSH
rule will behave well.  Maybe it is enough to

- have some way to query the layout of the EC plugin (e.g., 3 groups of 5).
- add a new 'osd crush rule create-pyramid ...' command to supplement 
  'create-simple'.

and document it well... 
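
To make the first point concrete, a minimal sketch of what such a check might
look like is below.  PluginLayout, RuleShape and rule_matches_layout are
made-up names for illustration, and the rule is reduced to the counts of its
two 'step choose' lines; validating an arbitrary CRUSH rule would need much
more than this.

```cpp
#include <cassert>
#include <vector>

// Hypothetical answer to "what layout does the EC plugin expect?",
// e.g. 3 groups of 5.
struct PluginLayout {
  int groups;
  int group_size;
  int width() const { return groups * group_size; }
};

// Hypothetical, heavily simplified view of a CRUSH rule: just the counts
// of its "step choose N <type>" lines, in order.
struct RuleShape {
  std::vector<int> choose_counts;
};

// True only for a two-level rule whose outer count equals the number of
// groups and whose inner count equals the group size.
bool rule_matches_layout(const RuleShape& rule, const PluginLayout& layout) {
  if (rule.choose_counts.size() != 2) {
    return false;
  }
  return rule.choose_counts[0] == layout.groups &&
         rule.choose_counts[1] == layout.group_size;
}
```

With the example rule above (choose 3 rack, choose 5 osd) and a layout of 3
groups of 5, the check passes; a rule that chooses 5 racks of 3 has the same
width of 15 but would be rejected, which is the case that is easy to get wrong
by hand.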

sage


> 
> Either there is an additional function returning the location sub-group [ 0 .. l ] for each created 
> chunk or the ::encode function returns the chunks already grouped like:
> 
> vector<map<int, bufferlist> > *encoded
> 
> Probably it would be good to have both.
> 
> However it is not clear, if you can actually remap/recover an OSD without destroying the locality 
> of pyramid encoding and if you can at all define CRUSH rules honoring the idea of chunk locality where
> shrinking/extension of pools keeps the locality.
> 
> Last question is, if a remapping/recovery action is only possible with the traffic going through the primary OSD.
> 
> If locality cannot be supported sufficiently now or in the future, should the API stay as it is?
> 
> The ::decode function is fine, since the plugin knows about the locality of the available chunks and will
> select the cheapest decoding possible.
> 
> Cheers Andreas.
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 