On 14/01/2014 16:39, Sage Weil wrote:
> Hi Andreas,
>
> On Tue, 14 Jan 2014, Andreas Joachim Peters wrote:
>> After some exchange with Loic and the recent list discussion,
>> the API of the EC plugin might need some clarification/extension in the ::encode method:
>>
>> Currently ::encode returns a map of bufferlists where the key is the index of [ 0 .. (m+k) ]
>> and the value is the encoded buffer belonging to that stripe index:
>>
>>     map<int, bufferlist> *encoded
>>
>> If a pyramid code is used the index would be [ 0 .. (m+k+(l*l_k)) ] where l is the number of local
>> parity subgroups and l_k are the number of parity stripes per subgroup. The pyramid code would just
>> chunk the input into the requested number of subgroups and compute local parity for them according
>> to the configuration.
>>
>> With this API the caller has actually no clue how to group stripes together for intelligent
>> placement allowing to keep subgroups with local parities together to minimize traffic
>> during remapping and reconstruction.
>
> This is a bit awkward, it's true. I'm not sure there is a 'magic' way to
> accomplish this. In the end, the CRUSH rule needs to have the required
> width *and* should group nodes accordingly, but this mapping happens at a
> very different layer in Ceph than the low-level plugin, so even if callers
> had this information they wouldn't be able to do anything about it.
>
> Currently, what we need to do is make sure the EC plugin maps onto a
> linear array of devices the same way that CRUSH does. For a pyramid code,
> the CRUSH rule will be something like
>
>   step take root
>   step choose 3 rack
>   step choose 5 osd
>   emit
>
> to get 3 groups of 5 devices as an array of size 15. That means the EC
> plugin needs to map onto ranks that go something like
>
>   0-3 data
>   4 local parity
>   5-8 data
>   9 local parity
>   10-11 data
>   12-13 global parity
>   14 local parity
>
> (or whatever).
>
> Getting this to line up is a bit fragile, unfortunately.
> We could make a plugin method that describes the subgrouping, but even then I'm not
> sure how easy it is to programmatically validate that an arbitrary CRUSH
> rule will behave well. Maybe it is enough to
>
>  - have some way to query the layout of the EC plugin (e.g, 3 groups of 5).
>  - add a new 'osd crush rule create-pyramid ...' command to supplement
>    'create-simple'.
>
> and document it well...
>
> sage

I created http://tracker.ceph.com/issues/7146 to keep track of this feature.

Cheers

>
>>
>> Either there is an additional function returning the location sub-group [ 0 .. l ] for each created
>> chunk or the ::encode function returns the chunks already grouped like:
>>
>>     vector<int, map<int, bufferlist> *encoded
>>
>> Probably it would be good to have both.
>>
>> However it is not clear, if you can actually remap/recover an OSD without destroying the locality
>> of pyramid encoding and if you can at all define CRUSH rules honoring the idea of chunk locality where
>> shrinking/extension of pools keeps the locality.
>>
>> Last question is, if a remapping/recovery action is only possible with the traffic going through the primary OSD.
>>
>> If locality cannot be supported sufficiently now or in the future, should the API stay as it is?
>>
>> The ::decode function is fine, since the plugin knows about the locality of the available chunks and will
>> select the cheapest decoding possible.
>>
>> Cheers Andreas.
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>>
>

-- 
Loïc Dachary, Artisan Logiciel Libre
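
For illustration only, the "query the layout of the EC plugin" idea from the quoted message (3 groups of 5, with the rank table given there) might be sketched in C++ roughly as below. Every name here (ChunkRole, PyramidLayout, group_of, example_layout, layout_is_consistent) is hypothetical and is not part of the Ceph plugin API. The sketch also assumes that the CRUSH rule fills the flat array one group at a time, so that rank r belongs to group r / 5, which is how the example rule in the message is described.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical types: none of these names exist in the Ceph plugin API.
enum class ChunkRole { Data, LocalParity, GlobalParity };

struct PyramidLayout {
  std::size_t groups;            // number of CRUSH buckets chosen (e.g. racks)
  std::size_t group_size;        // devices chosen per bucket
  std::vector<ChunkRole> roles;  // role of each rank in the flat array

  // Assumption: the rule emits all devices of group 0, then group 1, and so
  // on, so rank r of the flat array falls into group r / group_size.
  std::size_t group_of(std::size_t rank) const { return rank / group_size; }
};

// Builds the 15-rank example layout from the quoted message.
PyramidLayout example_layout() {
  using R = ChunkRole;
  return PyramidLayout{
      3, 5,
      {R::Data, R::Data, R::Data, R::Data, R::LocalParity,   // ranks 0-4
       R::Data, R::Data, R::Data, R::Data, R::LocalParity,   // ranks 5-9
       R::Data, R::Data, R::GlobalParity, R::GlobalParity,
       R::LocalParity}};                                     // ranks 10-14
}

// Sanity check a caller could run before creating a matching CRUSH rule:
// the roles must fill the array exactly, and every group must hold at
// least one local parity chunk.
bool layout_is_consistent(const PyramidLayout& l) {
  if (l.roles.size() != l.groups * l.group_size) return false;
  std::vector<int> local_parity(l.groups, 0);
  for (std::size_t r = 0; r < l.roles.size(); ++r)
    if (l.roles[r] == ChunkRole::LocalParity) ++local_parity[l.group_of(r)];
  for (int n : local_parity)
    if (n == 0) return false;
  return true;
}
```

A caller would build the layout with example_layout(), look up the group of a rank with group_of(rank), and refuse to pair the plugin with a CRUSH rule when layout_is_consistent() returns false. This only checks the plugin side; it does not validate that an arbitrary CRUSH rule behaves well, which the message notes is the hard part.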