On 12/01/2014 15:31, Kyle Bader wrote:
>> If we had RS(6:3:3) -- 6 data chunks, 3 coding chunks, 3 local chunks -- the following rule could be used to spread it over 3 datacenters:
>>
>> rule erasure_ruleset {
>>         ruleset 1
>>         type erasure
>>         min_size 3
>>         max_size 20
>>         step set_chooseleaf_tries 5
>>         step take root
>>         step choose indep 3 type datacenter
>>         step choose indep 4 type device
>>         step emit
>> }
>>
>> crushtool -o /tmp/t.map --num_osds 500 --build node straw 10 datacenter straw 10 root straw 0
>> crushtool -d /tmp/t.map -o /tmp/t.txt # edit the ruleset as above
>> crushtool -c /tmp/t.txt -o /tmp/t.map ; crushtool -i /tmp/t.map --show-bad-mappings --show-statistics --test --rule 1 --x 1 --num-rep 12
>> rule 1 (erasure_ruleset), x = 1..1, numrep = 12..12
>> CRUSH rule 1 x 1 [399,344,343,321,51,78,9,12,274,263,270,213]
>> rule 1 (erasure_ruleset) num_rep 12 result size == 12: 1/1
>>
>> 399 is in datacenter 3, node 9, device 9, etc. It shows that the first four chunks are in datacenter 3, the next four in datacenter 0 and the last four in datacenter 2.
>>
>> If the function calculating the erasure code spreads the local chunks evenly (321, 12, 213 for instance), they will effectively be located as you suggest. Andreas may have a different view on this question though.
>>
>> In case 78 goes missing (and assuming all other chunks are good), it can be rebuilt with 512, 9, 12 only. However, if the primary driving the reconstruction is 270, data will need to cross datacenter boundaries. Would it be cheaper to elect a primary closest (in the sense of get_common_ancestor_distance, https://github.com/ceph/ceph/blob/master/src/crush/CrushWrapper.h#L487) to the OSD to be recovered? Only Sam or David could give you an authoritative answer.
> There is a typo : "rebuilt with 51, 9, 12"
> That scheme does work out, except I'm not sure how the
> recovering/remapped OSD can make sure to read the local copies. I'll
> wait to hear what Sam and David have to say. That said, what do you
> think of this diagram of a 12,2,2:
>
> http://storagemojo.com/wp-content/uploads/2012/09/LRC_parity.PNG from [4]
>
> With a 12,2,2 LRC code the stripe is split into 12 units, and those 12
> units are used to generate 2 global parity units. The original 12
> units are then split into 2 groups, and each group separately
> generates an additional local parity unit. Each group of 6 original
> units and 1 local parity unit is passed through CRUSH for placement with a
> common ancestor (for local recovery). This would have low overhead,
> certainly less than replication, and be friendlier on inter-dc links
> during remapping and recovery.
>
> [4] http://storagemojo.com/2012/09/26/more-efficient-erasure-coding-in-windows-azure-storage/
>

How is it different from what is described above? There must be something I fail to understand. In the above example

  CRUSH rule 1 x 1 [399,344,343,321,51,78,9,12,274,263,270,213]

would be

  datacenter 3
    399 data
    344 data
    343 global parity
    321 local parity

  datacenter 0
    51 data
    78 data
    9 global parity
    12 local parity

  datacenter 2
    274 data
    263 data
    270 global parity
    213 local parity

If 9 is missing, recovery looks for the chunks that have the closest common ancestor, finds 51, 78, 12 and, since there is only one missing chunk, uses local parity chunk 12 to rebuild it. If 9 and 12 are missing, it needs all the remaining data + global parity chunks to rebuild. In this example the overhead of erasure coding is high, but increasing the number of data chunks reduces it.

Cheers

--
Loïc Dachary, Artisan Logiciel Libre
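
[Editor's illustration] A minimal Python sketch of the local-vs-global recovery decision described in the message above. It is not Ceph code: the group layout, the chunk roles and all names (GROUPS, chunks_needed) are assumptions taken from the example mapping [399,344,343,321,51,78,9,12,274,263,270,213], where each datacenter holds 2 data chunks, 1 global parity chunk and 1 local parity chunk.

    # Sketch only: follows the reasoning in the mail, not an actual implementation.
    # Each group corresponds to one datacenter (closest common ancestor).
    GROUPS = {
        "datacenter 3": {"data": [399, 344], "global": [343], "local": [321]},
        "datacenter 0": {"data": [51, 78],   "global": [9],   "local": [12]},
        "datacenter 2": {"data": [274, 263], "global": [270], "local": [213]},
    }

    def chunks_needed(missing):
        """Return the set of chunks to read in order to rebuild the missing chunks.

        If a group lost exactly one chunk, the surviving chunks of that group
        are enough (local recovery, no inter-datacenter traffic).  If a group
        lost more than one chunk, fall back to global recovery and read all
        surviving data + global parity chunks.
        """
        missing = set(missing)
        needed = set()
        for group in GROUPS.values():
            members = set(group["data"] + group["global"] + group["local"])
            lost = members & missing
            if not lost:
                continue
            if len(lost) == 1:
                # local recovery: the rest of the group suffices
                needed |= members - lost
            else:
                # global recovery: every surviving data + global parity chunk
                for g in GROUPS.values():
                    needed |= set(g["data"] + g["global"])
                return needed - missing
        return needed

    print(sorted(chunks_needed([9])))      # [12, 51, 78] -- stays inside datacenter 0
    print(sorted(chunks_needed([9, 12])))  # all surviving data + global parity chunks

On the overhead remark: treating each chunk as equal size, the 6+3+3 layout above stores 12 chunks for 6 data chunks (2.0x raw overhead), while the 12,2,2 LRC code stores 16 units for 12 data units (about 1.33x), which matches the point that increasing the number of data chunks reduces the overhead.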