On Tue, 17 Apr 2012 01:57:35 +0200 Edward Shishkin <edward@xxxxxxxxxx> wrote:

> Comment 2. There is a disadvantage: in this approach all files
>
>    /foo
>    /dir1/foo
>    /dir1/dir2/foo
>    ...
>
> will be accumulated on the same brick. However it is possible to
> "salt" a short file names with gfid (or another id) of respective
> directory before hashing, to avoid possible attacks.

That's the easy problem.  The harder problem is that the "only split one
brick" approach creates imbalances that are *permanent and cumulative*.
In other words, each change is likely to take us further from optimal
distribution, so that measures such as virtual nodes become strictly
necessary to preserve any semblance of proper load/capacity balancing.
I had implemented an approach based on moving files from only one hash
range instead of from only one brick (a demonstrably superior
generalization of the idea) and it still exhibits this behavior.  You
had those results before you ever wrote this proposal.  We still need a
rebalance method that restores "perfect" distribution, even if it's not
the only one we use.

> 2. Virtual nodes

Virtual nodes are a bad idea.  Even the people who included them in the
Dynamo design have said as much since.  The main problem is that they
accumulate indefinitely.  If one relies on them too much to fix other
problems, the result is very large numbers of virtual node IDs and very
large lookup tables.  This is compounded in our case by the fact that
information about node IDs or ranges is contained in xattrs, so adding
too many will make fetching that information (a frequent operation)
less efficient.  At the very least, virtual node IDs need to be
aggressively pruned, even if that means incurring some data-movement
cost.  Even better, I think we should stay away from them entirely.  My
favorite alternative is multiple hash rings, as follows:

    ring_hash   = hash(file_id)
    ring_to_use = ring_hash % num_rings
    node_hash   = hash(ring_hash / num_rings)
    node_to_use = lookup(node_hash, lookup_table[ring_to_use])

With quite small values of num_rings, this scheme rapidly approaches the
same flexibility and efficiency as virtual node IDs.  With careful
assignment of ranges within each ring, it can also ensure that the load
from a failed node is spread out across up to num_rings successors
instead of just one (as with a single ring, even when using virtual
nodes).
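To make the lookup above concrete, here's a rough sketch in Python.
This is hypothetical throughout: the md5-derived 64-bit hash, the four
rings, the four bricks, and the randomly shuffled per-ring permutations
are illustrative assumptions, not actual DHT code.

    # Sketch of the multi-ring lookup described above (hypothetical).
    import hashlib
    import random
    from bisect import bisect_right

    NUM_RINGS = 4
    BRICKS = ["brick0", "brick1", "brick2", "brick3"]
    SPACE = 2 ** 64
    STEP = SPACE // len(BRICKS)

    def hash64(s):
        # 64-bit hash taken from the top of an md5 digest
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    # One lookup table per ring: a sorted list of (range_start, brick).
    # A real implementation would choose the per-ring permutations
    # deliberately so that a failed brick's ranges fall to different
    # successors in different rings; a shuffle is just a stand-in here.
    rng = random.Random(0)
    lookup_tables = []
    for _ in range(NUM_RINGS):
        order = BRICKS[:]
        rng.shuffle(order)
        lookup_tables.append([(i * STEP, order[i]) for i in range(len(BRICKS))])

    def lookup(node_hash, table):
        # find the brick whose range contains node_hash
        starts = [start for start, _ in table]
        return table[bisect_right(starts, node_hash) - 1][1]

    def locate(file_id):
        ring_hash = hash64(file_id)
        ring_to_use = ring_hash % NUM_RINGS
        node_hash = hash64(str(ring_hash // NUM_RINGS))
        return lookup(node_hash, lookup_tables[ring_to_use])

    print(locate("dir1/foo"))

The point is that the per-file state stays constant-size (one hash) and
the metadata grows only with num_rings small tables, not with an
ever-growing set of virtual node IDs.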
> To achieve high availability and durability we replicate files on
> multiple bricks. In our case replication can be implemented as a set
> of operations with the same ring R, so we don't create a separate
> translator for replication.

Yes, we do and we will.  Distribution and replication are both
extremely complex.  Our code for both represents years of accumulated
expertise handling all manner of difficult cases, so for basic
modularity/maintainability reasons they will remain separate.

That said, a more nuanced relationship between distribution sets and
replica sets would be welcome.  If we allowed replication across
arbitrary (and possibly overlapping) sets of R nodes, instead of
statically partitioning the bricks into 1/R subvolumes of DHT, we'd
gain at least the following benefits:

(1) Ability to support multiple replication levels within a single volume.
(2) Smoother addition and removal of bricks.
(3) Better distribution of load from a failed brick.

Unfortunately, this requires that DHT be able to "spawn" AFR
translators dynamically, without them being represented directly in the
volfile.  I've written code to do this, verified that it actually works
(or at least did at that time), and published the results to the
cloudfs-devel mailing list.  It still requires a lot of work to make
sure that option changes and other kinds of reconfiguration are handled
correctly, but long term it's the direction we need to go.

> APPENDIX
>
> ----------------------------------------------------------------------
>
> In 3 distributed hash tables with different hashing techniques
>
> . GlusterFS DHT translator (3.2.5)
> . 64-bit ring with phi based on md5, R=1 (no replication), S=7
> . 64-bit ring with phi based on md5, R=1 (no replication), S=20
>
> we run the same scenario:
>
> 1) Create 100 files ("file00", "file01", ..., "file99") in a volume
>    composed of 9 bricks:
>
>    "host:/root/exp0",
>    "host:/root/exp1",
>    ...
>    "host:/root/exp8".
>
> 2) Add one brick "host:/root/exp9";
>
> 3) re-balance;

These results are barely usable.  When I pushed you to write up and
distribute this proposal instead of just beginning to hack on the code,
I also provided you with scripts that apply different rebalancing
methods and measure the results across different scenarios, generating
highly readable tables that show the data-movement effects.  Why did
you use a weaker methodology that generates less readable results?
(A toy example of the kind of comparison I mean is at the end of this
message.)

Also, please use code from master/3.3 for your testing and development.
I don't think there have been any significant changes in this
particular area, but one should always compare results to the version
one proposes to replace.
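For illustration only -- this is not the scripts referred to above, and
the two placement functions are simplified, hypothetical stand-ins for
real rebalancing methods -- here is a toy sketch of the kind of
measurement I mean: place the same 100 files on 9 bricks and then on
10, once by splitting a single brick's range and once by re-dividing
the whole hash space, then report how many files move and how evenly
they end up.

    # Toy comparison of data movement vs. balance (hypothetical).
    import hashlib

    SPACE = 2 ** 32

    def h32(name):
        return int.from_bytes(hashlib.md5(name.encode()).digest()[:4], "big")

    def layout_equal(nbricks):
        # nbricks equal hash ranges, [start, end) per brick
        step = SPACE // nbricks
        return [(i * step, SPACE if i == nbricks - 1 else (i + 1) * step)
                for i in range(nbricks)]

    def layout_split_last(old):
        # add a brick by splitting only the last brick's range in half
        layout = list(old)
        start, end = layout[-1]
        mid = (start + end) // 2
        layout[-1] = (start, mid)
        layout.append((mid, end))
        return layout

    def owner(name, layout):
        # linear scan; the last range always ends at SPACE, so this finds a brick
        x = h32(name)
        for brick, (start, end) in enumerate(layout):
            if start <= x < end:
                return brick

    files = ["file%02d" % i for i in range(100)]
    old = layout_equal(9)

    for label, new in (("split one brick", layout_split_last(old)),
                       ("full rebalance", layout_equal(10))):
        moved = sum(1 for f in files if owner(f, old) != owner(f, new))
        counts = [sum(1 for f in files if owner(f, new) == b)
                  for b in range(len(new))]
        print("%-16s moved=%3d  per-brick=%s" % (label, moved, counts))

One row per method and per scenario is what makes the trade-off between
data movement and balance visible at a glance.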