On Mon, 21 Nov 2011 21:34:30 -0200 Daniel van Ham Colchete <daniel.colchete@xxxxxxxxx> wrote:

> So, I have a suggestion that fixes this problem. I call it DHT-based
> replication. It is different from DHT-based distribution. I already
> implemented it internally, and it already worked, at least here.

Wonderful. Patches would be most welcome.

> Given the amount of money and energy this idea saves, I think it is
> worth a million bucks (in your hands at least), even though it is
> really simple. I'm giving it to you guys for free. Just please give
> credit if you are going to use it.
>
> It is very simple: hash(name) to locate the primary server (as
> usual), and use another hash, like hash(name + "#copy2"), for the
> second copy and so on. You just have to make sure that it doesn't
> land on the same server, but this is easy: hash(name + "#copy2/try2").

I've discussed this issue in some of my presentations, and there's a whole range of possible approaches:

* Distribute across pairs (the current approach). You did a good job explaining some of the problems here.

* Overlapping pairs around a ring (see the ring sketch at the end of this message). This is possible, and it has the advantage that the load for a failed server is distributed between its two neighbors instead of falling on a single partner (and similarly for N > 2). Unfortunately, it's a bit incompatible with the way DHT currently works, with its per-directory layout maps and its assumption that a hashed file's parent directory will exist on the same brick (essentially meaning directories must exist *everywhere*). I have in the past implemented an alternative DHT translator that uses a Dynamo-style global ring and would be much friendlier to this kind of replication, but it's nowhere near production quality. It also requires a more dynamic translator-graph infrastructure, because it would involve creating AFR translators (or their moral equivalent) that aren't explicit in the volfile, including when bricks are added.

* Iterative hashing (your suggestion; also sketched below). This extends both the advantages and the drawbacks of the previous approach. Because the number of AFR combinations now grows exponentially, it also requires that the translator-graph infrastructure be much more scalable as well as more dynamic. BTW, Kademlia's XOR-based consistent hashing offers similar characteristics to iterative ring-based hashing, and might be preferable for reasons too academic to describe here.

As I'm sure you can see, a few technical issues remain before a simple idea can be turned into a workable reality. One compromise I've considered is to assign multiple layout maps per directory, hashing a file first to a map and then to a position within that map. This would provide practically the same load-spreading advantage as the other approaches, with only a slight change to existing (and working) code.
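To make these alternatives concrete, here are a few rough sketches in Python. None of this is actual GlusterFS code: the hashes are stand-ins, and placement is simple ring/modulo arithmetic rather than DHT's real range-based layouts. First, overlapping replica sets around a ring:

    import bisect
    import hashlib

    def ring_hash(key):
        # Stand-in hash; a real ring would use the DHT's own function,
        # usually with several virtual points per server.
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def place_on_ring(name, servers, copies=2):
        # Sort servers into a ring by their hashed positions.
        ring = sorted((ring_hash(s), s) for s in servers)
        points = [point for point, _ in ring]
        # Primary is the first server at or after hash(name), wrapping
        # around the end of the ring.
        i = bisect.bisect_left(points, ring_hash(name)) % len(ring)
        # Replicas are the next servers clockwise, so a failed server's
        # load is shared by its neighbors instead of one fixed partner.
        return [ring[(i + k) % len(ring)][1] for k in range(copies)]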
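Next, the iterative-hashing scheme from the quoted suggestion, in the same style (this assumes at least as many servers as copies, or the retry loop would never terminate):

    import hashlib

    def dht_hash(name):
        # Stand-in for the real DHT hash function.
        return int(hashlib.md5(name.encode()).hexdigest(), 16)

    def place_iterative(name, servers, copies=2):
        # First copy: hash(name), as DHT does today.
        chosen = [servers[dht_hash(name) % len(servers)]]
        for copy in range(2, copies + 1):
            key = "%s#copy%d" % (name, copy)
            candidate = servers[dht_hash(key) % len(servers)]
            tries = 1
            while candidate in chosen:
                # Collision with an earlier copy: rehash with a retry
                # suffix, e.g. hash(name + "#copy2/try2").
                tries += 1
                candidate = servers[dht_hash("%s/try%d" % (key, tries))
                                    % len(servers)]
            chosen.append(candidate)
        return chosen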
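For comparison, the Kademlia-style variant mentioned above replaces the "next clockwise" rule with XOR distance. This shows only the basic mechanism, not any particular design:

    import hashlib

    def node_id(key):
        # Stand-in; Kademlia derives 160-bit IDs with SHA-1.
        return int(hashlib.sha1(key.encode()).hexdigest(), 16)

    def place_by_xor(name, servers, copies=2):
        # The replica set is the `copies` servers whose IDs are
        # closest to hash(name) under the XOR metric.
        target = node_id(name)
        return sorted(servers, key=lambda s: node_id(s) ^ target)[:copies]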
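And finally the multiple-layout-maps compromise. The brick names and the two maps here are made up for illustration; the point is the two-level hash, once to choose a map and once to choose a replica set within it:

    import hashlib

    def h(key):
        # Stand-in hash, as in the sketches above.
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def place_two_level(name, layout_maps):
        # Hash once to pick a layout map, then again to pick a
        # replica set (e.g. an AFR pair) within that map.
        chosen_map = layout_maps[h(name + "#map") % len(layout_maps)]
        return chosen_map[h(name) % len(chosen_map)]

    # Hypothetical example: two maps that pair four bricks differently,
    # so a failed brick's load is shared by two partners instead of one.
    maps = [
        [("brick1", "brick2"), ("brick3", "brick4")],
        [("brick1", "brick3"), ("brick2", "brick4")],
    ]
    print(place_two_level("somefile", maps))

With more maps per directory, a failed brick's load spreads across correspondingly more partners, which is where the load-spreading advantage comes from.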