DHT-based Replication

Daniel van Ham Colchete <daniel.colchete@xxxxxxxxx> · Mon, 21 Nov 2011 21:34:30 -0200

Good day everyone!
It is a pleasure for me to write to this list again, after so many years. I tested GlusterFS a lot about four years ago because of a project I still have today. It was 2007 and I worked with Gluster 1.2 and 1.3. I like to think that I gave one or two good ideas that were used in the project. Amar was a great friend back then (friend is the correct word as he was on my Orkut's friends list). Unfortunately Gluster's performance wasn't good enough, mostly because of the ASCII protocol and I had to go to the HA-NFS way. 

By that time, a new clustering translator was created, it was called DHT. In those four years that passed I studied DHT and even implemented a DHT-based internal system here at my company to create a PostgreSQL cluster (I didn't do any reconciliation thing though, that is the hard part for me). It worked fine, it still does!

What is the beauty of DHT? It is easy to find where a file or a directory is in the cluster. It is just a hash operation! You just create a cluster of N servers and everything should be distributed evenly through them. Every player out there got the distribution part of storage clusters right: GlusterFS, Cassandra, MongoDB, etc.

What is not right in my opinion is how everybody does replication. Everything works in pairs (or triples, or replica sets of any number). The problem with pairs (and replica sets) is that in a 10 server cluster, if one fails only its pair will have to handle the double of the usual load, while the other ones will be working with the usual load. So we can assign at most 50% of its capacity to any storage server. Only 50%. 50% efficiency means we spend the double in hardware, rack space and energy (the worst part). Even with RAID10 storage, the ones we use, have this problem too. If a disk dies, one will get double read IOs while the other ones are 50% idle (or should be).

So, I have a suggestion that fixes this problem. I call it DHT-based replication. It is different from DHT-based distribution. I already implemented it internally, it already worked, at least here. Giving the amount of money and energy this idea saves, I think this idea is worth a million bucks (on your hands at least). Even though it is really simple. I'm giving it to you guys for free. Just please give credit if you are going to use it. 

It is very simple: hash(name) to locate the primary server (as usual), and use another hash, like hash(name + "#copy2") for the second copy and so on. You just have to certify that it doesn't fall into the same server, but this is easy: hash(name + "#copy2/try2").

So, say we have 11 storage servers. Yeah, now we can have a prime number of servers and everything will still work fine. Imagine that storage server #11 dies. The secondary copy of all files at #11 are spread across all the other 10 servers, so those servers are now only getting a load 10% bigger! Wow! Now I can use 90% of what my storages can handle and still be up when one fails! Had I done this the way things are today, my system would be down because there is no way a storage can handle 180% of IO its capacity (by definition). So now I have 90% efficiency! For exactly the same costs to have 50k users I can have 90k users now! That is a 45% costs savings on storage, with just a simple algorithm change.

So, what are your thoughts on the idea?

Thank you very much!

Best regards,
Daniel Colchete