Hello,

I think what you're describing is functionally equivalent to the Google File System:

  http://labs.google.com/papers/gfs.html
  http://en.wikipedia.org/wiki/Google_File_System

And, by comparable design, the Hadoop File System:

  http://hadoop.apache.org/common/docs/r0.20.0/hdfs_design.html

Google/Hadoop FS framed a set of principles that I (and a friend of mine, Michael Welton) were hoping to achieve by moving to GlusterFS - and I think it will still get us all the way there once we get IPSEC underneath it and it deals with WANs better. In that context, I'm a fan already :)

However, I would like to add that if you're going to go to the trouble of providing a hashed-block abstraction layer, please make it pluggable. For instance, before I was taken ill in September 2009 I was looking at implementing a de-dupe block for GlusterFS. Another friend (Chris Kelada) and I came up with a full de-dupe model one evening which I thought was quite simple. What stalled me before I fell ill was an issue with descriptors that may have been fixed during the BSD porting activity that I saw on this list a week or two ago.

De-dupe needs a hash for a chunk of data. What we proposed was an arbitrarily sized chunk (say 1 Mbyte) hashed, then stored in the underlying layer via its hash - i.e. if my block hashed to "thisismyhash" then it would be stored in /.dedupe/thi/sis/myh/ash/data.bin, with provision for an index to allow multiple blocks of the same hash to co-exist at the same location (i.e. either n x chunk_size offsets into the data file, or n data files). This would carry a high read/write performance penalty (many I/O operations in the underlying layer per operation in the de-dupe layer) but it reaps the obvious space benefit.

Thus if a data-hashing block were to be forged, its re-application could be quite broad - but only if considered beforehand; then I'd be all for it.

Cheers,

----- Original Message -----
> From: "Daniel van Ham Colchete" <daniel.colchete@xxxxxxxxx>
> To: "glusterfs" <gluster-devel@xxxxxxxxxx>
> Subject: DHT-based Replication
> Date: Mon, 21 Nov 2011 21:34:30 -0200
>
> Good day everyone!
>
> It is a pleasure for me to write to this list again, after so many years. I tested GlusterFS a lot about four years ago because of a project I still have today. It was 2007 and I worked with Gluster 1.2 and 1.3. I like to think that I gave one or two good ideas that were used in the project. Amar was a great friend back then (friend is the correct word, as he was on my Orkut friends list). Unfortunately Gluster's performance wasn't good enough, mostly because of the ASCII protocol, and I had to go the HA-NFS way.
>
> By that time, a new clustering translator was created; it was called DHT. In the four years that have passed I have studied DHT and even implemented a DHT-based internal system here at my company to create a PostgreSQL cluster (I didn't do any of the reconciliation work, though - that is the hard part for me). It worked fine, and it still does!
>
> What is the beauty of DHT? It is easy to find where a file or a directory is in the cluster. It is just a hash operation! You just create a cluster of N servers and everything should be distributed evenly across them. Every player out there got the distribution part of storage clusters right: GlusterFS, Cassandra, MongoDB, etc.
>
> What is not right, in my opinion, is how everybody does replication. Everything works in pairs (or triples, or replica sets of any number).
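
For concreteness, the usual arrangement being described looks something like the sketch below - Python-style, with a stand-in hash, an illustrative ten-server cluster and an example file name; it is not GlusterFS's actual layout logic, just the shape of the idea:

    import hashlib

    SERVERS = ["server%02d" % i for i in range(1, 11)]   # illustrative 10-server cluster

    def locate(name):
        """'Just a hash operation': hash the file name to pick its primary server."""
        digest = hashlib.sha1(name.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(SERVERS)

    def paired_replica(name):
        """Conventional replication: the copy always lands on the primary's fixed partner."""
        primary = locate(name)
        return primary, (primary + 1) % len(SERVERS)

    # Every file whose primary is server 3 is always copied to server 4, so if
    # server 3 dies, server 4 alone absorbs all of its extra load.
    print(paired_replica("/exports/home/alice/report.odt"))
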
> The problem with pairs (and replica sets) is that in a 10-server cluster, if one server fails, only its pair has to handle double the usual load, while the other ones keep working at the usual load. So we can assign at most 50% of its capacity to any storage server. Only 50%. 50% efficiency means we spend double on hardware, rack space and energy (the worst part). Even RAID10 storage, the kind we use, has this problem too: if a disk dies, one disk will get double the read IOs while the other ones are 50% idle (or should be).
>
> So, I have a suggestion that fixes this problem. I call it DHT-based replication. It is different from DHT-based distribution. I have already implemented it internally, and it worked - at least here. Given the amount of money and energy this idea saves, I think it is worth a million bucks (in your hands at least), even though it is really simple. I'm giving it to you guys for free - just please give credit if you are going to use it.
>
> It is very simple: use hash(name) to locate the primary server (as usual), and use another hash, like hash(name + "#copy2"), for the second copy, and so on. You just have to make sure that it doesn't fall onto the same server, but this is easy: hash(name + "#copy2/try2").
>
> So, say we have 11 storage servers. Yeah, now we can have a prime number of servers and everything will still work fine. Imagine that storage server #11 dies. The secondary copies of all files at #11 are spread across all the other 10 servers, so those servers are now only getting a load 10% bigger! Wow! Now I can use 90% of what my storage servers can handle and still be up when one fails! Had I done this the way things are today, my system would be down, because there is no way a storage server can handle 180% of its IO capacity (by definition). So now I have 90% efficiency! For exactly the same cost of serving 50k users I can now serve 90k users! That is a 45% cost saving on storage, with just a simple algorithm change.
>
> So, what are your thoughts on the idea?
>
> Thank you very much!
>
> Best regards,
> Daniel Colchete
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel@xxxxxxxxxx
> https://lists.nongnu.org/mailman/listinfo/gluster-devel
>

--
Ian Latter
Late night coder ..
http://midnightcode.org/
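
For anyone who wants to experiment with the proposal, here is a rough sketch of the placement rule Daniel describes above - again Python-style with a stand-in hash; the "#copy2" and "/try2" suffixes come straight from his description, while the server count and the file name are only illustrative:

    import hashlib

    SERVERS = ["server%02d" % i for i in range(1, 12)]   # 11 servers - a prime count works fine

    def h(key):
        """Stand-in hash: map a string key to a server index."""
        return int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16) % len(SERVERS)

    def replicas(name, copies=2):
        """hash(name) picks the primary; hash(name + "#copyN") picks each further copy,
        retrying with "/tryM" suffixes until the copy lands on a server not already used."""
        placement = [h(name)]
        for c in range(2, copies + 1):
            key = name + "#copy%d" % c
            candidate = h(key)
            attempt = 2
            while candidate in placement:
                candidate = h(key + "/try%d" % attempt)
                attempt += 1
            placement.append(candidate)
        return placement

    # Because the copy's location depends on the file name rather than on the primary,
    # the secondary copies of the files held by any one server end up spread across all
    # the other servers: lose 1 server of 11 and each survivor picks up roughly 10%
    # extra load instead of one unlucky pair member picking up 100%.
    print(replicas("/exports/home/alice/report.odt"))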