You see, there isn't much reason why we shouldn't do it! The Google folks and the Hadoop folks already have this kind of efficiency and we don't! With 100 storage servers you can have N+5 redundancy and still be 94.7% IO efficient, considering the big clusters out there.
Having that in GlusterFS would mean a stable, high-performance filesystem becoming much more efficient.
Best!
On Tue, Nov 22, 2011 at 4:25 AM, Ian Latter <ian.latter@xxxxxxxxxxxxxxxx> wrote:
Hello,
I think what you're describing is functionally equivalent
to the Google File System:
http://labs.google.com/papers/gfs.html
http://en.wikipedia.org/wiki/Google_File_System
And, by comparable design, the Hadoop File System:
http://hadoop.apache.org/common/docs/r0.20.0/hdfs_design.html
Google/Hadoop FS framed a set of principles that I
(and a friend of mine - Michael Welton) were hoping to
achieve by moving to GlusterFS - and I think it will still
get us all the way there once we get IPSEC underneath
it and it deals with WANs better.
In that context, I'm a fan already :)
However, I would like to add that if you're going to go
to the trouble of providing a hashed-block abstraction
layer, you should make it pluggable. For example, before
I was taken ill in September 2009 I was looking at
implementing a de-dupe block for GlusterFS. Another friend
(Chris Kelada) and I came up with a full de-dupe model one
evening which I thought was quite simple. What stalled
me before I fell ill was an issue with descriptors that may
have been fixed during the BSD porting activity that I saw
on this list a week or two ago.
De-dupe needs a hash for each chunk of data. What we
proposed was an arbitrarily sized chunk (say 1 MByte)
hashed, then stored in the underlying layer via its hash -
i.e. if my block hashed to "thisismyhash" then it would be
stored in /.dedupe/thi/sis/myh/ash/data.bin, with provision
for an index to allow multiple blocks with the same hash to
co-exist at the same location (i.e. either at n x chunk_size
offsets into the data file, or as n data files).
This would have a high read/write performance penalty
(many IOs in the underlying layer per IO in the de-dupe
layer), but it reaps the obvious space benefit.
Thus if a data-hashing block were to be built, it could be
re-applied quite broadly - but only if that is considered
beforehand; in that case I'd be all for it.
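
A minimal sketch of that hashed-path layout, in Python (the helper
names and the choice of SHA-1 are assumptions for illustration,
not GlusterFS translator code):

    import hashlib
    import os

    CHUNK_SIZE = 1024 * 1024        # say, 1 MByte per chunk (callers split files)
    DEDUPE_ROOT = "/.dedupe"        # hypothetical backing directory on the brick

    def chunk_path(chunk: bytes) -> str:
        """Fan the chunk's hash out into subdirectories,
        e.g. "thisismyhash" -> /.dedupe/thi/sis/myh/ash/data.bin."""
        digest = hashlib.sha1(chunk).hexdigest()
        fanout = [digest[i:i + 3] for i in range(0, 12, 3)]
        return os.path.join(DEDUPE_ROOT, *fanout, "data.bin")

    def store_chunk(chunk: bytes) -> str:
        """Write the chunk under its hash; identical chunks land on the
        same path, which is where the space saving comes from."""
        path = chunk_path(chunk)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        if not os.path.exists(path):     # a real layer would also keep an
            with open(path, "wb") as f:  # index to handle hash collisions
                f.write(chunk)
        return path

Each store costs several metadata operations in the underlying
layer, which is the IO penalty mentioned above.
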
Cheers,
> _______________________________________________
> To: "glusterfs" <gluster-devel@xxxxxxxxxx>
>Subject: DHT-based Replication
>Date: Mon, 21 Nov 2011 21:34:30 -0200
>
> Good day everyone!
>
> It is a pleasure for me to write to this list again, after
so many years. I
> tested GlusterFS a lot about four years ago because of a
project I still
> have today. It was 2007 and I worked with Gluster 1.2 and
1.3. I like to
> think that I gave one or two good ideas that were used in
the project. Amar
> was a great friend back then (friend is the correct word
as he was on my
> Orkut's friends list). Unfortunately Gluster's performance
wasn't good
> enough, mostly because of the ASCII protocol and I had to
go to the HA-NFS
> way.
>
> By that time, a new clustering translator had been created; it
> was called DHT. In the four years that have passed I studied DHT
> and even implemented a DHT-based internal system here at my
> company to create a PostgreSQL cluster (I didn't do any
> reconciliation, though; that is the hard part for me). It worked
> fine, and it still does!
>
> What is the beauty of DHT? It is easy to find where a file or a
> directory is in the cluster. It is just a hash operation! You just
> create a cluster of N servers and everything should be distributed
> evenly through them. Every player out there got the distribution
> part of storage clusters right: GlusterFS, Cassandra, MongoDB, etc.
>
> What is not right, in my opinion, is how everybody does
> replication. Everything works in pairs (or triples, or replica
> sets of any size). The problem with pairs (and replica sets) is
> that in a 10-server cluster, if one server fails, only its pair
> has to handle double the usual load, while the others keep
> working at the usual load. So we can assign at most 50% of its
> capacity to any storage server. Only 50%. 50% efficiency means we
> spend double on hardware, rack space and energy (the worst part).
> Even RAID10 storage, which is what we use, has this problem: if a
> disk dies, one disk will get double the read IOs while the others
> are 50% idle (or should be).
>
> So, I have a suggestion that fixes this problem. I call it
> DHT-based replication; it is different from DHT-based
> distribution. I have already implemented it internally and it
> worked, at least here. Given the amount of money and energy this
> idea saves, I think it is worth a million bucks (in your hands, at
> least), even though it is really simple. I'm giving it to you guys
> for free. Just please give credit if you are going to use it.
>
> It is very simple: use hash(name) to locate the primary server
> (as usual), and use another hash, like hash(name + "#copy2"), for
> the second copy, and so on. You just have to make sure that it
> doesn't fall onto the same server, but this is easy: hash(name +
> "#copy2/try2").
>
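
A minimal sketch of that placement rule, in Python (the function
names are hypothetical and the hash is a stand-in; GlusterFS's DHT
uses its own hash function):

    import hashlib

    def server_for(key: str, n_servers: int) -> int:
        """Stand-in DHT placement: hash the name and reduce it
        modulo the number of servers."""
        digest = hashlib.md5(key.encode()).hexdigest()
        return int(digest, 16) % n_servers

    def replica_servers(name: str, n_servers: int, copies: int = 2) -> list:
        """Primary from hash(name); copy k from hash(name + "#copyK"),
        re-hashed with "/tryT" until it lands on an unused server."""
        assert copies <= n_servers
        placement = [server_for(name, n_servers)]
        for k in range(2, copies + 1):
            key, attempt = f"{name}#copy{k}", 1
            while server_for(key, n_servers) in placement:
                attempt += 1
                key = f"{name}#copy{k}/try{attempt}"
            placement.append(server_for(key, n_servers))
        return placement

    # e.g. replica_servers("/videos/cat.mpg", 11) -> two distinct servers
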
> So, say we have 11 storage servers. Yeah, now we can have a prime
> number of servers and everything will still work fine. Imagine
> that storage server #11 dies. The secondary copies of all the
> files on #11 are spread across the other 10 servers, so those
> servers now only get a load 10% bigger! Wow! Now I can use 90% of
> what my storage servers can handle and still be up when one
> fails! Had I built this the way things are done today, my system
> would be down, because there is no way a server can handle 180%
> of its IO capacity (by definition). So now I have 90% efficiency!
> For exactly the same cost it takes to serve 50k users I can now
> serve 90k users! That is a 45% cost saving on storage, with just
> a simple algorithm change.
>
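
To make the arithmetic above explicit, a back-of-the-envelope check
(Python, assuming load is spread uniformly):

    def busiest_survivor_load(n_servers: int, paired: bool) -> float:
        """Load on the busiest surviving server, relative to its
        normal load, when exactly one server dies."""
        if paired:
            return 2.0                       # the partner absorbs it all
        return 1.0 + 1.0 / (n_servers - 1)   # spread over all survivors

    print(busiest_survivor_load(11, paired=True))   # 2.0 -> cap usage at 50%
    print(busiest_survivor_load(11, paired=False))  # 1.1 -> cap usage at ~90%

Running each server at 90% instead of 50% of capacity means the
same hardware serves 90/50 = 1.8x the users, i.e. roughly a 45%
lower storage cost per user.
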
> So, what are your thoughts on the idea?
>
> Thank you very much!
>
> Best regards,
> Daniel Colchete
>
--
Ian Latter
Late night coder ..
http://midnightcode.org/