On 09/08/2009 04:14 AM, Daniel Maher wrote:
> Alan Ivey wrote:
>> Like the subject implies, how does replication work exactly?
>>
>> If a client is the only one that has the IP addresses defined for the
>> servers, does that mean that only a client writing a file ensures
>> that it goes to both servers? That would tell me that the servers
>> don't directly communicate with each other for replication.
>>
>> If so, how does healing work? Since the client is the only
>> configuration with the multiple server IP addresses, is it the
>> client's "task" to make sure the server heals itself once it's back
>> online?
>>
>> If not, how do the servers know each other exist if not for the
>> client config file?
>
> You've answered your own question. :) AFAIK, in the recommended
> simple replication scenario, the client is actually responsible for
> replication, as each server is functionally independent.
> (This seems crazy to me, but yes, that's how it works.)

For Alan: active healing should only be necessary if the system is not working properly. Healing should only be required after a system crash or bug, a GlusterFS server or client crash or bug, or somebody messing around with the backing store file system underneath. For systems that are up and running without problems, healing should be completely unnecessary.

For Daniel: as for the "seems crazy" - compared to what? Every time I look at other solutions such as Lustre and see how they rely on a single metadata server, which is itself supposed to be made highly available by other means, I have to ask: are they really solving the high availability problem, or are they just narrowing its scope? If the whole cluster of 2 to 1000 nodes relies on a single server being up, that server is the weakest link. Sure, having one weakest link to deal with is easier to solve using traditional means than having 1000 weakest links, but it seems clear that Lustre has not SOLVED the problem. They've just reduced it to something that might be more manageable.

Even the "traditional means" of shared disk storage such as GFS and OCFS rely on a single piece of hardware - the shared storage. As a result, they make the shared storage really expensive - dual interfaces, dual power supplies, dual disks, ... - but it's still one piece of hardware that everything else relies on.

For "shared nothing", each node really does need to be fully independent and able to make its own decisions. I think the GlusterFS folk have the model right in this regard. The remaining question is whether they have the *implementation* right. :-)

Right now they seem to be in a compromised position between simplicity, performance, and correctness. It seems it is a difficult problem to have all three no matter which model is selected (shared disk, shared metadata only, shared nothing). Self-healing is a good feature, but they seem to be leaning on it to provide correctness, so that they can provide performance with some amount of simplicity. An example here is how directory listings come from "the first up server". In theory, we could have correctness through self-healing if directory listings always queried all servers: the combined directory listing would be shown, and self-healing would kick off in the background. But this would cost performance, as every server in the cluster would be involved in every directory listing. This is just one example. I think GlusterFS has a lot of potential to close holes such as these.
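To make that concrete, here is a rough sketch in Python of the "query all servers" idea. This is not how GlusterFS itself is implemented (the real replicate translator is C code inside the client); the Replica class and schedule_self_heal() below are made-up names, purely for illustration:

    # Sketch only: list a directory on every up replica, show the union
    # to the caller, and queue a background self-heal for any replica
    # that is missing entries.  Not GlusterFS code.

    from concurrent.futures import ThreadPoolExecutor

    class Replica:
        def __init__(self, name, entries):
            self.name = name
            self._entries = entries            # stand-in for the backing store

        def listdir(self, path):
            return set(self._entries.get(path, set()))

    def schedule_self_heal(replica, path, missing):
        # A real system would copy the missing entries over from a good
        # replica; here we only record the intent.
        print("self-heal queued on %s: %s missing %s"
              % (replica.name, path, sorted(missing)))

    def merged_listdir(replicas, path):
        # Ask every replica in parallel instead of only "the first up server".
        with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
            listings = list(pool.map(lambda r: r.listdir(path), replicas))

        union = set().union(*listings)

        # Any replica whose listing falls short of the union is stale;
        # kick off healing in the background and return the union now.
        for replica, listing in zip(replicas, listings):
            missing = union - listing
            if missing:
                schedule_self_heal(replica, path, missing)
        return sorted(union)

    if __name__ == "__main__":
        a = Replica("server-a", {"/data": {"x.txt", "y.txt"}})
        b = Replica("server-b", {"/data": {"x.txt"}})   # stale replica
        print(merged_listdir([a, b], "/data"))

The cost is visible right in the sketch: every listing fans out to every replica, which is exactly the performance hit I described above.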
I don't think it would be difficult to add in things like an automatic election model for defining which machines are considered stable and the safest masters to use (the simplest might be "the one with the highest glusterfsd uptime"?), having clients pull things like directory listings only from the first stable / safest master, and having the non-stable / non-safe machines go into automatic full self-heal until they are back up-to-date with the master. In such a model, I'd like to see the locks being held against the stable/safe masters used for reads. Just throwing stuff out there...

For me, I'm looking at this as: I have a problem to solve, and very few solutions seem to meet my requirements. GlusterFS looks very close. Do I write my own, which would probably start out only solving my requirements - and since my requirements will probably grow, this would mean eventually writing something the size of GlusterFS? Or do I start looking into this GlusterFS thing, point out the problems, and see if I can help? I'm leaning towards the latter: try it out, point out the problems, see if I can help.

As it is, I think GlusterFS is very stable, with sufficient performance for the requirements of most potential users. It's the people who are really trying to push it to its limits who are causing the majority of the breakage being reported here. For these people, which includes me, I've looked around, and the solutions out there that are competitive are either very expensive or insufficient.

Cheers,
mark

-- 
Mark Mielke <mark at mielke.cc>