Anand Avati wrote:
>> Here is another scenario :
>> (..)
>
> if you observe the spec files a bit carefully, you will observe that
> the servers *never* communicate with each other. there is NO
> connection between server1 and server2. the file replication feature is on
> the client side (the afr translator is loaded in the client spec file).
> the client itself writes to both server1 and server2 simultaneously.
>

My bad! Sorry! Then, that could explain this:

On the glusterfs volume:

# wget http://imgsrc.hubblesite.org/hu/db/2007/16/images/a/formats/full_jpg.jpg
100% 209,780,268 11:32:12 (130.69 KB/s) - `full_jpg.jpg' saved

On the local disk:

# wget http://imgsrc.hubblesite.org/hu/db/2007/16/images/a/formats/full_jpg.jpg
100% 209,780,268 11:59:30 (4.05 MB/s) - `full_jpg.jpg' saved

(It's a Hubble JPEG image, 29566 x 14321 pixels; have a look ;))

The volume is configured as follows (where X is '1' or '2'):

Servers (192.168.28.5 and 192.168.28.6):

# cat /etc/glusterfs/glusterfs-serverX.vol
volume brickX
  type storage/posix
  option directory /export
end-volume

volume trace
  type debug/trace
  subvolumes brickX
  option debug on
end-volume

volume serverX
  type protocol/server
  option transport-type tcp/server
  subvolumes brickX
  option auth.ip.brickX.allow 192.168.28.7
end-volume

Client (192.168.28.7):

# cat /etc/glusterfs/glusterfs-client.vol
volume client1
  type protocol/client
  option transport-type tcp/client
  option remote-host 192.168.28.5
  option remote-subvolume brick1
end-volume

volume client2
  type protocol/client
  option transport-type tcp/client
  option remote-host 192.168.28.6
  option remote-subvolume brick2
end-volume

volume afr
  type cluster/afr
  subvolumes client1 client2
  option replicate *:2
end-volume

volume trace
  type debug/trace
  subvolumes afr
  option debug on
end-volume

All machines are on the same 100 Mbit/s network, so it should not be a network issue. They are all PIV HT 3 GHz, 1 GB RAM, 250 GB SATA disk, ext3 volumes.

The question is: will the transfer speed decrease as the number of replicas increases (let's say X bricks and a *:X replicate rule)? (I come back to this with a concrete sketch further down.)

>
>
>> And last but not least : let's now say that Client1 and Client2 run the
>> same service (= access the same data). What would happen ? (Isn't that
>> what you've called "split brain" ?)
>
> two clients accessing the same data at the same time is perfectly
> safe. I do not see any problem here. or probably i did not understand
> your question correctly.
>

We're in AFR, so let me give an example to explain it clearly.

I have a volume with the text file "say_hello.txt". In it, you just have the line:

"Your administrator says hello"

Now, client1 and client2 open the file. A network failure occurs: client1 can only see server1 and client2 can only see server2 (easily possible depending on your network architecture).

client1 quickly adds a "hello!" line, saves and closes the file. client2 takes his time: he writes "Thanks dear administrator, I'm client2 and I say hello to everyone who reads this file", then saves and closes it. The network comes up again.

That was my question's scenario.

In the same way, take the same scenario but without the network failure. What will happen? (Data was modified before another client committed its modifications.) Is that case left to the underlying FS (in "my" case, ext3), or will it be taken care of by some lock mechanism?

"2 clients" is just for these two examples, but what about the same cases with n clients?
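Coming back to the replica question: to make it concrete, I imagine that adding a third brick would mean a client spec roughly like the sketch below. The third server address (192.168.28.8) and its brick3 are hypothetical, and I have simply extended the pattern of the config above to *:3, so please correct me if that is not how it is meant to be written:

volume client3
  type protocol/client
  option transport-type tcp/client
  # hypothetical third server
  option remote-host 192.168.28.8
  option remote-subvolume brick3
end-volume

volume afr
  type cluster/afr
  subvolumes client1 client2 client3
  # assumption on my part: extend the *:2 rule above to three copies
  option replicate *:3
end-volume

If the client really pushes every copy itself over its single 100 Mbit/s link, my back-of-envelope guess is that the write ceiling is roughly 12.5/X MB/s for X replicas (100 Mbit/s is about 12.5 MB/s, split between X simultaneous streams), so the more replicas, the slower each write. Am I reasoning correctly?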
One typical (disastrous) scenario, which merges the two above, is when an evil one gains access to one brick, disconnects it from the net (cable unplug, or just a service stop), modifies some data (injecting wrong data into some files, for instance) and reconnects the brick. The fsck mechanism will see that some files have been modified later than the copies stored on the other bricks (and, why not, currently accessed by clients) and will try to commit the latest version (but still the wrong one) to the other bricks. Am I wrong?

I've fully understood the "power" of clients, and that's why I'm so paranoid about them. Since only they have the cluster's "full view" (knowing where every brick is, from their config) but seem to believe they are "alone" with it (that's how I understood the system), I'm really concerned about how they manage to work without disturbing one another.

Will there be a way to know whether bricks are "synchronized" (the same data replicated everywhere), which one is not and how severely, etc.? (Maybe this will be included in the "server notification framework" translator?)

>> I have another scenario, but I think it's enough for now, don't you ?
>
> more feedback is always appreciated, please shoot.
>

Well, it concerns the clients again. Let's take back the scheme from the first post:

            Server1
           /       \
Client1 ---         --- Client2
           \       /
            Server2

Client1 is afr, client2 is unify. They share the same directories, the same files. Won't there be a problem? For the time being, I'm confident that every file created by client2 will not be seen by client1, because it will not be replicated. I'm aware that client2 will see every file Client1 has ever accessed twice. (A rough sketch of what I mean by client2's spec is in the PS below.)

And finally, here is a question I asked on IRC; I will try to develop it: "In AFR mode, let's say that there is a client on every brick. Will the AFR translator make the clients write 'locally' and then replicate, or will there be only one 'write' node which replicates to the others?"

The replication is parallel: it writes at the same time. Remember the write performance I pasted at the beginning, then. This would mean that a client writing something to the volume will see its write slow down even if one of the bricks is on the same machine as it is. Am I correct?

You asked to shoot... ;) More to come, I'm afraid.

Enkahel

Sebastien LELIEVRE
slelievre@xxxxxxxxxxxxxxxx        Services to ISP
TBS-internet                      http://www.TBS-internet.com/
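PS: about the afr/unify mix above, here is roughly what I picture client2's spec file to be, just so we are talking about the same setup. This is a sketch I have not tested, and I am not sure of the exact options cluster/unify expects in the current release, so take the scheduler line as a guess on my part:

# /etc/glusterfs/glusterfs-client2.vol -- sketch, untested
volume remote1
  type protocol/client
  option transport-type tcp/client
  option remote-host 192.168.28.5
  option remote-subvolume brick1
end-volume

volume remote2
  type protocol/client
  option transport-type tcp/client
  option remote-host 192.168.28.6
  option remote-subvolume brick2
end-volume

volume unify
  type cluster/unify
  subvolumes remote1 remote2
  # guess: round-robin scheduler; the exact scheduler option may differ in your release
  option scheduler rr
end-volume

The point being that client2 would treat brick1 and brick2 as two separate pools to spread files across, while client1 treats them as mirrors of each other.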