Gareth: Moving to the other end of the scale, AFR can't cope with large files either .. handling of sparse files doesn't work properly and self-heal has no concept of repairing part of a file .. so sticking a 20Gb file on a GlusterFS is just asking for trouble, as every time you restart a gluster server (or every time one crashes) it'll crucify your network.

Gordan: I thought about this, and there isn't really a way to do anything about it unless you relax the constraints. You could do an rsync-type rolling-checksum block sync, but this would both take up more CPU time and leave theoretical scope for the file not being the same on both ends. Whether this minute possibility of corruption that the hashing algorithm doesn't pick up is a reasonable trade-off, I don't know. Perhaps if such a thing were implemented it should be made optional.

Krishna: We have plans to provide rsync-type sync in AFR in future, giving it as an option as Gordan suggested.

Gareth: a. With regards to metadata, given two volumes mirrored via AFR, please can you explain to me why it's OK to do a data read operation against one node only, but not a metadata read operation .. and what would break if you read metadata from only one volume?

Gordan: The fact that the file may have been deleted or modified when you try to open it. A file's content is a feature of the file. Whether the file is there and/or up to date is a feature of the metadata of the file and its parent directory. If you start loosening this, you might as well disconnect the nodes, run them in a deliberate split-brain state and resync periodically, with all the conflict and data loss that entails.

Krishna: When we do a data read op we are already sure that the file is in sync, but when we do metadata reads (using lookup) we are not sure they are in sync. Gordan is right here. Self-healing on the fly is very much dependent on lookup(), so it is inevitable to do lookup() on all the subvolumes.
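Krishna's point — that reading file data from one replica is safe only because a prior lookup() on every subvolume has established which replicas are fresh — can be illustrated with a toy model. This is a hedged sketch in Python, not GlusterFS source; the "version" counter and the brick names are invented for the example:

```python
# Toy model of why AFR must lookup() on every subvolume before trusting
# metadata. Illustrative only -- not GlusterFS code; all names are invented.

def lookup_all(subvols, path):
    """Ask every subvolume for its view of a file's metadata.

    Each subvolume answers None (file missing) or a dict carrying a
    'version' counter that replication bumps on every successful write.
    """
    return {name: vol.get(path) for name, vol in subvols.items()}

def pick_read_source(views):
    """Pick the freshest replica; flag missing/stale ones for self-heal."""
    present = {n: v for n, v in views.items() if v is not None}
    if not present:
        return None, list(views)  # file exists on no subvolume
    newest = max(v["version"] for v in present.values())
    fresh = [n for n, v in present.items() if v["version"] == newest]
    stale = [n for n, v in views.items()
             if v is None or v["version"] != newest]
    return fresh[0], stale

# Two replicas: brick2 missed one write while it was restarting.
subvols = {
    "brick1": {"/data/file": {"version": 3}},
    "brick2": {"/data/file": {"version": 2}},
}
source, to_heal = pick_read_source(lookup_all(subvols, "/data/file"))
print(source, to_heal)  # -> brick1 ['brick2']
```

Had the client asked only brick2, it would have served stale data with no way to know; asking every subvolume is what makes the single-replica data read afterwards safe, and it is also what drives on-the-fly self-heal of the stale copy.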
Also, we use the results of the lookup() call for subsequent operations on that file/directory. But it is not a bad idea to compromise consistency for speed (with a read-subvolume option), as some users might prefer that. We can provide this as an option and let admins handle the inconsistencies that would arise from this compromise. We shall keep this on the TODO list.

Gareth: b. Looking back through the list, Gluster's non-caching mechanism for acquiring file-system information seems to be at the root of many of its performance issues. Is there no mileage in trying to address this issue?

Krishna: Have you tried the -a and -e options on the client side?

Gareth: c. If I stop one of my two servers, AFR suddenly speeds up "a lot"! Would it be so bad if there were an additional option "subvolume-read-meta"? This would probably involve only a handful of additional lines of code, if that .. ?

Krishna: This is related to issue "a" which we have discussed. Also, some users have succeeded in booting off glusterfs (this was discussed in one of the mails).

Regards,
Krishna

On Sat, Apr 26, 2008 at 3:51 AM, Gareth Bult <gareth@xxxxxxxxxxxxx> wrote:
> >You're expecting a bit much here - for any shared/clustered FS. DRBD
> >might come close if your extents are big enough, but that's a whole
> >different ball game...
>
> I was quoting a real-world / live-data scenario; DRBD handles it just fine.
> .. but it is a different mechanism to gluster.
>
> >Sounds like a reasonably sane solution to me.
>
> It is. It also makes Gluster useless in this scenario.
>
> >Why would the cluster effectively be down? Other nodes would still be
> >able to serve that file.
>
> Nope, it won't replicate the file while another node has it locked .. which means you effectively need to close all files in order to kick off the replication process, and the OPEN call will not complete until the file has replicated ..
> so effectively (a) you need to restart all your processes to make them close and re-open their files (or HUP them .. or whatever), then those processes will all freeze until the files they are trying to open have replicated.
>
> >Or are you talking about the client-side AFR?
>
> Mmm, it's been a while; I'm not entirely sure whether I tested the issue on the client side or the server side.
> Are you telling me that server-side will work quite happily and it's only client-side that has all these issues?
>
> >I have to say, a one-client/multiple-servers scenario sounds odd.
> >If you don't care about downtime (you have just one client node so that's
> >the only conclusion that can be reached), then what's the problem with a bit more downtime?
>
> My live scenario was 4 (2x2) AFR servers with ~12 clients.
>
> Obviously this setup is no longer available to me, as it proved to be useless in practice.
>
> I'm currently revisiting Gluster with another "new" requirement (as per my last email) .. currently I'm testing a 2 x server + 1 x client setup with regards to load balancing and use over a slow line. Obviously (!) both servers can also act as clients, so I guess to be pedantic you'd call it 2 servers + 3 clients. My point was I have 1 machine with no server.
>
> Gareth.
>
> --
> Managing Director, Encryptec Limited
> Tel: 0845 5082719, Mob: 0785 3305393
> Email: gareth@xxxxxxxxxxxxx
> Statements made are at all times subject to Encryptec's Terms and Conditions of Business, which are available upon request.
>
> ----- Original Message -----
> From: "Gordan Bobic" <gordan@xxxxxxxxxx>
> To: gluster-devel@xxxxxxxxxx
> Sent: Friday, April 25, 2008 9:40:00 PM GMT +00:00 GMT Britain, Ireland, Portugal
> Subject: Re: Re; Load balancing ...
>
> Gareth Bult wrote:
>
> >> If you have two nodes and the 20 GB file
> >> only got written to node A while node B was down and
> >> node B comes up, the whole 20 GB is resynced to node B;
> >> is that more network usage than if the 20 GB file were
> >> written immediately to node A & node B?
> >
> > Ah. Let's say you have both nodes running with a 20Gb file synced.
> > Then you have to restart glusterfs on one of the nodes.
> > While it's down, let's say the other node appends 1 byte to the file.
> > When it comes back up and looks at the file, the other node will see it's out of date and re-copy the entire 20Gb.
>
> You're expecting a bit much here - for any shared/clustered FS. DRBD
> might come close if your extents are big enough, but that's a whole
> different ball game...
>
> >> Perhaps the issue is really that the cost comes at an
> >> unexpected time, on node startup instead of when the
> >> file was originally written? Would a startup
> >> throttling mechanism help here on resyncs?
> >
> > Yes, unfortunately you can't open a file while it's syncing .. so when you reboot your gluster server, downtime is the length of time it takes to restart glusterfs (or the machine, either way..) PLUS the amount of time it takes to recopy every file that was written to while one node was down ...
>
> Sounds like a reasonably sane solution to me.
>
> > Take a Xen server, for example, serving disk images off a gluster partition.
> > 10 images at 10G each gives you a 100G copy to do.
>
> If they are static images, why would they have changed? What you are
> describing would really be much better accomplished with a SAN+GFS or
> Coda, which is specifically designed to handle disconnected operation at
> the expense of other things.
>
> > Wait, it gets better ..
> it will only re-sync the file on opening, so you actually have to close all the files, then try to re-open them, then wait while it re-syncs the data (during this time your cluster is effectively down), then the file open completes and you are back up again.
>
> Why would the cluster effectively be down? Other nodes would still be
> able to serve that file. Or are you talking about the client-side AFR?
> I have to say, a one-client/multiple-servers scenario sounds odd. If you
> don't care about downtime (you have just one client node, so that's the
> only conclusion that can be reached), then what's the problem with a bit
> more downtime?
>
> > Yet there is a claim in the FAQ that there is no single point of failure .. yet to upgrade gluster, for example, you effectively need to shut down the entire cluster in order to get all files to re-sync ...
>
> Wire protocol incompatibilities are indeed unfortunate. But on one hand
> you speak of manual failover and SPOF clients, and on the other you speak
> of unwanted downtime. If this bothers you, have enough nodes that you
> could shut down half (leaving half running), upgrade the downed ones,
> bring them up and migrate the IPs (heartbeat, RHCS, etc.) to the upgraded
> ones, and then upgrade the remaining nodes. The downtime should be seconds
> at most.
>
> > Effectively, storing anything like a large file on AFR is pretty unworkable and makes split-brain issues pale into insignificance ... or at least that's my experience of trying to use it...
>
> I can't help but think that you're trying to use the wrong tool for the
> job here. A SAN/GFS solution sounds like it would fit your use case better.
>
> Gordan
>
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel@xxxxxxxxxx
> http://lists.nongnu.org/mailman/listinfo/gluster-devel