Hans K. Rosbach wrote:
> On Wed, 2011-06-08 at 12:34 +0100, Gordan Bobic wrote:
>> Hans K. Rosbach wrote:
>>> -SCTP support, this might not be a silver bullet but it feels
>>> [...]
>>> Features that might need glusterfs code changes:
>>> [...]
>>> -Multihoming (failover when one nic dies)
>> How is this different to what can be achieved (probably much more
>> cleanly) with NIC bonding?
> NIC bonding is nice for a small network, but routed networks might
> benefit from this. This is not something I feel that I need, but I am
> sure it would be an advantage for some other users. It could possibly
> help in geo-replication setups, for example.
Not sure what routedness has to do with this. If you need route failover
this is probably something best done by having a HA/cluster service
change the routing table accordingly.
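Something along these lines would do (a rough, untested sketch in Python;
the gateways, interfaces and ping test are placeholders, and in practice
you would let heartbeat/keepalived or your cluster manager own this):

  #!/usr/bin/env python
  # Watchdog sketch: fail the default route over to a backup gateway
  # when the primary stops answering pings. Addresses/interfaces below
  # are made up.
  import subprocess, time

  PRIMARY = ("192.168.1.1", "eth0")   # placeholder primary gateway/nic
  BACKUP  = ("192.168.2.1", "eth1")   # placeholder backup gateway/nic

  def gateway_up(gw):
      # Single ping with a 2 second timeout; good enough for a sketch.
      return subprocess.call(["ping", "-c", "1", "-W", "2", gw]) == 0

  def set_default_route(gw, dev):
      # "ip route replace" is idempotent, so repeating it is harmless.
      subprocess.call(["ip", "route", "replace", "default",
                       "via", gw, "dev", dev])

  current = None
  while True:
      wanted = PRIMARY if gateway_up(PRIMARY[0]) else BACKUP
      if wanted != current:
          set_default_route(*wanted)
          current = wanted
      time.sleep(5)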
>>> -Ability to have the storage nodes autosync themselves.
>>> In our setup the normal nodes have 2x1Gbit connections while the
>>> storage boxes have 2x10Gbit connections, so having the storage
>>> boxes use their own bandwidth and resources to sync would be nice.
>> Sounds like you want server-side rather than client-side replication.
>> You could do this by using afr/replicate on the servers, and export via
>> NFS to the clients. Have failover handled as for any normal NFS server.
> We have considered this, and might decide to go down this route
> eventually; however, it seems strange that this cannot also be done
> using the native client.
Is the current NFS wheel not quite round enough for you? ;)
> The fact that each client writes to both servers is fine, but the
> fact that the clients need to do the re-sync work whenever the
> storage nodes are out of sync (one of them rebooted, for example)
> seems strange and feels very unreliable, especially since this is
> a manual operation.
There is a plan C, though. You can make the servers also clients. You
can then have a process that does "ls -laR" periodically or upon failure.
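Something as dumb as the following, run from cron on the servers, is all
that process needs to be (a minimal sketch; the mount point is a
placeholder, and stat'ing every entry through the client mount is what
triggers the self-heal):

  #!/usr/bin/env python
  # Walk the whole glusterfs mount and stat everything -- the moral
  # equivalent of "ls -laR". The mount point below is just an example.
  import os

  MOUNT = "/mnt/glusterfs"

  for root, dirs, files in os.walk(MOUNT):
      for name in dirs + files:
          try:
              os.lstat(os.path.join(root, name))
          except OSError:
              pass  # entry vanished while we were walking, ignore it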
>>> -An ability for the clients to subscribe to metadata updates for
>>> a specific directory would also be nice, so that they can cache that
>>> folder's stats while working there and still know that they will not
>>> miss any changes. This would perhaps increase overhead in large
>>> clusters but could improve performance by a lot in clusters where
>>> several nodes work in the same folder (a mail spool, for example).
>> You have a shared mail spool on your nodes? How do you avoid race
>> conditions on deferred mail?
> Several nodes can deliver mails to the spool folder, and dedicated queue
> runners will pick them up and deliver them to local and/or remote hosts.
> I am not certain what race conditions you are referring to, but locking
> should make sure no more than one queue runner touches a file at a
> time. Am I missing something?
Are you sure your MTA applies locks suitably? I wouldn't bet on it. I
would expect that most of them assume unshared spools. Also remember
that locking is a _major_ performance bottleneck when it comes to
cluster file systems. Multiple nodes doing locking and r/w in the same
directory will have an inverse scaling impact on performance, especially
on small I/O such as you are likely to experience on a mail spool.
If there is no file locking you will likely see non-deterministic
multiple sending of mail, especially deferred mail. Depending on how
your MTA produces mail spool file names, you may see non-deterministic
silent clobbering, too, if it doesn't do parent directory locking on
file creation/deletion.
If there is locking, you will likely see that the performance starts to
reduce as you add more servers due to lock contention.
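To be concrete about what "locking" means here, each queue runner has to
do something like the following for every message it touches (a rough
Python sketch; deliver() is a stand-in for the real delivery step, and
real MTAs differ in how, and whether, they lock):

  import errno
  import fcntl

  def process_spool_file(path):
      f = open(path, "r+")
      try:
          # POSIX lock, non-blocking: if another queue runner already
          # holds it, skip this message and move on.
          fcntl.lockf(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
      except (IOError, OSError) as e:
          if e.errno in (errno.EACCES, errno.EAGAIN):
              f.close()
              return False
          raise
      try:
          deliver(f)  # hypothetical delivery step
      finally:
          fcntl.lockf(f, fcntl.LOCK_UN)
          f.close()
      return True

On a cluster file system every one of those lock/unlock calls has to go
over the wire, and that is exactly where the contention I am talking
about comes from.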
Gordan