Kevan Benson wrote:
Gordan Bobic wrote:
Derek Price wrote:
I'm not saying I don't want to see a more robust solution for
client-side AFR, just that each configuration has its place, and
client-side AFR isn't currently (and may never be) capable of serving
a share that requires high data integrity.
As far as I can see, there is no practical difference in this regard
between client and server-side AFR. Throw multiple clients at multiple
servers, and you have the exact same problem.
Exactly my point. The difference is that it's much easier to put a
constraint on the clients as to which servers they are allowed to talk
to. Put a fail-over IP on the servers, and force all clients to address
one server. It's one step in the direction of eliminating race
conditions. It's not perfect, but it's far simpler (in my opinion) than
GFS with its complexity, and the trade-off may be worth it in certain
circumstances.
I think the distributed "equal peers" approach is generally superior
to single-master, multiple-slave approaches. For a start, the
master-failure condition can be dealt with much more gracefully when
there is no discrete master. This also makes it more scalable and more
redundant. Graceful degradation is an important aspect.
If you think fixing this current issue will solve your problems,
maybe you haven't considered the implications of connectivity
problems between some clients and some (not all) servers... Add in
some clients with slightly off timestamps and you might have some
major problems WITHOUT any reboots.
Exactly what I'm thinking. But then we're back to a tightly coupled
cluster FS design like GFS or OCFS2: implicit write locking,
journalled metadata for replay, and quorum requirements. Just about
the only thing that can be sanely avoided is mandatory fencing, and
that only because there is no shared storage, so one node going nuts
cannot trash the entire FS after the other nodes boot it out.
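To make the quorum requirement concrete, the rule itself is trivial
(a pseudo-Python sketch, not actual GFS/OCFS2 code):

    def has_quorum(reachable, cluster_size):
        # Strict majority: floor(n/2) + 1 of the configured nodes
        # must be reachable before locks are granted or journal
        # entries committed.
        return reachable >= cluster_size // 2 + 1

    # In a 5-node cluster, a 2-node partition loses quorum:
    assert has_quorum(3, 5) and not has_quorum(2, 5)

The point is that at most one side of a network partition can ever
satisfy it, so the two halves can never both accept writes.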
If anybody can come up with an idea of how to achieve guaranteed
consistency across all nodes without the above, I'd love to hear about
it.
Am I getting this straight? Even with server-side AFR, you get
mirrors, but if all the clients aren't talking to the same server
then there is no forced synchronization going on? How hard would it
be to implement some sort of synchronization/locking layer over AFR,
such that reads and writes could go to the nearest (read: fastest)
server yet still be guaranteed to be in sync?
You'd need global implicit write locks.
In other words, the majority of servers would know of new version
numbers being written anywhere and yet reads would always serve local
copies (potentially after waiting for synchronization).
Unlocked files can always be read. Read-locked files can always be
read. Write-locked files can, in theory, be neither read nor written
until unlocked, because their contents are not guaranteed to be
consistent until the write completes. The write lock also cannot be
released until the metadata is journalled and all the connected nodes
have acknowledged the write back.
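In sketch form, the rules look something like this (illustrative
Python only, not code from any of these file systems):

    from enum import Enum

    class LockState(Enum):
        UNLOCKED = 0
        READ_LOCKED = 1
        WRITE_LOCKED = 2

    def readable(state):
        # Unlocked and read-locked files can always be served; a
        # write-locked file cannot, because its replicas are not
        # guaranteed to be consistent until the write completes.
        return state is not LockState.WRITE_LOCKED

    def write_lock_releasable(journal_committed, acks, connected):
        # The lock stays held until the metadata is journalled AND
        # every currently connected node has acknowledged the write.
        return journal_committed and acks >= connected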
The problem with this is that you still have to verify on every read
that there are no current write-locks in place. With strong quorum
requirements and write-lock synchronisation, this could potentially be
done away with, though, if you can guarantee that all connected nodes
will always be aware of all write locks, and can acknowledge them
before the lock is granted. This would mean, theoretically, NFS /
local read performance, with an unavoidable write overhead. But that
is not too bad, because under normal circumstances reads outnumber
writes by orders of magnitude.
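In code terms, the read path could then stay entirely local. A
minimal sketch, assuming a hypothetical per-node lock table that the
quorum machinery keeps in sync:

    import threading

    class LocalReadPath:
        def __init__(self, local_store):
            self.local_store = local_store   # path -> local replica
            self.write_locked = set()        # paths under write lock
            self.changed = threading.Condition()

        def read(self, path):
            # No network round trip: quorum plus synchronous lock
            # propagation guarantee this node already knows of every
            # write lock in the cluster, so we only consult the
            # local table, waiting out any write in flight.
            with self.changed:
                while path in self.write_locked:
                    self.changed.wait()
                return self.local_store[path]

The write path would pay for this by having to push each lock to all
connected nodes before it is granted.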
The application I'm thinking of is virtualized read/write storage.
For example, say you want to share some sort of data repository with
offices in Europe, India, and the U.S. and you only have slow links
connecting the various offices. You would want all client access to
happen against a local mirror, and you would want to restrict traffic
between the mirrors to that absolutely required for locking and data
synchronization.
The data transfers are already minimized if you have server-side AFR
set up between sites (one mirror server on each site, with multiple
clients at each site only connecting to the local server).
The only thing I'm not quite sure of in this model is what to do if
the server processing a write operation crashes before the write
finishes. I wouldn't want reads against the other mirrors to have to
wait indefinitely for the crashed server to return, so the best I can
come up with is this: "write locks" on any files that hadn't been
mirrored to at least one available server before the crash would be
revoked on the first subsequent attempted access of the
unsynchronized file. Then, when the crashed server came back up and
tried to synchronize, it would find that its copy wasn't the current
version and would sync in the other direction.
You heartbeat node status between the nodes. When a node drops out,
the other nodes, after a few seconds' grace period, boot it out and
release all its locks. Unless a node holds a lock, its journal gets
discarded on writes, so it cannot commit. Note that this means both
file and directory metadata journalling, and they need to be combined
to handle deleting and re-creating the same file name. If a file gets
deleted, it should be safe to just record in the directory metadata
journal that the file was deleted at version X. All previous versions
in the journal can then be discarded, and the file metadata journal
can also be reset to free up space. Until all connected nodes have
acknowledged the journal commit, the file and directory metadata
cannot be discarded.
One exception could be where the number of file resync journal
entries exceeds its storage limit, in which case we mark the file for
full resync. The only tricky part then is dynamically adding and
removing nodes from the cluster, but that can be solved in a
reasonably straightforward way, by on-line quorum adjustments.
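A rough sketch of the ejection and delete-compaction parts
(illustrative Python; the lock_table object and its release_all()
method are hypothetical):

    import time

    GRACE = 5.0  # seconds without a heartbeat before a node is booted

    def eject_dead_nodes(last_heartbeat, lock_table):
        # Nodes whose heartbeat is older than the grace period get
        # booted out and all of their locks released, so reads never
        # wait indefinitely on a crashed writer.
        now = time.monotonic()
        dead = [n for n, t in last_heartbeat.items() if now - t > GRACE]
        for node in dead:
            lock_table.release_all(node)
            del last_heartbeat[node]
        return dead

    def record_delete(dir_journal, name, version):
        # "name deleted at version X" supersedes every earlier entry
        # for that name, so those entries (and the file's own
        # metadata journal) can be discarded to free up space.
        dir_journal[:] = [e for e in dir_journal if e["name"] != name]
        dir_journal.append({"name": name, "op": "delete",
                            "version": version})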
But the important point is that this will likely require a lot of
time, effort and thought to implement. :-(
Thus the complexity of the other cluster file systems out there.
Indeed - it isn't a simple problem, but GlusterFS is interesting in
that it is distributed and more scalable than similar solutions. It
just seems to be in need of inheriting some features from the other,
more complex cluster file systems to give it the ability to ensure
consistency across the distributed data store. It turns out that
consistency and incremental updates for fast syncing are not as
separate issues as they may have originally appeared - both can be
solved by journalling.
Thanks for the fairly in-depth assessment of the problems. That should
clear the air a bit.
Glad I can help. In case you couldn't tell, I've been up to my eyeballs
in various cluster file systems for quite a while now.
Gordan