Here is a design proposal for some changes to AFR and related translators.

Currently AFR is handled entirely on the client side: the client does the replication as well as the failover. The AFR translator is essentially providing _two_ features:

1. replication
2. failover

In view of the race condition in AFR discussed recently on the mailing list (two clients writing to the same region race while writing to the second mirror), and for the other benefits mentioned below, the proposal is to split replication and failover into two separate translators. Replication is meant to be loaded on the server side, while failover alone is meant to be loaded on the client side.

Imagine grouping your storage cluster into pairs, triplets or quadruplets. The AFR translator will be loaded to form these groups, but on the server side. Each member of the (say) triplet will load AFR with one child as the storage/posix and the other two children as protocol/clients for the auxiliary exports of the remaining two servers. The effect is:

* when you write to one server, it goes to all three (redundancy)
* and you can write via any server (used for failover)

Under normal conditions, the failover translator on the client uses the 'primary child' (the non-auxiliary export server) and operations are performed only on that child. The server side takes care of replication. When that server goes down, failover detects the broken link and uses the aux export.

Advantages:

1. Since a file is replicated by a single agent, there are no potential race conditions (most important).
2. The failover abstraction works for non-AFR scenarios also. You can use the failover translator to fail over between two network links to the same server (generally use InfiniBand, but fail over to gigabit totally seamlessly, even preserving open FDs).
3. The client writes to only one server, a tremendous saving of bandwidth on the link between client and server.
4. Self-heal checks can be performed in a more deterministic manner since they are done by the 'primary child' server. There are no questions like 'what if two children try to heal together' or 'what if no client is mounted at all'.
5. Extensions to AFR (like very lazy replication, on close()) will be a lot easier. The client submits a write to any server and forgets about it.
6. It becomes easier to implement 'transaction replay' kinds of features by preserving unwritten write() data, with offset etc., on the server itself (doing such things with AFR on the client is unreliable since the client can always umount).
7. On the client side failover is not the only option; a 'loadbalance' translator will also be a good choice (it takes care of not scheduling calls to a link which is down). Thus AFR will work hand in hand with failover and/or loadbalancing, however the user prefers. (Of course loadbalance will work with its own abstraction, so you can use it just to load-balance network links. I remember somebody asking for this on the mailing list.)

My instinct tells me there are more advantages I could list if I thought it over more. I feel failover and loadbalance as generic layers will add a lot of power and possibility for creative use, and AFR leveraging them fits in nicely overall.

Suggestions/comments?

avati

-- 
ultimate_answer_t
deep_thought (void)
{
  sleep (years2secs (7500000));
  return 42;
}
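
PS: to make the proposed data path a bit more concrete, here is a small Python model of it. This is only a sketch, not GlusterFS code: the names `Server`, `Failover`, `make_group` and `LinkDown` are hypothetical, and a dict stands in for storage/posix. It also assumes one particular reading of the proposal: peers replicate to each other through a local-only entry point (standing in for the auxiliary export), while a client that fails over talks to the replicating entry point of the next server in its list.

```python
class LinkDown(Exception):
    """Raised when a server cannot be reached (hypothetical error type)."""


class Server:
    """One member of a replica group; a toy model, not a real translator."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.store = {}   # stands in for storage/posix
        self.peers = []   # stands in for protocol/client links to the other members

    def store_local(self, path, offset, data):
        # Entry point used by peers: write locally only, no further fan-out.
        if not self.up:
            raise LinkDown(self.name)
        self.store[(path, offset)] = data

    def write(self, path, offset, data):
        # Entry point used by clients: this server is the single replicating agent.
        if not self.up:
            raise LinkDown(self.name)
        self.store_local(path, offset, data)
        for peer in self.peers:
            try:
                peer.store_local(path, offset, data)
            except LinkDown:
                pass  # a real design would have to record this for self-heal


class Failover:
    """Client-side failover: use the primary child, fall back on a broken link."""

    def __init__(self, children):
        self.children = children  # children[0] is the primary child

    def write(self, path, offset, data):
        for child in self.children:
            try:
                child.write(path, offset, data)
                return child.name  # name of the server that accepted the write
            except LinkDown:
                continue
        raise LinkDown("all children down")


def make_group(names):
    """Build a replica group in which every server has links to all the others."""
    servers = [Server(n) for n in names]
    for s in servers:
        s.peers = [p for p in servers if p is not s]
    return servers
```

With a triplet built by `make_group(["a", "b", "c"])` and a client `Failover([a, b, c])`, a write normally goes only to `a`, which fans it out to `b` and `c`. If `a.up` is set to False, the same client call lands on `b` instead, and `a` misses the write until something heals it.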