request for comments

Anand Avati <avati@xxxxxxxxxxxxx> · Tue, 1 May 2007 09:05:28 -0700

here is a design proposal for some changes to AFR and related
translators. currently AFR is handled entirely on the client side, where
the client does the replication as well as the failover. the AFR
translator is essentially providing _two_ features: 1. replication
2. failover.

In view of the race condition in AFR recently discussed on the mailing
list (two clients writing to the same region running into a race while
writing to the second mirror) and for the other benefits mentioned below,
the proposal is to split replication and failover into two separate
translators. replication is meant to be loaded on the server side
while failover alone is meant to be loaded on the client side.
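
to make the race concrete, here is a toy model (not glusterfs code; every
name in it is made up for illustration). it assumes a region simply keeps
the last write applied to it, and shows how two clients that reach the two
mirrors in different orders leave the mirrors different, while a single
agent replaying one agreed order cannot.

```python
# toy model only: a "region" keeps whatever was written to it last.
# none of these names exist in glusterfs.

def apply_writes(order):
    """apply (client, data) writes to one region in order; last write wins."""
    region = None
    for _client, data in order:
        region = data
    return region

w1 = ("client1", b"AAAA")
w2 = ("client2", b"BBBB")

# client-side replication: nothing forces both mirrors to see the same order.
mirror_a = apply_writes([w1, w2])   # client1 reaches mirror A first
mirror_b = apply_writes([w2, w1])   # client2 reaches mirror B first
client_side_diverged = mirror_a != mirror_b

# server-side replication: one agent picks an order and uses it everywhere.
agreed_order = [w1, w2]
replicas = [apply_writes(agreed_order) for _ in range(3)]
server_side_consistent = len(set(replicas)) == 1
```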

imagine grouping your storage cluster into pairs or triplets or
quadruplets. the AFR translator will be loaded to form these groups,
but on the server side. each member of the (say) triplet will load
AFR with one child as the storage/posix and the other two children as
protocol/clients for the auxiliary exports of the remaining two
servers. thus the effect is,

* when you write to one server, it goes to all the three (redundancy)
* and, you can write via any server (used for failover)
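
a rough sketch of the triplet layout described above, as a small model
rather than a real volume spec. the server names, the dict layout and the
write_via helper are all assumptions made for illustration; which export of
a peer the protocol/client child actually connects to is not modelled.

```python
# hypothetical model of the proposed layout; names are illustrative only.

def build_afr_group(servers):
    """for each server, list the children its server-side AFR would load."""
    group = {}
    for server in servers:
        group[server] = {
            "local": "storage/posix",
            "remote": [peer for peer in servers if peer != server],
        }
    return group

def write_via(entry, group, disks, path, data):
    """send a write to AFR on `entry`; it fans out to every child."""
    disks[entry][path] = data              # storage/posix child
    for peer in group[entry]["remote"]:    # protocol/client children
        disks[peer][path] = data

servers = ["server1", "server2", "server3"]
group = build_afr_group(servers)
disks = {s: {} for s in servers}

write_via("server1", group, disks, "/a", b"one")   # write via one server
write_via("server3", group, disks, "/b", b"two")   # or via any other
```

after both calls every disk in the model holds both files, whichever server
the write entered through.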

under normal conditions, the failover translator at the client uses the
'primary child' (the non-auxiliary export server) and operations are
performed only on that child. the server side takes care of replication.
when that server goes down, failover detects the broken link and uses the
aux export.
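
the client-side behaviour could look roughly like the sketch below. it is
only a model: Child, Failover and LinkDown are invented names, a "broken
link" is just a flag, and things like preserving open FDs across the switch
are left out.

```python
# hypothetical sketch of a failover layer; not the glusterfs translator API.

class LinkDown(Exception):
    """raised by a child when its link to the server is broken."""

class Child:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.calls = []

    def call(self, op, *args):
        if not self.up:
            raise LinkDown(self.name)
        self.calls.append((op, args))
        return f"{op} ok via {self.name}"

class Failover:
    def __init__(self, primary, auxiliaries):
        # the primary child is always tried first
        self.children = [primary, *auxiliaries]

    def call(self, op, *args):
        for child in self.children:
            try:
                return child.call(op, *args)
            except LinkDown:
                continue   # broken link detected, try the next export
        raise LinkDown("all children are down")

primary = Child("server1")
aux = Child("server2-aux")
fo = Failover(primary, [aux])

before = fo.call("write", 0, b"x")   # normal case: primary only
primary.up = False                   # server1 goes away
after = fo.call("write", 1, b"y")    # same call now lands on the aux export
```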

advantages:

1. since a file is replicated by a single agent, there are no potential
race conditions (most important)

2. the failover abstraction works for non-AFR scenarios also. you can
use the failover translator to fail over between two network links to
the same server. (generally use infiniband, but fail over to gigabit
totally seamlessly, even preserving open FDs)

3. the client writes to only one server, a tremendous saving of
bandwidth on the link between client and server.

4. self-heal checks can be performed in a more deterministic manner
since they are done by the 'primary child' server. there are no
questions like 'what if two children try to heal together' or 'what if
no client is mounted at all'

5. extensions to AFR (like very-lazy replication, on close()) will be
a lot easier. the client submits a write to any server and forgets.

6. 'transaction replay' kind of features become easier to implement
by preserving unwritten write() data with offset etc. on the server itself
(doing such things with AFR on the client is unreliable since the client
can always umount)

7. on the client side failover is not the only option, even a
'loadbalance' translator will be a good choice (which takes care of not
scheduling calls to a link which is down). thus AFR will work
hand-in-hand with failover and/or loadbalancing, however the user
prefers. (of course loadbalance will work as its own abstraction, where
you can use it just to loadbalance network links (remember somebody
asking for this on the mailing list))
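
for point 6, a minimal sketch of what 'transaction replay' on the server
could mean: writes that cannot reach a mirror are kept with their offset
and replayed in order once the mirror is back. Mirror and
ReplayingReplicator are invented names, the queue lives in memory only,
and this is one possible reading of the idea, not a design.

```python
# hypothetical sketch; nothing here is an existing glusterfs interface.

class Mirror:
    def __init__(self):
        self.data = bytearray()
        self.up = True

    def write(self, offset, payload):
        if not self.up:
            raise ConnectionError("mirror is down")
        end = offset + len(payload)
        if len(self.data) < end:
            self.data.extend(b"\0" * (end - len(self.data)))
        self.data[offset:end] = payload

class ReplayingReplicator:
    def __init__(self, mirrors):
        self.mirrors = mirrors
        self.pending = {i: [] for i in range(len(mirrors))}

    def write(self, offset, payload):
        for i, mirror in enumerate(self.mirrors):
            if self.pending[i]:
                # keep ordering: never overtake writes still awaiting replay
                self.pending[i].append((offset, payload))
                continue
            try:
                mirror.write(offset, payload)
            except ConnectionError:
                self.pending[i].append((offset, payload))

    def replay(self, i):
        """push queued writes to mirror i, oldest first."""
        while self.pending[i]:
            offset, payload = self.pending[i][0]
            self.mirrors[i].write(offset, payload)  # raises if still down
            self.pending[i].pop(0)

a, b = Mirror(), Mirror()
rep = ReplayingReplicator([a, b])
rep.write(0, b"hello")
b.up = False
rep.write(5, b" world")      # reaches a, queued for b
queued = list(rep.pending[1])
b.up = True
rep.replay(1)                # b catches up
```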

my instinct tells me there are more advantages i could list if i thought
it over more.

i feel failover and loadbalance as generic layers will add a lot of
power and possibility for creative use, and AFR building on them fits
in nicely overall.

suggestions/comments ?

avati

-- 
ultimate_answer_t
deep_thought (void)
{ 
  sleep (years2secs (7500000)); 
  return 42;
}