On Tuesday 29 September 2009 @ 18:41, Bron Gondwana wrote:
> Possibly the secret is that we use IPAddr2 from linux-ha to force
> ARP flushes, and we transfer the primary IP address between
> machines, so nothing else needs to know - we just shut down one end
> and bring up the other with the IP and it's all good.

Our primaries and replicas are located in different data centers, and
since we have no control over how the network is subdivided, it's
impossible for them to take the same IPs.

> Our process is:
>
> a) check there are less than 10kb of files in $conf/sync/ - else
>    abort
> b) shut down the master (host A)
> c) run sync_client -f $file on each file in $conf/sync (if any)
> c2) (if any sync fails, restart the master (host A))
> d) shut down the replica (host B)
> e) update the database with the new master location
> f) start up the replica (host A)
> g) start up the master (host B)
>
> This means replication starts immediately, because the replica is
> already there when the master starts.

So you just immediately start replicating back to a host (or site)
that just failed? How does that work?

We have a third level of machines that we sync to, in an out-of-band
process, but the data is stored exactly the same way, so we can start
replicating to them immediately. So even if an entire data center
failed, we can still be running a fully replicated service with almost
no downtime visible to users.

Brian
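
For anyone following the thread, the switchover steps quoted above (a
through g) can be sketched roughly as follows. This is only an
illustration of the ordering, not anyone's actual script: the Ops
hooks, SwitchAborted, and switch_master are hypothetical names, the
10kb limit is read as 10 * 1024 bytes, and each hook is assumed to
wrap whatever really stops or starts Cyrus, runs sync_client -f on a
file, or updates the database at a given site.

```python
import os
from dataclasses import dataclass
from typing import Callable, List

# "less than 10kb" from the quoted step a), assumed to mean 10 * 1024 bytes.
SYNC_BACKLOG_LIMIT = 10 * 1024


class SwitchAborted(Exception):
    """Raised when the switchover stops before the roles are swapped."""


@dataclass
class Ops:
    # Every callable here is a hypothetical hook supplied by the caller.
    stop: Callable[[str], None]           # stop the mail service on a host
    start_master: Callable[[str], None]   # start a host in the master role
    start_replica: Callable[[str], None]  # start a host in the replica role
    sync_file: Callable[[str], bool]      # e.g. wraps `sync_client -f <path>`
    set_master: Callable[[str], None]     # record the new master in the database


def pending_sync_files(sync_dir: str) -> List[str]:
    """Return the regular files in sync_dir, sorted by name."""
    paths = [os.path.join(sync_dir, name) for name in sorted(os.listdir(sync_dir))]
    return [p for p in paths if os.path.isfile(p)]


def switch_master(sync_dir: str, host_a: str, host_b: str, ops: Ops) -> None:
    """Move the master role from host_a to host_b, following steps a) to g)."""
    files = pending_sync_files(sync_dir)

    # a) abort if the unsynced backlog is not under the limit
    if sum(os.path.getsize(p) for p in files) >= SYNC_BACKLOG_LIMIT:
        raise SwitchAborted("sync backlog too large")

    # b) shut down the master (host A)
    ops.stop(host_a)

    # c) flush each pending sync file to the replica
    for path in files:
        if not ops.sync_file(path):
            # c2) a sync failed: bring the old master back and give up
            ops.start_master(host_a)
            raise SwitchAborted(f"sync failed for {path}")

    # d) shut down the replica (host B)
    ops.stop(host_b)
    # e) update the database with the new master location
    ops.set_master(host_b)
    # f) start host A as the replica, so it is ready when the master starts
    ops.start_replica(host_a)
    # g) start host B as the new master
    ops.start_master(host_b)
```

Passing the hooks in as callables keeps the ordering logic separate
from site-specific commands, and it lets the sequence be dry-run with
stub functions before pointing it at real hosts.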