Re: IP Failover

John Klingler <john@xxxxxxxxxxx> · Tue, 07 Oct 2003 12:55:56 -0700

If anyone is interested: in my quest for a networking solution which
provides IP Failover on heterogeneous redundant networks, I found the
solutions listed below. I would welcome comments from anyone who is
familiar with these.

  faild - A daemon program which monitors the Ethernet connections
and changes the routing tables when a failure is detected (I have
included a description below). IP Failover is all this simple program
does. Being simple, however, makes it small and easy to port.

  High Availability Linux Project (HAL) (http://linux-ha.org/)
has code available for FreeBSD and Solaris (and is probably reasonably
portable to other UNIX platforms). It supports virtual (redundant)
servers, so it could probably be configured to support redundant LANs.

  Advanced Network Services (ANS 2.3.x) for Linux* Operating Systems
is available from Intel on both PCs and UNIX OSes. ANS provides IP
Failover and much more, such as switch failover, load leveling, etc.
See: http://www.intel.com/support/network/adapter/onlineguide/PRO1000/DOCS/SERVER/index.htm.

  Linux Virtual Server Project (LVS) - VRRPD,
Virtual Router Redundancy Protocol (http://off.net/~jme/vrrpd/),
which also provides IP Failover. It implements RFC 2338 and is currently
available only on Linux, but may be portable. As with HAL, it is probably
configurable to provide redundant LANs.

It seems the days of industry-wide standards and interoperability are
becoming casualties of war. 

John Klingler

Automatic IP Failover: faild 

Figure 1 shows a typical redundant network configuration where all
nodes are connected to two separate Ethernet LANs (here referred to as
Ethernet A and Ethernet B). Each node must have two Ethernet
interfaces, one for each LAN. Distinct IP addresses are assigned to all
Ethernet interfaces.

      ________________________ . . .
          |             |
        Host 1        Host 2
      ____|_____________|_____ . . .

      Figure 1: Typical Redundant Network Configuration

A route monitor daemon is started on all nodes. Each daemon is
configured to be either a responder or both a requestor and responder.
Typically the host daemons are requestor/responders. 

Requestor daemons broadcast inquiry (INQ) packets on all available
networks at a specified interval. Upon receiving an INQ, each responder
daemon sends back an acknowledgment (ACK) via the same route. These
packets are all sent using UDP (User Datagram Protocol) so the
daemons can quickly detect if a route is active.
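
For anyone who wants to play with the idea, here is a rough Python sketch of the INQ/ACK exchange described above. It is not faild's code: the payload bytes, function names and timeouts are all assumptions (the wire format is not documented here), and it sends a unicast probe to one responder rather than broadcasting, so it can be tried on a single machine.

```python
import socket
import threading

# Hypothetical payloads; the real faild packet format is not given in this post.
INQ = b"INQ"
ACK = b"ACK"

def run_responder(sock, stop):
    """Answer every INQ with an ACK sent back to the sender's address."""
    sock.settimeout(0.1)  # wake up regularly so the stop flag is noticed
    while not stop.is_set():
        try:
            data, peer = sock.recvfrom(64)
        except OSError:  # includes socket.timeout
            continue
        if data == INQ:
            sock.sendto(ACK, peer)

def probe(dest, timeout=0.5):
    """Send one INQ to dest and report whether an ACK came back in time."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        try:
            s.sendto(INQ, dest)
            data, _ = s.recvfrom(64)
        except OSError:  # timeout or an ICMP-induced error: treat as no reply
            return False
        return data == ACK
```

A requestor would call something like probe() once per timer interval for each interface and feed the True/False results into its failure logic.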

If the requestor daemon does not get ACKs from a given node and if the
responder daemon does not get INQs as expected, then each daemon
independently determines that the particular route has become
unreliable, or more likely, has gone dead. Each daemon then changes its
local system routing tables so future traffic will be routed over the
alternate (and presumably healthy) LAN. This detection and failover
occurs very quickly, in a matter of a few seconds, depending on how the
daemon's timing parameters are set. 

When a route fails, network traffic carried by reliable protocols (such
as X Window traffic via TCP -- Transmission Control Protocol) is held
in abeyance until the IP stack recognizes that packets are not getting
through. When the IP stack times out, packets waiting for delivery will
be retransmitted. Since the daemon has changed the routing tables, the
retransmitted packets will go via the new route.

The IP time-out time is the critical parameter determining how long it
will take from initial route failure to establishing communication over
the new route. This parameter may or may not be user-settable on your
system. Field experience so far indicates lag times of 20-40 seconds
before communication resumes. 

As soon as the original route becomes reliable again, the daemon will
restore the routing tables and communication resumes over the original
interface. There should be no noticeable delay on the switchback.
Request packet interval, failover interval, and switchback interval are
all configurable. 
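
The failover and switchback behaviour described above can be pictured as a small state machine per interface. The Python sketch below is only my illustration, not the daemon's actual logic: the class and parameter names are made up, and it assumes that missed and good packets are counted consecutively, which the description does not actually state.

```python
class RouteMonitor:
    """Track one interface's validity from a stream of probe results.

    fail_after and restore_after are hypothetical names standing in for
    the failover and switchback thresholds mentioned in the text.
    """

    def __init__(self, fail_after=3, restore_after=5):
        self.fail_after = fail_after
        self.restore_after = restore_after
        self.valid = True   # interface starts out trusted
        self._missed = 0    # consecutive probes without a reply
        self._good = 0      # consecutive probes with a reply

    def record(self, got_reply):
        """Feed one probe result; return 'failover', 'switchback', or None."""
        if got_reply:
            self._missed = 0
            self._good += 1
            if not self.valid and self._good >= self.restore_after:
                self.valid = True
                return "switchback"  # caller would restore the original route
        else:
            self._good = 0
            self._missed += 1
            if self.valid and self._missed >= self.fail_after:
                self.valid = False
                return "failover"    # caller would reroute over the other LAN
        return None
```

Under these assumptions, a 1-second packet interval with fail_after=3 would trigger the route change roughly three seconds after replies stop, which is in line with the "few seconds" quoted above. The routing table change itself is OS-specific and is left to the caller.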

To initiate a failover daemon on your host system, use the following
convention: 

faild [-r] [-t <n>] [-f <n>] [-s <n>] [-p <n>] [-l <p>]

-r : launch as a requestor 

-t <n> : timer interval (in secs) for sending of pkts 

-f <n> : num missed pkts before the interface is invalidated 

-s <n> : num good pkts before the interface is revalidated 

-p <n> : port number to use 

-l <p> : full path to message log file 
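
If you want to experiment with the same option set in a prototype, the options listed above could be mirrored with Python's argparse roughly as follows. This is only an illustrative sketch, not faild's own parsing code: the destination names (requestor, interval and so on) are invented here, and since no defaults are documented, unspecified options are simply left as None.

```python
import argparse

def build_parser():
    """Build a parser mirroring the faild options listed above."""
    p = argparse.ArgumentParser(
        prog="faild", description="IP failover route monitor (sketch)")
    p.add_argument("-r", dest="requestor", action="store_true",
                   help="launch as a requestor")
    p.add_argument("-t", dest="interval", type=int, metavar="n",
                   help="timer interval (in secs) for sending packets")
    p.add_argument("-f", dest="fail_after", type=int, metavar="n",
                   help="missed packets before the interface is invalidated")
    p.add_argument("-s", dest="restore_after", type=int, metavar="n",
                   help="good packets before the interface is revalidated")
    p.add_argument("-p", dest="port", type=int, metavar="n",
                   help="port number to use")
    p.add_argument("-l", dest="logfile", metavar="p",
                   help="full path to message log file")
    return p
```

For example, build_parser().parse_args(["-r", "-t", "1", "-f", "3"]) yields a namespace with requestor=True, interval=1 and fail_after=3; the values here are arbitrary, not recommended settings.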

  Note: This daemon currently runs on VxWorks, Digital UNIX
and Solaris, and is being ported to OpenVMS. Any other platforms would
require porting the daemon to the target OS. 
