Re: Running an active/active firewall/router (xt_cluster?)

On 11.05.21 at 03:00, Paul Robert Marino wrote:
> Well, in the scenario where you don't control the upstream router, I would recommend putting a small routing switch stack in the middle. The reason being it solves a lot of the potential hardware issues around redundancy and load balancing as well. Ideally I always like to see a separate routing switch stack on both sides that can only be managed via an OOB network on dedicated ports.

We actually have a (non-redundant) routing switch on one end, and the mentioned routing stack on the other end, but _both_ are not controlled by us (we could take control of one,
but since the operators maintain the full switch infrastructure and are very cooperative and experienced, we preferred to leave this component to them as well for now).
So we'll definitely get in touch with the operators and try to convince them, which would save us extra hardware (but since the infrastructure on one end is shared, this may need some discussion).

> Back when I did this stuff on a regular large-scale basis (managing hundreds of firewalls), I would use cheap Avaya (originally Nortel, now Extreme Networks) ERS or VSP stacks for this because they had the right features at a reasonable price, but any switches that do stacking or can do multiple 10 Gbps uplinks with routing should do. That was what I always found to be the most stable configuration. There may also be advantages to using a switch stack from the same manufacturer as the upstream router. That also opens up the possibility of doing 100 Gbps to the intermediate switches and doing a more traditional primary/backup configuration on the firewalls.

Thanks! We actually have a general contract to "prefer" components from one of the (expensive) manufacturers, which is good to get things more homogeneous at least.
Let's see how the discussion turns out :-).

Cheers and many thanks,
	Oliver


On Mon, May 10, 2021, 7:21 PM Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:

    Also answering inline.

    On 11.05.21 at 00:55, Paul Robert Marino wrote:
     > I'm adding replies to your replies inline below
     >
     > On Mon, May 10, 2021, 5:55 PM Oliver Freyermuth
     > <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
     >>
     >> Hey Paul,
     >>
     >> many thanks for the detailed reply!
     >> Some comments inline.
     >>
     >> On 10.05.21 at 18:57, Paul Robert Marino wrote:
     >>> hey Oliver,
     >>> I've done similar things over the years, a lot of fun lab experiments,
     >>> and found it really comes down to a couple of things.
     >>> I did some POC testing with conntrackd and some experimental code
     >>> around trunking across multiple firewalls, with a sprinkling of
     >>> virtualization.
     >>> There were a few scenarios I tried, some involving Open vSwitch
     >>> (because I was experimenting with SPBM) and some not, with conntrackd
     >>> similarly configured.
     >>> All the scenarios were interesting, but they all had network issues
     >>> that were relatively rare on slow (<1 Gbps) networks and grew
     >>> exponentially in frequency on higher-speed networks (>1 Gbps), due to
     >>> latency in conntrackd syncing.
     >>
     >> Indeed, we'd strive for ~20 Gb/s in our case, so this experience surely is important to hear about.
     >>
     >>> What I found is that the best scenario was to use Quagga for dynamic
     >>> routing to load-balance the traffic between the firewall IPs,
     >>> keepalived to handle IP failover, and conntrackd (in a similar
     >>> configuration to the one you described) to keep the states in sync.
     >>> There are a few pitfalls in going down this route, caused by bad and/or
     >>> outdated documentation for both Quagga and keepalived. I'm also going
     >>> to give you some recommendations about some hardware topology stuff
     >>> you may not think about initially.
     >>
     >> I'm still a bit unsure if we are on the same page, but that may just be caused by my limited knowledge of Quagga.
     >> To my understanding, Quagga uses e.g. OSPF and hence can load-balance if multiple routes have the same cost.
     >>
     >> However, in our case, we'd want to go for active/active firewalls (which of course are also routers).
     >> But that means we have internal machines on one side, which use a single default gateway (per VLAN),
     >> then our active/active firewall, and then the outside world (actually a PtP connection to an upstream router).
     >>
     >> Can Quagga help me to actively use both firewalls in a load-balancing and redundant way?
     >> The idea here is that the upstream router has high bandwidth, so using more than one firewall allows us to achieve better throughput,
     >> and with active/active we'd also strive for redundancy (i.e. reduced throughput if one firewall fails).
     >> To my understanding, OSPF / Quagga could do this if the firewalls are placed between routers also joining via OSPF.
     >> But is there also a way to have the clients directly talk to our firewalls, and the firewalls to a single upstream router (which we don't control)?
     >>
     >> A simple drawing may help:
     >>
     >>                ____  FW A ____
     >>               /               \
     >> Client(s) --                 --PtP-- upstream router
     >>               \____  FW B ____/
     >>
     >> This is why I thought about using xt_cluster and giving both FW A and FW B the very same IP (the default gateway of the clients)
     >> and the very same MAC at the same time, so the switch duplicates the packets, and then FW A accepts some packets and FW B the remaining ones
     >> via filtering with xt_cluster.
     >>
     >> Can Quagga do something in this picture, or simplify this picture?
     >> The upstream router also sends all incoming packets to a single IP in the PtP network, i.e. the firewall nodes need to show up as "one converged system"
     >> to both the clients on one side and the upstream router on the other side.
     >
     >
     > I understand what you are shooting for, but it's dangerous at those
     > data rates and not achievable via stock existing software.
     > I did write some POC code years ago for a previous employer, but
     > determined it was too dangerous to put into production without some
     > massive kernel changes, such as using something like RDMA over
     > dedicated high-speed interfaces or linking the systems over the PCI
     > Express buses to sync the states instead of using conntrackd.
     >
     > So load balancing is a better choice in this case, and many mid- to
     > higher-end managed switches that have routing built in can do OSPF.
     > I've seen many stackable switches that can do it. By the way, Quagga
     > supports several other dynamic routing protocols, not just OSPF.

    Thanks, now I understand your answer much better — the classical case of intention getting lost between the lines.
    Indeed, this is important experience, many thanks for sharing it!

    I was already unsure if with such a solution I could really expect to achieve these data rates,
    so this warning is worth its weight in gold.
    I'll still play around with this setup in the lab, but testing at scale is also not easy
    (for us) in the lab, so again this warning is very useful and we won't take this into production.

    The problem which made me think about all this is that we don't have control of the upstream router.
    That made me hope for a solution which does not require changes on that end.
    But of course we can communicate with the operators
    and see if we can find a way to use dynamic routing on that end.
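
    Just to have something concrete for that discussion, here is a rough sketch of how this
    might look (all addresses are only placeholders, not our actual setup): each firewall
    could run Quagga's ospfd with identical costs, so that the neighbouring routers learn
    two equal-cost next hops and spread the traffic across both firewalls:

        ! /etc/quagga/ospfd.conf (identical on both firewalls except for the router-id,
        ! which would be e.g. 192.0.2.12 on the second firewall)
        router ospf
         ospf router-id 192.0.2.11
         ! client-facing subnet:
         network 192.0.2.0/24 area 0.0.0.0
         ! PtP subnet towards the upstream router:
         network 198.51.100.0/30 area 0.0.0.0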

     > The safest and easiest option for you would be to use a 100 Gbps fibre
     > connection instead, possibly with direct-attach cables if you want to
     > save on optics, and do primary/secondary failover.

    Sadly, the infrastructure further upstream is not yet upgraded to support 100 Gb/s (and will not be in the near future),
    otherwise, this surely would have been the easier option.

     >>> I will start with Quagga because the bad documentation part is easy to cover.
     >>> In the Quagga documentation they recommend that you put a routable IP
     >>> on a loopback interface and attach the Quagga daemon for the dynamic
     >>> routing protocol of your choice to it. That works fine on BSD and old
     >>> versions of Linux from 20 years ago, but anything running a Linux
     >>> kernel version of 2.4 or higher will not allow it unless you change
     >>> settings in /etc/sysctl.conf, and the Quagga documentation tells you to
     >>> make those changes. DO NOT DO WHAT THEY SAY, it's wrong and dangerous.
     >>> Instead create a "dummy" interface with a routable IP for this
     >>> purpose. A dummy interface is a special kind of interface meant for
     >>> exactly this scenario and works well without compromising the
     >>> security of your firewall.
     >>
     >> Thanks for this helpful advice!
     >> Even though I am not sure yet Quagga will help me out in this picture,
     >> I am now already convinced we will have a situation in which Quagga will help us out.
     >> So this is noted down for future use :-).
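
    For the record, creating such a dummy interface would look roughly like this
    (the address is only a placeholder):

        # create a dummy interface and put the routable IP for the routing daemon on it
        modprobe dummy
        ip link add dummy0 type dummy
        ip addr add 192.0.2.11/32 dev dummy0
        ip link set dummy0 up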
     >>
     >>> Keepalived:
     >>> the main error in keepalived's documentation is that most of the
     >>> documentation and howtos you will find about it on the web are based
     >>> on a 15-year-old howto which had a fundamental mistake in how VRRP
     >>> works and what the "state" flag actually does, because it's not
     >>> explained well in the man page. "state" in a "vrrp_instance" should
     >>> always be set to "MASTER" on all nodes, and the priority should be used
     >>> to determine which node should be the preferred master. The only time
     >>> you should ever set state to "BACKUP" is if you have a third machine
     >>> that you never want to become the master, which you are just using for
     >>> quorum, and in that case its priority should also be set to "0"
     >>> (failed). Setting the state to "BACKUP" will seem to work fine until
     >>> you have a failover event, when the interface will continually go up
     >>> and down on the backup node. On the MAC address issue, keepalived will
     >>> ARP-ping the subnets it's attached to, so that's generally not an issue,
     >>> but I would recommend using VMACs (virtual MAC addresses), assuming
     >>> the kernel for your distro and your network cards support it, because
     >>> that way it just looks to the switch like the MAC moved to a different
     >>> port due to some physical topology change, and switches usually handle
     >>> that very gracefully, but don't always handle a MAC address change for
     >>> an IP address as quickly.
     >>> I also recommend reading the RFCs on VRRP, particularly the parts that
     >>> explain how the elections and priorities work; they are a quick and
     >>> easy read and will really give you a good idea of how to configure
     >>> keepalived properly to achieve the failover and recovery behavior you
     >>> want.
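
    A minimal sketch of what that might translate to in keepalived.conf (interface,
    VRID and addresses are only placeholders; the second firewall would differ only
    in its priority, and a quorum-only third box would instead use state BACKUP with
    priority 0 as described above):

        vrrp_instance GW_INTERNAL {
            # MASTER on both real nodes, as recommended above
            state MASTER
            interface eth0
            virtual_router_id 51
            # e.g. 100 on the second firewall, making this node the preferred master
            priority 150
            advert_int 1
            # virtual MAC, if the kernel and NIC support it
            use_vmac
            virtual_ipaddress {
                # the clients' default gateway
                192.0.2.1/24
            }
        }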
     >>
     >> See above on the virtual MACs — if the clients should use both firewalls at the same time,
     >> I think I'd need a single MAC for both, so the clients only see a single default gateway.
     >> In a more classic setup, we've used pcs (pacemaker and corosync) to successfully migrate virtual IPs and MAC addresses.
     >> It has worked quite reliably (using Kronosnet for communication).
     >> But we've also used Keepalived some years ago successfully :-).
     >>
     >>> On the hardware topology:
     >>> I recommend using dedicated interfaces for conntrackd; really, you don't
     >>> need anything faster than 100 Mbps even if the data interfaces are
     >>> 100 Gbps, but I usually use 1 Gbps interfaces for this. They can be on
     >>> their own dedicated switches or crossover interfaces. The main concern
     >>> here is securely handling a large number of tiny packets, so having
     >>> dedicated network card buffers to handle microbursts is useful, and if
     >>> you can avoid latency from a switch that's trying to be too smart for
     >>> its own good, that's for the best.
     >>
     >> Indeed, we have a 1 Gb/s crossover link, and use a 1 Gb/s connection through a switch in case this should ever fail for some reason —
     >> we use these links both for conntrackd and for Kronosnet communication by corosync.
     >>
     >>> For keepalived, use dedicated VLANs on each physical interface to
     >>> handle the heartbeats, and group the VRRP instances to ensure the
     >>> failovers of the IPs on both sides are handled correctly.
     >>> If you only have 2 firewalls, I recommend using an additional device
     >>> on each side for quorum in a backup/failed mode as described above.
     >>> Assuming a 1 second or greater interval, the device could be something
     >>> as simple as a Raspberry Pi; it really doesn't need to be anything
     >>> powerful because it's just adding a heartbeat to the cluster, but for
     >>> sub-second intervals you may need something more powerful, because
     >>> sub-second intervals can eat a surprising amount of CPU.
     >>
     >> We currently went without an external third party and let corosync/pacemaker use a STONITH device to explicitly kill the other node
     >> and establish a defined state if heartbeats get lost. We might think about a third machine at some point to get an actual quorum, indeed.
     >
     >
     > I get why you might think to use corosync/pacemaker for this if you
     > weren't familiar with keepalived and LVS in the kernel, but it's
     > hammering a square peg into a round hole when you have a perfectly
     > shaped and sized peg available to you that's actually been around a
     > lot longer and works a lot more predictably, faster and more reliably,
     > by leveraging parts of the kernel's network stack designed specifically
     > for this use case. I've done explicit kills of the other device via
     > cross-connected hardware watchdog devices driven by keepalived before,
     > and it was easy.
     > By the way, if you don't know what LVS is: it's the kernel's builtin
     > layer 3 network load balancer stack that was designed with these kinds
     > of failover scenarios in mind. Keepalived is just a wrapper around LVS
     > that adds VRRP-based heartbeating, hooks that allow you to call
     > external scripts for actions based on heartbeat state change events,
     > and additional watchdog scripts which can also trigger state changes.
     > To be clear, I wouldn't use keepalived to handle process master/slave
     > failovers; I would use corosync and pacemaker, or in some cases
     > Clusterd, for that, because they are usually the right tool for the job,
     > but for firewall and/or network load balancer failover I would always
     > use keepalived because it's the right tool for that job.
     >
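
    (Purely to illustrate the LVS layer mentioned above, not something needed for the
    firewall failover itself: a hand-configured LVS virtual service with two real servers
    could look like the following, with placeholder addresses; keepalived's virtual_server
    blocks program the equivalent state into the kernel.)

        # virtual TCP service on 192.0.2.10:80, weighted least-connection scheduling
        ipvsadm -A -t 192.0.2.10:80 -s wlc
        # two real servers behind it, forwarded via NAT (masquerading)
        ipvsadm -a -t 192.0.2.10:80 -r 10.0.0.11:80 -m
        ipvsadm -a -t 192.0.2.10:80 -r 10.0.0.12:80 -m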

    Our main reasoning for corosync/pacemaker was that we've used it for the predecessor setup quite successfully for ~7 years,
    while we have only used keepalived in smaller configurations (but it also served us well).
    You raise many valid points, so even though pacemaker/corosync has not disappointed us (as of yet), we might indeed reconsider this decision.

    Cheers and thanks,
             Oliver

     >
     >>
     >> Cheers and thanks again,
     >>          Oliver
     >>
     >>>
     >>>
     >>> On Sun, May 9, 2021 at 3:16 PM Oliver Freyermuth
     >>> <freyermuth@xxxxxxxxxxxxxxxxxx> wrote:
     >>>>
     >>>> Dear netfilter experts,
     >>>>
     >>>> we are trying to set up an active/active firewall, making use of "xt_cluster".
     >>>> We can configure the switch to act like a hub, i.e. both machines can share the same MAC and IP and get the same packets without additional arptables tricks.
     >>>>
     >>>> So we set rules like:
     >>>>
     >>>>     iptables -I PREROUTING -t mangle -i external_interface -m cluster --cluster-total-nodes 2 --cluster-local-node 1 --cluster-hash-seed 0xdeadbeef -j MARK --set-mark 0xffff
     >>>>     iptables -A PREROUTING -t mangle -i external_interface -m mark ! --mark 0xffff -j DROP
     >>>>
     >>>> Ideally, we'd love to have the possibility to scale this to more than two nodes, but let's stay with two for now.
     >>>>
     >>>> Basic tests show that this works as expected, but the details get messy.
     >>>>
     >>>> 1. Certainly, conntrackd is needed to synchronize connection states.
     >>>>       But is it always "fast enough"?
     >>>>       xt_cluster seems to match by the src_ip of the original direction of the flow[0] (if I read the code correctly),
     >>>>       but what happens if the reply to an outgoing packet arrives at both firewalls before state is synchronized?
     >>>>       We are currently using conntrackd in FTFW mode with a direct link, set "DisableExternalCache", and additionally set "PollSecs 15", since without that it seems
     >>>>       only new and destroyed connections are synced, but lifetime updates for existing connections do not propagate without polling (a config sketch follows further below).
     >>>>       Maybe another way, which e.g. may use XOR(src,dst), might work around tight synchronization requirements, or is it possible to always use the "internal" source IP?
     >>>>       Is anybody doing that with a custom BPF?
     >>>>
     >>>> 2. How to do failover in such cases?
     >>>>       For failover we'd need to change these rules (if one node fails, the total-nodes will change).
     >>>>       As an alternative, I found [1] which states multiple rules can be used and enabled/disabled (see also the rule sketch further below),
     >>>>       but does somebody know of a cleaner (and easier to read) way that also doesn't cost extra performance?
     >>>>
     >>>> 3. We have several internal networks, which need to talk to each other (partially with firewall rules and NATting),
     >>>>       so we'd also need similar rules there, complicating things more. That's why a cleaner way would be very welcome :-).
     >>>>
     >>>> 4. Another point is how to actually perform the failover. Classical cluster suites (corosync + pacemaker)
     >>>>       are rather used to migrate services, but not to communicate node IDs and the total number of active nodes.
     >>>>       They can probably be tricked into doing that somehow, but they are not designed this way.
     >>>>       TIPC may be something to use here, but I found nothing "ready to use".
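
    Regarding question 1, the sync setup described there corresponds roughly to the
    following conntrackd.conf fragment (addresses and interface are placeholders for
    the dedicated crossover link; written from memory, so please double-check against
    the conntrackd.conf(5) man page):

        Sync {
            Mode FTFW {
                DisableExternalCache On
            }
            UDP {
                # this node, on the dedicated crossover link
                IPv4_address 192.168.100.1
                # the peer firewall
                IPv4_Destination_Address 192.168.100.2
                Port 3780
                Interface eth2
            }
        }
        General {
            # lifetime updates of existing connections only seemed to propagate with polling
            PollSecs 15
        }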
     >>>>
     >>>> You may also tell me there's a better way to do this than use xt_cluster (custom BPF?) — we've up to now only done "classic" active/passive setups,
     >>>> but maybe someone on this list has already done active/active without commercial hardware, and can share experience from this?
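
    Regarding question 2, one (untested) idea for the failover itself: keep
    --cluster-total-nodes at 2 so the hash layout of existing flows does not change,
    and let the surviving node claim both hash buckets via the node mask, e.g. from a
    keepalived or pacemaker notify script. This assumes the cluster rule is rule 1 in
    the mangle PREROUTING chain and that the installed iptables supports
    --cluster-local-nodemask:

        # on the surviving node: take over both buckets (0x3 = node 1 + node 2)
        iptables -R PREROUTING 1 -t mangle -i external_interface \
            -m cluster --cluster-total-nodes 2 --cluster-local-nodemask 0x3 \
            --cluster-hash-seed 0xdeadbeef -j MARK --set-mark 0xffff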
     >>>>
     >>>> Cheers and thanks in advance,
     >>>>           Oliver
     >>>>
     >>>> PS: Please keep me in CC, I'm not subscribed to the list. Thanks!
     >>>>
     >>>> [0] https://github.com/torvalds/linux/blob/10a3efd0fee5e881b1866cf45950808575cb0f24/net/netfilter/xt_cluster.c#L16-L19
     >>>> [1] https://lore.kernel.org/netfilter-devel/499BEBBF.7080705@xxxxxxxxxxxxx/
     >>>>
     >>>> --
     >>>> Oliver Freyermuth
     >>>> Universität Bonn
     >>>> Physikalisches Institut, Raum 1.047
     >>>> Nußallee 12
     >>>> 53115 Bonn
     >>>> --
     >>>> Tel.: +49 228 73 2367
     >>>> Fax:  +49 228 73 7869
     >>>> --
     >>>>
     >>
     >>
     >> --
     >> Oliver Freyermuth
     >> Universität Bonn
     >> Physikalisches Institut, Raum 1.047
     >> Nußallee 12
     >> 53115 Bonn
     >> --
     >> Tel.: +49 228 73 2367
     >> Fax:  +49 228 73 7869
     >> --
     >>


    --
    Oliver Freyermuth
    Universität Bonn
    Physikalisches Institut, Raum 1.047
    Nußallee 12
    53115 Bonn
    --
    Tel.: +49 228 73 2367
    Fax:  +49 228 73 7869
    --



--
Oliver Freyermuth
Universität Bonn
Physikalisches Institut, Raum 1.047
Nußallee 12
53115 Bonn
--
Tel.: +49 228 73 2367
Fax:  +49 228 73 7869
--
