Re: Running an active/active firewall/router (xt_cluster?)

Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx> · Tue, 11 May 2021 00:58:06 +0200

Hi,

many thanks for this elaborate reply!

Am 11.05.21 um 00:19 schrieb Pablo Neira Ayuso:
Hi,

On Sun, May 09, 2021 at 07:52:27PM +0200, Oliver Freyermuth wrote:
Dear netfilter experts,

we are trying to setup an active/active firewall, making use of
"xt_cluster".  We can configure the switch to act like a hub, i.e.
both machines can share the same MAC and IP and get the same packets
without additional ARPtables tricks.

So we set rules like:

  iptables -I PREROUTING -t mangle -i external_interface -m cluster --cluster-total-nodes 2 --cluster-local-node 1 --cluster-hash-seed 0xdeadbeef -j MARK --set-mark 0xffff
  iptables -A PREROUTING -t mangle -i external_interface -m mark ! --mark 0xffff -j DROP

I'm attaching an old script to set up active-active I remember to have
used time ago, I never found the time to upstream this.

this is really helpful indeed.
While we use Shorewall (which simplifies many things, but has no abstraction for xt_cluster as far as I am aware of),
it helps to see all rules written up together to translate them for Shorewall, and also the debugging rules are very helpful.

Ideally, it we'd love to have the possibility to scale this to more
than two nodes, but let's stay with two for now.

IIRC, up to two nodes should be easy with the existing codebase. To
support more than 2 nodes, conntrackd needs to be extended, but it
should be doable.

Basic tests show that this works as expected, but the details get messy.

1. Certainly, conntrackd is needed to synchronize connection states.
    But is it always "fast enough"?  xt_cluster seems to match by the
    src_ip of the original direction of the flow[0] (if I read the code
    correctly), but what happens if the reply to an outgoing packet
    arrives at both firewalls before state is synchronized?

You can avoid this by setting DisableExternalCache to off. Then, in
case one of your firewall node goes off, update the cluster rules and
inject the entries (via keepalived, or your HA daemon of choice).

Recommended configuration is DisableExternalCache off and properly
configure your HA daemon to assist conntrackd. Then, the conntrack
entries in the "external cache" of conntrackd are added to the kernel
when needed.

You caused a classic "facepalming" moment. Of course, that will solve (1)
completely. My initial thinking when disabling the external cache
was before I understood how xt_cluster works, and before I found that it uses the direction
of the flow, and then it just escaped my mind.
Thanks for clearing this up! :-)

    We are currently using conntrackd in FTFW mode with a direct
    link, set "DisableExternalCache", and additonally set "PollSecs
    15" since without that it seems only new and destroyed
    connections are synced, but lifetime updates for existing
    connections do not propagate without polling.

No need to set on PollSecs. Polling should be disabled. Did you enable
event filtering? You should synchronize receive update too. Could you
post your configuration file?

Sure, it's attached — I'm doing event filtering, but only by address and protocol,
not by flow state, so I thought it to be harmless in this regard.
For my test, I just sent a continuous ICMP through the node,
and the flow itself was synced fine, but then the lifetime was not updated on the partner node unless polling was active,
and finally the flow was removed on the partner machine (lifetime expired) while it was being kept alive by updates
on the primary node.

This was with "DisableExternalCache on", on a CentOS 8.2 node, i.e.:
  Kernel 4.18.0-193.19.1.el8_2.x86_64
  conntrackd v1.4.4

[...]
2. How to do failover in such cases?
    For failover we'd need to change these rules (if one node fails,
    the total-nodes will change).  As an alternative, I found [1]
    which states multiple rules can be used and enabled / disabled,
    but does somebody know of a cleaner (and easier to read) way,
    also not costing extra performance?

If you use iptables, you'll have to update the rules on failure as you
describe. What performance cost are you refering to?

This was based on your comment here:
 https://lore.kernel.org/netfilter-devel/499BEBBF.7080705@xxxxxxxxxxxxx/

But probably, this is indeed premature thinking on my end —
with two firewalls, having two rules after failover should have even less impact than what you measured there.
I still think something like the /proc interface you described there would be cleaner, but I also don't know of a failover daemon
which could make use of it.

3. We have several internal networks, which need to talk to each
    other (partially with firewall rules and NATting), so we'd also need
    similar rules there, complicating things more. That's why a cleaner
    way would be very welcome :-).

Cleaner way, it should be possible to simplify this setup with
nftables.

Since we currently use Shorewall as simplification layer (which eases many things by its abstraction,
but still uses iptables behind the scenes), it's probably best for sanity not to mix here.
So the less "clean" way is likely the easier one for now.

4. Another point is how to actually perform the failover. Classical
    cluster suites (corosync + pacemaker) are rather used to migrate
    services, but not to communicate node ids and number of total active
    nodes.  They can probably be tricked into doing that somehow, but
    they are not designed this way.  TIPC may be something to use here,
    but I found nothing "ready to use".

I have used keepalived in the past with very simple configuration
files, and use their shell script API to interact with conntrackd.
I did not spend much time on corosync/pacemaker so far.

I was mostly thinking about the cluster rules —
I'd love to have a daemon which could adjust cluster-total-nodes and cluster-local-nodes,
instead of having two rules on one firewall when the other fails.

I think I can make the latter work with pacemaker/corosync, and also have it support conntrackd, though,
it might be fiddly, but should be doable.

Many thanks for the elaborate answer,
	Oliver

--
Oliver Freyermuth
Universität Bonn
Physikalisches Institut, Raum 1.047
Nußallee 12
53115 Bonn
--
Tel.: +49 228 73 2367
Fax:  +49 228 73 7869
--
# See also: http://conntrack-tools.netfilter.org/support.html
# 
# There are 3 different modes of running conntrackd: "alarm", "notrack" and "ftfw"
#
# The default package ships with a FTFW configuration, see /usr/share/doc/conntrackd*
# for example configurations for other modes.

#
# Synchronizer settings
#
Sync {
	Mode FTFW {
		#
		# Size of the resend queue (in objects). This is the maximum
		# number of objects that can be stored waiting to be confirmed
		# via acknoledgment. If you keep this value low, the daemon
		# will have less chances to recover state-changes under message
		# omission. On the other hand, if you keep this value high,
		# the daemon will consume more memory to store dead objects.
		# Default is 131072 objects.
		#
		# ResendQueueSize 131072

		#
		# This parameter allows you to set an initial fixed timeout
		# for the committed entries when this node goes from backup
		# to primary. This mechanism provides a way to purge entries
		# that were not recovered appropriately after the specified
		# fixed timeout. If you set a low value, TCP entries in
		# Established states with no traffic may hang. For example,
		# an SSH connection without KeepAlive enabled. If not set,
		# the daemon uses an approximate timeout value calculation
		# mechanism. By default, this option is not set.
		#
		# CommitTimeout 180

		#
		# If the firewall replica goes from primary to backup,
		# the conntrackd -t command is invoked in the script. 
		# This command schedules a flush of the table in N seconds.
		# This is useful to purge the connection tracking table of
		# zombie entries and avoid clashes with old entries if you
		# trigger several consecutive hand-overs. Default is 60 seconds.
		#
		# PurgeTimeout 60

		# Set the acknowledgement window size. If you decrease this
		# value, the number of acknowlegdments increases. More
		# acknowledgments means more overhead as conntrackd has to
		# handle more control messages. On the other hand, if you
		# increase this value, the resend queue gets more populated.
		# This results in more overhead in the queue releasing.
		# The following value is based on some practical experiments
		# measuring the cycles spent by the acknowledgment handling
		# with oprofile. If not set, default window size is 300.
		#
		# ACKWindowSize 300

		#
		# This clause allows you to disable the external cache. Thus,
		# the state entries are directly injected into the kernel
		# conntrack table. As a result, you save memory in user-space
		# but you consume slots in the kernel conntrack table for
		# backup state entries. Moreover, disabling the external cache
		# means more CPU consumption. You need a Linux kernel
		# >= 2.6.29 to use this feature. By default, this clause is
		# set off. If you are installing conntrackd for first time,
		# please read the user manual and I encourage you to consider
		# using the fail-over scripts instead of enabling this option!
		#
		DisableExternalCache On
	}

	#
	# Multicast IP and interface where messages are
	# broadcasted (dedicated link). IMPORTANT: Make sure
	# that iptables accepts traffic for destination
	# 225.0.0.50, eg:
	#
	#	iptables -I INPUT -d 225.0.0.50 -j ACCEPT
	#	iptables -I OUTPUT -d 225.0.0.50 -j ACCEPT
	#
	#Multicast {
		# 
		# Multicast address: The address that you use as destination
		# in the synchronization messages. You do not have to add
		# this IP to any of your existing interfaces. If any doubt,
		# do not modify this value.
		#
	#	IPv4_address 225.0.0.50

		#
		# The multicast group that identifies the cluster. If any
		# doubt, do not modify this value.
		#
	#	Group 3780

		#
		# IP address of the interface that you are going to use to
		# send the synchronization messages. Remember that you must
		# use a dedicated link for the synchronization messages.
		#
	#	IPv4_interface 192.168.100.100

		#
		# The name of the interface that you are going to use to
		# send the synchronization messages.
		#
	#	Interface eth2

		# The multicast sender uses a buffer to enqueue the packets
		# that are going to be transmitted. The default size of this
		# socket buffer is available at /proc/sys/net/core/wmem_default.
		# This value determines the chances to have an overrun in the
		# sender queue. The overrun results packet loss, thus, losing
		# state information that would have to be retransmitted. If you
		# notice some packet loss, you may want to increase the size
		# of the sender buffer. The default size is usually around
		# ~100 KBytes which is fairly small for busy firewalls.
		#
	#	SndSocketBuffer 1249280

		# The multicast receiver uses a buffer to enqueue the packets
		# that the socket is pending to handle. The default size of this
		# socket buffer is available at /proc/sys/net/core/rmem_default.
		# This value determines the chances to have an overrun in the
		# receiver queue. The overrun results packet loss, thus, losing
		# state information that would have to be retransmitted. If you
		# notice some packet loss, you may want to increase the size of
		# the receiver buffer. The default size is usually around
		# ~100 KBytes which is fairly small for busy firewalls.
		#
	#	RcvSocketBuffer 1249280

		# 
		# Enable/Disable message checksumming. This is a good
		# property to achieve fault-tolerance. In case of doubt, do
		# not modify this value.
		#
	#	Checksum on
	#}
	#
	# You can specify more than one dedicated link. Thus, if one dedicated
	# link fails, conntrackd can fail-over to another. Note that adding
	# more than one dedicated link does not mean that state-updates will
	# be sent to all of them. There is only one active dedicated link at
	# a given moment. The `Default' keyword indicates that this interface
	# will be selected as the initial dedicated link. You can have 
	# up to 4 redundant dedicated links. Note: Use different multicast 
	# groups for every redundant link.
	#
	# Multicast Default {
	#	IPv4_address 225.0.0.51
	#	Group 3781
	#	IPv4_interface 192.168.100.101
	#	Interface eth3
	#	# SndSocketBuffer 1249280
	#	# RcvSocketBuffer 1249280
	#	Checksum on
	# }

	#
	# You can use Unicast UDP instead of Multicast to propagate events.
	# Note that you cannot use unicast UDP and Multicast at the same
	# time, you can only select one.
	# 
	#UDP {
		# 
		# UDP address that this firewall uses to listen to events.
		#
		# IPv4_address 192.168.2.100
		#
		# or you may want to use an IPv6 address:
		#
		# IPv6_address fe80::215:58ff:fe28:5a27

		#
		# Destination UDP address that receives events, ie. the other
		# firewall's dedicated link address.
		#
		# IPv4_Destination_Address 192.168.2.101
		#
		# or you may want to use an IPv6 address:
		#
		# IPv6_Destination_Address fe80::2d0:59ff:fe2a:775c

		#
		# UDP port used
		#
		# Port 3780

		#
		# The name of the interface that you are going to use to
		# send the synchronization messages.
		#
		# Interface eth2

		# 
		# The sender socket buffer size
		#
		# SndSocketBuffer 1249280

		#
		# The receiver socket buffer size
		#
		# RcvSocketBuffer 1249280

		# 
		# Enable/Disable message checksumming. 
		#
		# Checksum on
	# }

	# main connection via crossover cable
	UDP Default {
                IPv4_address 192.168.1.1
                IPv4_Destination_Address 192.168.1.2
                Port 3780
                Interface eno1
                SndSocketBuffer 24985600
                RcvSocketBuffer 24985600
                Checksum on
        }
	# backup via virt network
	UDP {
               IPv4_address 10.160.5.204
               IPv4_Destination_Address 10.160.5.205
               Port 3780
               Interface eno2
               SndSocketBuffer 24985600
               RcvSocketBuffer 24985600
               Checksum on
        }

	# 
	# Other unsorted options that are related to the synchronization.
	# 
	Options {
		#
		# TCP state-entries have window tracking disabled by default,
		# you can enable it with this option. As said, default is off.
		# This feature requires a Linux kernel >= 2.6.36.
		#
		# TCPWindowTracking Off
		TCPWindowTracking On

		#ExpectationSync on
		#ExpectationSync {
		#	h.323
		#}
	}
}

#
# General settings
#
General {
	#
	# Set the nice value of the daemon, this value goes from -20
	# (most favorable scheduling) to 19 (least favorable). Using a
	# very low value reduces the chances to lose state-change events.
	# Default is 0 but this example file sets it to most favourable
	# scheduling as this is generally a good idea. See man nice(1) for
	# more information.
	#
	Nice -20

	#
	# Select a different scheduler for the daemon, you can select between
	# RR and FIFO and the process priority (minimum is 0, maximum is 99).
	# See man sched_setscheduler(2) for more information. Using a RT
	# scheduler reduces the chances to overrun the Netlink buffer.
	#
	# Scheduler {
	#	Type FIFO
	#	Priority 99
	# }

	#
	# Number of buckets in the cache hashtable. The bigger it is,
	# the closer it gets to O(1) at the cost of consuming more memory.
	# Read some documents about tuning hashtables for further reference.
	#
	HashSize 32768

	#
	# Maximum number of conntracks, it should be double of: 
	# $ cat /proc/sys/net/netfilter/nf_conntrack_max
	# since the daemon may keep some dead entries cached for possible
	# retransmission during state synchronization.
	#
	HashLimit 131072

	#
	# Logfile: on (/var/log/conntrackd.log), off, or a filename
	# Default: off
	#
	LogFile on

	#
	# Syslog: on, off or a facility name (daemon (default) or local0..7)
	# Default: off
	#
	#Syslog on

	#
	# Lockfile
	# 
	LockFile /var/lock/conntrack.lock

	#
	# Unix socket configuration
	#
	UNIX {
		Path /var/run/conntrackd.ctl
		Backlog 20
	}

	#
	# Netlink event socket buffer size. If you do not specify this clause,
	# the default buffer size value in /proc/net/core/rmem_default is
	# used. This default value is usually around 100 Kbytes which is
	# fairly small for busy firewalls. This leads to event message dropping
	# and high CPU consumption. This example configuration file sets the
	# size to 2 MBytes to avoid this sort of problems.
	#
	NetlinkBufferSize 2097152

	#
	# The daemon doubles the size of the netlink event socket buffer size
	# if it detects netlink event message dropping. This clause sets the
	# maximum buffer size growth that can be reached. This example file
	# sets the size to 8 MBytes.
	#
	NetlinkBufferSizeMaxGrowth 8388608

	#
	# If the daemon detects that Netlink is dropping state-change events,
	# it automatically schedules a resynchronization against the Kernel
	# after 30 seconds (default value). Resynchronizations are expensive
	# in terms of CPU consumption since the daemon has to get the full
	# kernel state-table and purge state-entries that do not exist anymore.
	# Be careful of setting a very small value here. You have the following
	# choices: On (enabled, use default 30 seconds value), Off (disabled)
	# or Value (in seconds, to set a specific amount of time). If not
	# specified, the daemon assumes that this option is enabled.
	#
	# NetlinkOverrunResync On

	#
	# If you want reliable event reporting over Netlink, set on this
	# option. If you set on this clause, it is a good idea to set off
	# NetlinkOverrunResync. This option is off by default and you need
	# a Linux kernel >= 2.6.31.
	#
	# NetlinkEventsReliable Off
	NetlinkEventsReliable On

	# 
	# By default, the daemon receives state updates following an
	# event-driven model. You can modify this behaviour by switching to
	# polling mode with the PollSecs clause. This clause tells conntrackd
	# to dump the states in the kernel every N seconds. With regards to
	# synchronization mode, the polling mode can only guarantee that
	# long-lifetime states are recovered. The main advantage of this method
	# is the reduction in the state replication at the cost of reducing the
	# chances of recovering connections.
	#
	PollSecs 15

	#
	# The daemon prioritizes the handling of state-change events coming
	# from the core. With this clause, you can set the maximum number of
	# state-change events (those coming from kernel-space) that the daemon
	# will handle after which it will handle other events coming from the
	# network or userspace. A low value improves interactivity (in terms of
	# real-time behaviour) at the cost of extra CPU consumption.
	# Default (if not set) is 100.
	#
	# EventIterationLimit 100

	#
	# Event filtering: This clause allows you to filter certain traffic,
	# There are currently three filter-sets: Protocol, Address and
	# State. The filter is attached to an action that can be: Accept or
	# Ignore. Thus, you can define the event filtering policy of the
	# filter-sets in positive or negative logic depending on your needs.
	# You can select if conntrackd filters the event messages from 
	# user-space or kernel-space. The kernel-space event filtering
	# saves some CPU cycles by avoiding the copy of the event message
	# from kernel-space to user-space. The kernel-space event filtering
	# is prefered, however, you require a Linux kernel >= 2.6.29 to
	# filter from kernel-space. If you want to select kernel-space 
	# event filtering, use the keyword 'Kernelspace' instead of 
	# 'Userspace'.
	#
	Filter From Kernelspace {
		#
		# Accept only certain protocols: You may want to replicate
		# the state of flows depending on their layer 4 protocol.
		#
		Protocol Accept {
			TCP
			SCTP
			DCCP
                        UDP
                        ICMP
                        IPv6-ICMP
			# UDP
			# ICMP # This requires a Linux kernel >= 2.6.31
			# IPv6-ICMP # This requires a Linux kernel >= 2.6.31
		}

		#
		# Ignore traffic for a certain set of IP's: Usually all the
		# IP assigned to the firewall since local traffic must be
		# ignored, only forwarded connections are worth to replicate.
		# Note that these values depends on the local IPs that are
		# assigned to the firewall.
		#
		Address Ignore {
			IPv4_address 127.0.0.1 # loopback
                        IPv4_address 10.160.5.203 # VIP
			IPv4_address 10.160.5.204 # IP FW 1
			IPv4_address 10.160.5.205 # IP FW 2
			IPv4_address 192.168.1.0/24 # Crossover IPs
                        IPv6_address ::1 # loopback
			#IPv4_address 192.168.100.100 # dedicated link ip
			#
			# You can also specify networks in format IP/cidr.
			# IPv4_address 192.168.0.0/24
			#
			# You can also specify an IPv6 address
			# IPv6_address ::1
		}

		#
		# Uncomment this line below if you want to filter by flow state.
		# This option introduces a trade-off in the replication: it
		# reduces CPU consumption at the cost of having lazy backup 
		# firewall replicas. The existing TCP states are: SYN_SENT,
		# SYN_RECV, ESTABLISHED, FIN_WAIT, CLOSE_WAIT, LAST_ACK,
		# TIME_WAIT, CLOSED, LISTEN.
		#
		# State Accept {
		#	ESTABLISHED CLOSED TIME_WAIT CLOSE_WAIT for TCP
		# }
	}
}
Attachment:
smime.p7s

Description: S/MIME Cryptographic Signature