Re: [PATCH v2] network: add rule to nftables backend that zeroes checksum of DHCP responses

Daniel P. Berrangé <berrange@xxxxxxxxxx> · Wed, 30 Oct 2024 08:43:58 +0000

On Tue, Oct 29, 2024 at 11:21:36PM -0400, Laine Stump wrote:
> On 10/29/24 3:41 PM, Phil Sutter wrote:
> > On Tue, Oct 29, 2024 at 05:36:02PM +0000, Daniel P. Berrangé wrote:
> > > On Tue, Oct 29, 2024 at 06:29:55PM +0100, Phil Sutter wrote:
> > > > On Tue, Oct 29, 2024 at 03:38:08PM +0000, Daniel P. Berrangé wrote:
> > > > > On Tue, Oct 29, 2024 at 04:12:16PM +0100, Phil Sutter wrote:
> > > > > > Hi,
> > > > > > 
> > > > > > On Tue, Oct 29, 2024 at 09:30:27AM -0400, Laine Stump wrote:
> > > > > > > So when the extra rules are removed, then those same guests begin
> > > > > > > working? (You can easily remove the checksum rules with:
> > > > > > > 
> > > > > > >     nft delete chain ip libvirt_network postroute_mangle
> > > > > > > 
> > > > > > > BTW, I just now tried an e1000e NIC on Fedora guest and it continues to
> > > > > > > work with the 0-checksum rules removed. In this case tcpdump on virbr0
> > > > > > > shows "bad cksum", but when I look at tcpdump on the guest, it shows
> > > > > > > "udp cksum ok" though, so something else somewhere is setting the
> > > > > > > checksum to the correct value.
> > > > > > 
> > > > > > FWIW, I just tested an alternative workaround using tc. This works for
> > > > > > me with a FreeBSD guest and NIC switched to either e1000 or virtio:
> > > > > > 
> > > > > > # tc qd add dev vnetbr0 root handle 1: htb
> > > > > > # tc filter add dev vnetbr0 prio 1 protocol ip parent 1: \
> > > > > > 	u32 match ip sport 67 ffff match ip dport 68 ffff \
> > > > > > 	action csum ip and udp
> > > > > 
> > > > > This feels like it is functionally closest to what we've had historically,
> > > > > even though it is annoying to have to deal with 'tc' tool, in addition
> > > > > to 'nft'. So I'm thinking this is probably the way we'll have to go.
> > > > 
> > > > Another ugly detail (inherent to 'tc') is that you have to attach a
> > > > classful qdisc to the interface since otherwise you can't add a filter
> > > > with attached action. While this may not be a problem in practice, there
> > > > is this side-effect of setting up a HTB on the bridge which by default
> > > > runs a "noqueue" qdisc.
> > > 
> > > I'm not that familiar with 'tc'.
> > > 
> > > Can you explain the functional effect of those 'qdisc' settings on
> > > virbr0, as if I know nothing :-)
> > 
> > 'tc' controls QoS in Linux. 'qdisc's define how congestion should be
> > handled (basically: queue or drop, prioritization, etc). The default
> > qdisc for virtual devices like bridge or veth is "noqueue" - it sets the
> > device's 'enqueue' callback to NULL and __dev_queue_xmit()[1] treats it
> > accordingly (calls dev_hard_start_xmit() after a few checks to make sure
> > the device is working).
> > 
> > HTB is a container of classes (for packet classification) which
> > themselves hold qdiscs. On my system at least, it doesn't come with
> > default classes and thus should not do much by itself (apart from
> > running the filter for us which we want. Anything else is overhead.
> > 
> > I'm not sure how much detail you need - "as if I know nothing" is a bit
> > like naively typing 'find /' and wondering when it will end. ;)
> > Please shoot if you need more details. For the time being, let me point
> > at some howto[2] I wrote long ago.
> > 
> > > > > > Another alternative might be to add the nftables rule for virtio-based
> > > > > > guests only.
> > > > > 
> > > > > The firewall rules are in a chain that's applied to all guests,
> > > > > so we have no where to add a per-guest rule.
> > > > 
> > > > With nftables, you may create a chain in netdev family which binds to
> > > > the specific device(s) needing the hack. It needs maintenance after
> > > > guest startup and shutdown, though.
> > > > 
> > > > BTW: libvirt supports configurations which don't involve a 'vnetbr0'
> > > > bridge. In this case, you will have to setup tc on the actual tap
> > > > device, right?
> > > 
> > > In those cases, we haven't historically set firewall rules, so
> > > users were on their own, so in that sense, it isn't a regression
> > > we need to solve. Also in those cases, the DHCP daemon would be
> > > off-host, and so packets we're getting back from it ought to
> > > have a checksum present, as they've been over a  physical link.
> > 
> > OK. From my perspective, attaching the tc qdisc/filter/action to
> > individual guest devices would still be the cleaner solution. If there's
> > no mechanism to attach this to, it might be easier to just stick
> > everything to the bridge, of course.
> 
> 
> We do already use tc for bandwidth control, and when a <bandwidth> element
> is present in the interface (or the network) we run some tc commands as a
> port is added to a network to (I *think*) reserve a portion of the bridge's
> bandwidth for the new interface controls, and when a port is deleted we
> again run some tc commands to remove it. (mprivozn added all of this and so
> therefore knows the most about it)
> 
> However, the tc commands on the network side (during the CreatePort API) I
> believe are done with only the network's bridge + the MAC address of the
> guest's NIC (and a "class_id" is created and sent back to QEMU and is there
> I guess used for some *other* tc commands to setup bandwidth upper limits
> for the tap after it's created.)
> 
> More significantly, the tap device hasn't even been created yet at the time
> QEMU allocates the port from the network driver, so we don't even have a
> "name of future tap device" that we could send in the NetworkPortCreate API
> XML.
> 
> So, I guess what all that means is that having the network driver run a
> tap-device-specific tc command won't be that simple. (Maybe we could add
> "virNetworkPortStart|Stop" APIs or something)

I would expect 'tc' rules to be set against virbr0, not the individual
NICs.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|