On Tue, Oct 29, 2024 at 11:21:36PM -0400, Laine Stump wrote:
> On 10/29/24 3:41 PM, Phil Sutter wrote:
> > On Tue, Oct 29, 2024 at 05:36:02PM +0000, Daniel P. Berrangé wrote:
> > > On Tue, Oct 29, 2024 at 06:29:55PM +0100, Phil Sutter wrote:
> > > > On Tue, Oct 29, 2024 at 03:38:08PM +0000, Daniel P. Berrangé wrote:
> > > > > On Tue, Oct 29, 2024 at 04:12:16PM +0100, Phil Sutter wrote:
> > > > > > Hi,
> > > > > >
> > > > > > On Tue, Oct 29, 2024 at 09:30:27AM -0400, Laine Stump wrote:
> > > > > > > So when the extra rules are removed, then those same guests begin
> > > > > > > working? (You can easily remove the checksum rules with:
> > > > > > >
> > > > > > >   nft delete chain ip libvirt_network postroute_mangle
> > > > > > >
> > > > > > > BTW, I just now tried an e1000e NIC on Fedora guest and it continues to
> > > > > > > work with the 0-checksum rules removed. In this case tcpdump on virbr0
> > > > > > > shows "bad cksum", but when I look at tcpdump on the guest, it shows
> > > > > > > "udp cksum ok" though, so something else somewhere is setting the
> > > > > > > checksum to the correct value.
> > > > > >
> > > > > > FWIW, I just tested an alternative workaround using tc. This works for
> > > > > > me with a FreeBSD guest and NIC switched to either e1000 or virtio:
> > > > > >
> > > > > > # tc qd add dev vnetbr0 root handle 1: htb
> > > > > > # tc filter add dev vnetbr0 prio 1 protocol ip parent 1: \
> > > > > >     u32 match ip sport 67 ffff match ip dport 68 ffff \
> > > > > >     action csum ip and udp
> > > > >
> > > > > This feels like it is functionally closest to what we've had historically,
> > > > > even though it is annoying to have to deal with 'tc' tool, in addition
> > > > > to 'nft'. So I'm thinking this is probably the way we'll have to go.
> > > >
> > > > Another ugly detail (inherent to 'tc') is that you have to attach a
> > > > classful qdisc to the interface since otherwise you can't add a filter
> > > > with attached action.
> > > > While this may not be a problem in practice, there
> > > > is this side-effect of setting up a HTB on the bridge which by default
> > > > runs a "noqueue" qdisc.
> > >
> > > I'm not that familiar with 'tc'.
> > >
> > > Can you explain the functional effect of those 'qdisc' settings on
> > > virbr0, as if I know nothing :-)
> >
> > 'tc' controls QoS in Linux. 'qdisc's define how congestion should be
> > handled (basically: queue or drop, prioritization, etc). The default
> > qdisc for virtual devices like bridge or veth is "noqueue" - it sets the
> > device's 'enqueue' callback to NULL and __dev_queue_xmit()[1] treats it
> > accordingly (calls dev_hard_start_xmit() after a few checks to make sure
> > the device is working).
> >
> > HTB is a container of classes (for packet classification) which
> > themselves hold qdiscs. On my system at least, it doesn't come with
> > default classes and thus should not do much by itself (apart from
> > running the filter for us which we want. Anything else is overhead.
> >
> > I'm not sure how much detail you need - "as if I know nothing" is a bit
> > like naively typing 'find /' and wondering when it will end. ;)
> > Please shoot if you need more details. For the time being, let me point
> > at some howto[2] I wrote long ago.
> >
> > > > > > Another alternative might be to add the nftables rule for virtio-based
> > > > > > guests only.
> > > > >
> > > > > The firewall rules are in a chain that's applied to all guests,
> > > > > so we have no where to add a per-guest rule.
> > > >
> > > > With nftables, you may create a chain in netdev family which binds to
> > > > the specific device(s) needing the hack. It needs maintenance after
> > > > guest startup and shutdown, though.
> > > >
> > > > BTW: libvirt supports configurations which don't involve a 'vnetbr0'
> > > > bridge. In this case, you will have to setup tc on the actual tap
> > > > device, right?
> > >
> > > In those cases, we haven't historically set firewall rules, so
> > > users were on their own, so in that sense, it isn't a regression
> > > we need to solve. Also in those cases, the DHCP daemon would be
> > > off-host, and so packets we're getting back from it ought to
> > > have a checksum present, as they've been over a physical link.
> >
> > OK. From my perspective, attaching the tc qdisc/filter/action to
> > individual guest devices would still be the cleaner solution. If there's
> > no mechanism to attach this to, it might be easier to just stick
> > everything to the bridge, of course.
> >
> We do already use tc for bandwidth control, and when a <bandwidth> element
> is present in the interface (or the network) we run some tc commands as a
> port is added to a network to (I *think*) reserve a portion of the bridge's
> bandwidth for the new interface controls, and when a port is deleted we
> again run some tc commands to remove it. (mprivozn added all of this and so
> therefore knows the most about it)
>
> However, the tc commands on the network side (during the CreatePort API) I
> believe are done with only the network's bridge + the MAC address of the
> guest's NIC (and a "class_id" is created and sent back to QEMU and is there
> I guess used for some *other* tc commands to setup bandwidth upper limits
> for the tap after it's created.)
>
> More significantly, the tap device hasn't even been created yet at the time
> QEMU allocates the port from the network driver, so we don't even have a
> "name of future tap device" that we could send in the NetworkPortCreate API
> XML.
>
> So, I guess what all that means is that having the network driver run a
> tap-device-specific tc command won't be that simple. (Maybe we could add
> "virNetworkPortStart|Stop" APIs or something)

I would expect 'tc' rules to be set against virbr0, not the individual NICs.
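
As a rough, untested sketch of that: the helper below only prints the two tc
commands Phil posted earlier in the thread, with the device name swapped for
the bridge. It does not run them. The function name csum_fixup_cmds is made up
for illustration, 'tc qd' is spelled out as 'tc qdisc', and virbr0 is assumed
to be the bridge of the network in question.

```shell
#!/bin/sh
# Hypothetical helper (not part of libvirt): print, without running them,
# the two tc commands from earlier in the thread for a given bridge device.
csum_fixup_cmds() {
    dev="$1"
    # A classful root qdisc is needed before a filter with an action can be attached.
    printf 'tc qdisc add dev %s root handle 1: htb\n' "$dev"
    # Match packets from port 67 to port 68 (DHCP server replies) and
    # recompute their IP and UDP checksums.
    printf 'tc filter add dev %s prio 1 protocol ip parent 1: u32 match ip sport 67 ffff match ip dport 68 ffff action csum ip and udp\n' "$dev"
}

# Assumption: virbr0 is the bridge of the libvirt network in question.
csum_fixup_cmds virbr0
```

Piping the output to 'sh' as root would apply the commands. Deleting the root
qdisc again with 'tc qdisc del dev virbr0 root' should take the filter with it.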

With regards,
Daniel
-- 
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|