On Tue, Dec 21, 2021 at 02:13:04PM -0800, Maciej Żenczykowski wrote: > > > ad 1) AFAIK if bpf calls bpf_setsockopt on the socket in question, > > > then userspace's view of the socket settings via > > > getsockopt(IP_TOS/IPV6_TCLASS) will also be affected - this may be > > > undesirable (it's technically userspace visible change in behaviour > > > and could, as unlikely as it is, lead to application misbehaviour). > > > This can be worked around via also overriding getsockopt/setsockopt > > > with bpf, but then you need to store the value to return to userspace > > > somewhere... AFAICT it all ends up being pretty ugly and very complex. > > CGROUP_(SET|GET)SOCKOPT is created for that. > > The user's value can be stored in bpf_sk_storage. > > Yes, it can be done, it's very complex to do so. > > The policy can change during run time (indeed that's probably a > relatively likely situation, > network gear notices a new high bandwidth connection and provides out > of band feedback > that it should be using a different dscp code point - we probably > don't want the full policy to > be present in the device because it might be a huge number of entries, > with wildcards). ic. got it. The value is very dynamic, so changing tos/tclass of a sock in a more static hook like bind won't work well. Details like this should have been in the commit message. > > > I wouldn't be worried about needing to override each individual field, > > > as the only other field that looks likely to be potentially beneficial > > > to override would be the ipv6 flowlabel. > > > > > > ad 2) I don't think the bpf_setsockopt(IP_TOS/IPV6_TCLASS) approach > > > works for packets generated via udp sendmsg where cmsg is being used > > > to set tos. > > There is CGROUP_UDP[4|6]_SENDMSG. Right now, it can only change the addr. > > tos/tclass support could be added. > > It could, that doesn't seem easier to do than this approach though. > [ ... ] > > > Technically this could be done by attaching the programs to tc egress > > > instead of the cgroup hook, but then it's per interface, which is > > > potentially the wrong granularity... > > > Right, there is advantage to do it at higher layer, > > and earlier also. > > > > If the tos/tclass value can be changed early on, the correct > > ip[6] header can be written at the beginning instead > > of redoing it later and need to worry about the skb_clone_writable(), > > rewriting it, do iph->check..etc. > > I would indeed like it if we could decouple what userspace wants, > from what the kernel/network actually uses. There would need to be > some sort of bpf hook, > that takes a socket/flow and returns the tos/dscp to actually use > (based on 5-tuple and other information). > > But again, this would be *much* more complex. In terms of the bpf prog doing the dscp-logic, it should be pretty much the same. Getting the 5-tuple, lookup from a map and return the dscp value. > > > As for what is driving this? Upcoming wifi standard to allow access > > > points to inform client devices how to dscp mark individual flows. > > Interesting. > > > > How does the sending host get this dscp value from wifi > > and then affect the dscp of a particular flow? Is the dscp > > going to be stored in a bpf map for the bpf prog to use? > > It gets it out of band via some wifi signaling mechanism. > Tyler probably knows the details. > > Storing flow match information to dscp mapping in a bpf map is indeed the plan. > > > Are you testing on TCP also? > > > > > As for the patch itself, I wonder if the return value shouldn't be > > > reversed, currently '1 if the DS field is set, 0 if it is not set.' > > > But I think returning 0 on success and an error on failure is more in > > > line with what other bpf helpers do? > > > OTOH, it does match bpf_skb_ecn_set_ce() returning 0 on failure... > > > If adding a helper , how about making the bpf_skb_store_bytes() > > available to BPF_PROG_TYPE_CGROUP_SKB? Then it will be > > more flexible to change other fields in the future in > > the network and transport header. > > I assume there's some reason why it's not available? Not that I am aware of. There was just no use case for it. The static case is handled in socket bind/creation time. The more dynamic case is handled in the udp_sendmsg which so far the use case is changing the address only. I don't mind adding bpf_skb_store_bytes() which could be useful for other header's fields.