Re: [PATCH] Bpf Helper Function BPF_FUNC_skb_change_dsfield

Martin KaFai Lau <kafai@xxxxxx> · Tue, 21 Dec 2021 13:52:27 -0800

On Tue, Dec 21, 2021 at 11:46:40AM -0800, Maciej Żenczykowski wrote:
> On Tue, Dec 21, 2021 at 11:16 AM Tyler Wear (QUIC)
> <quic_twear@xxxxxxxxxxx> wrote:
> > > On Mon, Dec 20, 2021 at 07:18:42PM -0800, Yonghong Song wrote:
> > > > On 12/20/21 12:40 PM, Tyler Wear wrote:
> > > > > New bpf helper function BPF_FUNC_skb_change_dsfield "int
> > > > > bpf_skb_change_dsfield(struct sk_buff *skb, u8 mask, u8 value)".
> > > > > BPF_PROG_TYPE_CGROUP_SKB typed bpf_prog which currently can be
> > > > > attached to the ingress and egress path. The helper is needed
> > > > > because this type of bpf_prog cannot modify the skb directly.
> > > > >
> > > > > Used by a bpf_prog to specify DS field values on egress or ingress.
> > > >
> > > > Maybe you can expand a little bit here for your use case?
> > > > I know DS field might help but a description of your actual use case
> > > > will make adding this helper more compelling.
> > > +1.  More details on the use case is needed.
> > > Also, having an individual helper for each particular header field is too specific.
> > >
> > > For egress, there is bpf_setsockopt() for IP_TOS and IPV6_TCLASS and it can be called in other cgroup hooks. e.g.
> > > BPF_PROG_TYPE_SOCK_OPS during tcp ESTABLISHED event.
> > > There is an example in tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c.
> > > Is it enough for egress?
> >
> > Using bpf_setsockopt() has 2 issues: 1) it changes the userspace visible state 2) won't work with udp sendmsg cmsg
> 
> Right, so to clarify since I've been working with Tyler on a project
> of which this patch is a small component.
> Note, I may be wrong here, I don't fully understand how all of this
> works... but:
> 
> ad 1) AFAIK if bpf calls bpf_setsockopt on the socket in question,
> then userspace's view of the socket settings via
> getsockopt(IP_TOS/IPV6_TCLASS) will also be affected - this may be
> undesirable (it's technically userspace visible change in behaviour
> and could, as unlikely as it is, lead to application misbehaviour).
> This can be worked around via also overriding getsockopt/setsockopt
> with bpf, but then you need to store the value to return to userspace
> somewhere... AFAICT it all ends up being pretty ugly and very complex.
CGROUP_(SET|GET)SOCKOPT is created for that.
The user's value can be stored in bpf_sk_storage.

> 
> I wouldn't be worried about needing to override each individual field,
> as the only other field that looks likely to be potentially beneficial
> to override would be the ipv6 flowlabel.
> 
> ad 2) I don't think the bpf_setsockopt(IP_TOS/IPV6_TCLASS) approach
> works for packets generated via udp sendmsg where cmsg is being used
> to set tos.
There is CGROUP_UDP[4|6]_SENDMSG.  Right now, it can only change the addr.
tos/tclass support could be added.

> 3) I also think the bpf_setsockopt(IP_TOS/IPV6_TCLASS) might be too
> late, since it would be in response to an already built packet, and
> would thus presumably only take effect on the next packet, and not for
> this one, no?
The bpf_setsockopt can be called in bind and connect.
Is it not early enough?

> Technically this could be done by attaching the programs to tc egress
> instead of the cgroup hook, but then it's per interface, which is
> potentially the wrong granularity...
Right, there is advantage to do it at higher layer,
and earlier also.

If the tos/tclass value can be changed early on, the correct
ip[6] header can be written at the beginning instead
of redoing it later and need to worry about the skb_clone_writable(),
rewriting it, do iph->check..etc.

> As for what is driving this?  Upcoming wifi standard to allow access
> points to inform client devices how to dscp mark individual flows.
Interesting.

How does the sending host get this dscp value from wifi
and then affect the dscp of a particular flow?  Is the dscp
going to be stored in a bpf map for the bpf prog to use?

Are you testing on TCP also?

> As for the patch itself, I wonder if the return value shouldn't be
> reversed, currently '1 if the DS field is set, 0 if it is not set.'
> But I think returning 0 on success and an error on failure is more in
> line with what other bpf helpers do?
> OTOH, it does match bpf_skb_ecn_set_ce() returning 0 on failure...
If adding a helper , how about making the bpf_skb_store_bytes()
available to BPF_PROG_TYPE_CGROUP_SKB?  Then it will be
more flexible to change other fields in the future in
the network and transport header.