On Thu, 7 May 2020 at 22:25, Jakub Kicinski <kuba@xxxxxxxxxx> wrote:
>
> On Thu, 7 May 2020 18:43:47 +0200 Daniel Borkmann wrote:
> > > Thanks for the patch, it indeed fixes our problem! I spent some more time
> > > trying to understand the checksum offload stuff, here is where I am:
> > >
> > > On NICs that don't support hardware offload ip_summed is CHECKSUM_NONE,
> > > everything works by default since the rest of the stack does checksumming in
> > > software.
> > >
> > > On NICs that support CHECKSUM_COMPLETE, skb_postpull_rcsum
> > > will adjust for the data that is being removed from the skb. The rest of the
> > > stack will use the correct value, all is well.
> > >
> > > However, we're out of luck on NICs that do CHECKSUM_UNNECESSARY:
> > > the API of skb_adjust_room doesn't tell us whether the user intends to
> > > remove headers or data, and how that will influence csum_level.
> > > From my POV, skb_adjust_room currently does the wrong thing.
> > > I think we need to fix skb_adjust_room to do the right thing by default,
> > > rather than extending the API. We spent a lot of time on tracking this down,
> > > so hopefully we can spare others the pain.
> > >
> > > As Jakub alludes to, we don't know when and how often to call
> > > __skb_decr_checksum_unnecessary so we should just
> > > unconditionally downgrade a packet to CHECKSUM_NONE if we encounter
> > > CHECKSUM_UNNECESSARY in bpf_skb_generic_pop. It sounds simple
> > > enough to land as a fix via the bpf tree (which is important for our
> > > production kernel). As a follow up we could add the inverse of the flags you
> > > propose via bpf-next.
> > >
> > > What do you think?
> >
> > My concern with unconditionally downgrading a packet to CHECKSUM_NONE would
> > basically trash performance if we have to fallback to sw in fast-path, these
> > helpers are also used in our LB case for DSR, for example. I agree that it
> > sucks to expose these implementation details though. So eventually we'd end
> > up with 3 csum flags: inc/dec/reset to none. bpf_skb_adjust_room() is already
> > a complex to use helper with all its flags where you end up looking into the
> > implementation detail to understand what it is really doing. I'm not sure if
> > we make anything worse, but I do see your concern. :/ (We do have bpf_csum_update()
> > helper as well. I wonder whether we should split such control into a different
> > helper.)
>
> Probably stating the obvious but for decap of UDP tunnels which carry
> locally terminated flows - we'd probably also want the upgrade from
> UNNECESSARY to COMPLETE, like we do in the kernel
> (skb_checksum_try_convert()).

Tricky. I guess this is an argument in the direction that
bpf_adjust_room is too low level an API?

-- 
Lorenz Bauer  |  Systems Engineer
6th Floor, County Hall/The Riverside Building, SE1 7PB, UK

www.cloudflare.com
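
To make the "inc/dec/reset to none" idea from the thread a bit more concrete, here is a rough userspace sketch of how the three operations might act on ip_summed and csum_level. This is only a model, not kernel code: the MODEL_* names, struct model_skb and model_csum_apply() are all made up for illustration. The inc/dec behaviour is modelled on my understanding of __skb_incr_checksum_unnecessary and __skb_decr_checksum_unnecessary, and the maximum level of 3 is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: csum_level is a small counter that saturates at 3. */
#define MODEL_MAX_CSUM_LEVEL 3u

/* Hypothetical stand-ins for the kernel's ip_summed values. */
enum model_ip_summed {
    MODEL_CHECKSUM_NONE,
    MODEL_CHECKSUM_UNNECESSARY,
    MODEL_CHECKSUM_COMPLETE,
};

/* Hypothetical, stripped-down skb: only the two fields discussed here. */
struct model_skb {
    enum model_ip_summed ip_summed;
    unsigned int csum_level; /* validated checksums beyond the outermost */
};

/* The three controls floated in the thread; the names are invented. */
enum model_csum_action {
    MODEL_CSUM_INC,   /* e.g. after adding an encap header */
    MODEL_CSUM_DEC,   /* e.g. after popping an encap header */
    MODEL_CSUM_RESET, /* give up and let software verify */
};

static void model_csum_apply(struct model_skb *skb,
                             enum model_csum_action action)
{
    assert(skb != NULL);

    switch (action) {
    case MODEL_CSUM_INC:
        if (skb->ip_summed == MODEL_CHECKSUM_UNNECESSARY) {
            /* Saturate instead of wrapping around. */
            if (skb->csum_level < MODEL_MAX_CSUM_LEVEL)
                skb->csum_level++;
        } else if (skb->ip_summed == MODEL_CHECKSUM_NONE) {
            skb->ip_summed = MODEL_CHECKSUM_UNNECESSARY;
            skb->csum_level = 0;
        }
        break;
    case MODEL_CSUM_DEC:
        if (skb->ip_summed == MODEL_CHECKSUM_UNNECESSARY) {
            /* Level 0 has nothing left to drop: fall back to NONE. */
            if (skb->csum_level == 0)
                skb->ip_summed = MODEL_CHECKSUM_NONE;
            else
                skb->csum_level--;
        }
        break;
    case MODEL_CSUM_RESET:
        /* The unconditional downgrade proposed for the pop path. */
        if (skb->ip_summed == MODEL_CHECKSUM_UNNECESSARY) {
            skb->ip_summed = MODEL_CHECKSUM_NONE;
            skb->csum_level = 0;
        }
        break;
    }
}
```

In this model CHECKSUM_COMPLETE is left alone by all three actions, since that case is handled by adjusting the checksum value itself. The performance concern is visible in the RESET branch: a packet that arrived with csum_level 1 loses both validated checksums, whereas DEC would keep one.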