Re: Checksum behaviour of bpf_redirected packets

On 5/7/20 5:54 PM, Lorenz Bauer wrote:
On Wed, 6 May 2020 at 22:55, Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
On 5/6/20 6:24 PM, Lorenz Bauer wrote:
On Wed, 6 May 2020 at 02:28, Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
On Mon, May 4, 2020 at 9:12 AM Lorenz Bauer <lmb@xxxxxxxxxxxxxx> wrote:

In our TC classifier cls_redirect [1], we use the following sequence of
helper calls to decapsulate a GUE (basically IP + UDP + custom header)
encapsulated packet:

    skb_adjust_room(skb, -encap_len, BPF_ADJ_ROOM_MAC, BPF_F_ADJ_ROOM_FIXED_GSO)
    bpf_redirect(skb->ifindex, BPF_F_INGRESS)

It seems like some checksums of the inner headers are not validated in
this case. For example, a TCP SYN packet with an invalid TCP checksum is
still accepted by the network stack and elicits a SYN-ACK.

Is this known but undocumented behaviour or a bug? In either case, is
there a workaround I'm not aware of?

I thought inner and outer csums are covered by different flags and the driver
is supposed to set the right one depending on the level of in-hw checking it did.

I've figured out what the problem is. We receive the following packet from
the driver:

      | ETH | IP | UDP | GUE | IP | TCP |
      skb->ip_summed == CHECKSUM_UNNECESSARY

ip_summed is CHECKSUM_UNNECESSARY because our NICs do rx
checksum offloading. On this packet we run skb_adjust_room_mac(-encap),
and get the following:

      | ETH | IP | TCP |
      skb->ip_summed == CHECKSUM_UNNECESSARY

Note that ip_summed is still CHECKSUM_UNNECESSARY. After
bpf_redirect()ing into the ingress, we end up in tcp_v4_rcv. There
skb_checksum_init is turned into a no-op due to
CHECKSUM_UNNECESSARY.
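
To make the failure mode above concrete, here is a minimal userspace sketch.
It is a toy model, not kernel code: struct model_skb, the MODEL_* constants
and both functions are hypothetical names invented for illustration. It only
assumes what the paragraph above states, namely that the receive path skips
software verification when ip_summed is CHECKSUM_UNNECESSARY and that the
decapsulation leaves ip_summed untouched.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for the kernel's ip_summed states (hypothetical names). */
enum model_ip_summed { MODEL_CHECKSUM_NONE, MODEL_CHECKSUM_UNNECESSARY };

struct model_skb {
	enum model_ip_summed ip_summed;
	bool l4_csum_valid; /* would a software check of the current L4 header pass? */
};

/*
 * Models the decapsulation described above: the outer headers go away,
 * the inner L4 header becomes the current one, and ip_summed is NOT touched.
 */
static void model_decap_keep_flag(struct model_skb *skb, bool inner_l4_csum_valid)
{
	skb->l4_csum_valid = inner_l4_csum_valid;
}

/*
 * Models the receive path: verification is skipped entirely when the
 * packet is marked CHECKSUM_UNNECESSARY, otherwise software decides.
 */
static bool model_l4_rcv_accepts(const struct model_skb *skb)
{
	if (skb->ip_summed == MODEL_CHECKSUM_UNNECESSARY)
		return true;
	return skb->l4_csum_valid;
}
```

In this model, a packet that arrives as CHECKSUM_UNNECESSARY and is
decapsulated to an inner segment with a bad checksum is still accepted; the
same packet marked CHECKSUM_NONE is rejected.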

I think this boils down to bpf_skb_generic_pop not adjusting ip_summed
accordingly. Unfortunately I don't understand checksum handling well
enough. Daniel, it seems like you wrote the helper, could you
take a look?

Right, so in the skb_adjust_room() case we're not aware of protocol
specifics. We do handle the csum complete case via skb_postpull_rcsum(),
but not CHECKSUM_UNNECESSARY at the moment. I presume in your case the
skb->csum_level of the original skb prior to skb_adjust_room() call
might have been 0 (that is, covering UDP)? So if we'd add the possibility
to __skb_decr_checksum_unnecessary() via flag, then it would become
skb->ip_summed = CHECKSUM_NONE? And to be generic, we'd need to do the
same for the reverse case. Below is a quick hack (compile tested-only);
would this resolve your case ...
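
For readers following along, the state change discussed above can be
sketched in userspace as below. This is not the patch referred to above; it
is a toy model under stated assumptions. The names struct model_skb,
MODEL_* and model_*_checksum_unnecessary are hypothetical, and the maximum
level of 3 is assumed to mirror the kernel's SKB_MAX_CSUM_LEVEL.

```c
#include <assert.h>

#define MODEL_MAX_CSUM_LEVEL 3 /* assumption: mirrors SKB_MAX_CSUM_LEVEL */

/* Toy stand-ins for the kernel's ip_summed states (hypothetical names). */
enum model_ip_summed { MODEL_CHECKSUM_NONE, MODEL_CHECKSUM_UNNECESSARY };

struct model_skb {
	enum model_ip_summed ip_summed;
	unsigned int csum_level; /* extra checksums verified beyond the first */
};

/*
 * Decap direction: one verified checksum is consumed. At level 0 nothing
 * verified remains, so the packet falls back to CHECKSUM_NONE.
 */
static void model_decr_checksum_unnecessary(struct model_skb *skb)
{
	if (skb->ip_summed != MODEL_CHECKSUM_UNNECESSARY)
		return;
	if (skb->csum_level == 0)
		skb->ip_summed = MODEL_CHECKSUM_NONE;
	else
		skb->csum_level--;
}

/*
 * Reverse direction: record one more verified checksum, or promote a
 * CHECKSUM_NONE packet to CHECKSUM_UNNECESSARY at level 0.
 */
static void model_incr_checksum_unnecessary(struct model_skb *skb)
{
	if (skb->ip_summed == MODEL_CHECKSUM_UNNECESSARY) {
		if (skb->csum_level < MODEL_MAX_CSUM_LEVEL)
			skb->csum_level++;
	} else if (skb->ip_summed == MODEL_CHECKSUM_NONE) {
		skb->ip_summed = MODEL_CHECKSUM_UNNECESSARY;
		skb->csum_level = 0;
	}
}
```

With csum_level 0 (only the outer UDP checksum verified), one decrement
yields CHECKSUM_NONE, so the inner TCP checksum would be verified in
software. With csum_level 1, one decrement leaves the packet at
CHECKSUM_UNNECESSARY with level 0.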

Thanks for the patch, it indeed fixes our problem! I spent some more time
trying to understand the checksum offload stuff, here is where I am:

On NICs that don't support hardware offload, ip_summed is CHECKSUM_NONE and
everything works by default, since the rest of the stack does checksumming in
software.

On NICs that support CHECKSUM_COMPLETE, skb_postpull_rcsum
will adjust for the data that is being removed from the skb. The rest of the
stack will use the correct value, all is well.
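
The arithmetic behind the CHECKSUM_COMPLETE case can be sketched in
userspace as follows. This is an illustration of ones' complement
subtraction, not the kernel's implementation: csum_fold16, csum_bytes,
csum_sub16 and csum_after_pull are hypothetical names, and the sketch
assumes the removed bytes start at an even offset and have even length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold carries back in until the sum fits in 16 bits (end-around carry). */
static uint16_t csum_fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Ones' complement sum of a buffer, read as big-endian 16-bit words. */
static uint16_t csum_bytes(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum = csum_fold16(sum + (((uint32_t)buf[i] << 8) | buf[i + 1]));
	if (len & 1) /* odd trailing byte is padded with a zero byte */
		sum = csum_fold16(sum + ((uint32_t)buf[len - 1] << 8));
	return (uint16_t)sum;
}

/* a - b in ones' complement arithmetic is a + ~b. */
static uint16_t csum_sub16(uint16_t a, uint16_t b)
{
	return csum_fold16((uint32_t)a + (uint16_t)~b);
}

/*
 * Given the sum over the full packet, subtract the sum of the removed
 * bytes to get the sum over what is left, without re-reading the rest.
 */
static uint16_t csum_after_pull(uint16_t full, const uint8_t *removed,
				size_t removed_len)
{
	return csum_sub16(full, csum_bytes(removed, removed_len));
}
```

For a 12-byte buffer with 4 bytes removed from the middle, the adjusted
value matches a fresh sum over the remaining 8 bytes, which is why the rest
of the stack can keep using the hardware-provided sum.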

However, we're out of luck on NICs that do CHECKSUM_UNNECESSARY:
the API of skb_adjust_room doesn't tell us whether the user intends to
remove headers or data, and how that will influence csum_level.
From my POV, skb_adjust_room currently does the wrong thing.
I think we need to fix skb_adjust_room to do the right thing by default,
rather than extending the API. We spent a lot of time on tracking this down,
so hopefully we can spare others the pain.

As Jakub alludes to, we don't know when and how often to call
__skb_decr_checksum_unnecessary so we should just
unconditionally downgrade a packet to CHECKSUM_NONE if we encounter
CHECKSUM_UNNECESSARY in bpf_skb_generic_pop. It sounds simple
enough to land as a fix via the bpf tree (which is important for our
production kernel). As a follow-up we could add the inverse of the flags you
propose via bpf-next.

What do you think?

My concern with unconditionally downgrading a packet to CHECKSUM_NONE is that
it would basically trash performance if we have to fall back to sw in the fast
path; these helpers are also used in our LB case for DSR, for example. I agree
that it sucks to expose these implementation details though. So eventually we'd
end up with 3 csum flags: inc/dec/reset to none. bpf_skb_adjust_room() is
already a complex helper to use, with all its flags, where you end up looking
into the implementation to understand what it is really doing. I'm not sure if
we make anything worse, but I do see your concern. :/ (We do have the
bpf_csum_update() helper as well. I wonder whether we should split such control
into a different helper.)

Thanks,
Daniel


