Re: Crash for i40e on net-next (was: [PATCH v8 bpf-next 00/14] mvneta: introduce XDP multi-buffer support)

Alexander Duyck <alexander.duyck@xxxxxxxxx> · Fri, 23 Apr 2021 09:43:14 -0700

On Thu, Apr 22, 2021 at 10:28 PM Magnus Karlsson
<magnus.karlsson@xxxxxxxxx> wrote:
>
> On Thu, Apr 22, 2021 at 5:05 PM Jesper Dangaard Brouer
> <brouer@xxxxxxxxxx> wrote:
> >
> > On Thu, 22 Apr 2021 16:42:23 +0200
> > Jesper Dangaard Brouer <brouer@xxxxxxxxxx> wrote:
> >
> > > On Thu, 22 Apr 2021 12:24:32 +0200
> > > Magnus Karlsson <magnus.karlsson@xxxxxxxxx> wrote:
> > >
> > > > On Wed, Apr 21, 2021 at 5:39 PM Jesper Dangaard Brouer
> > > > <brouer@xxxxxxxxxx> wrote:
> > > > >
> > > > > On Wed, 21 Apr 2021 16:12:32 +0200
> > > > > Magnus Karlsson <magnus.karlsson@xxxxxxxxx> wrote:
> > > > >
> > > [...]
> > > > > > more than I get.
> > > > >
> > > > > I clearly have a bug in the i40e driver.  As I wrote later, I don't see
> > > > > > any packets transmitted for XDP_TX.  Hmm, I am using Mel Gorman's tree,
> > > > > which contains the i40e/ice/ixgbe bug we fixed earlier.
> > >
> > > Something is wrong with i40e, I changed git-tree to net-next (at
> > > commit 5d869070569a) and XDP seems to have stopped working on i40e :-(
>
> Found this out too when switching to the net tree yesterday to work on
> proper packet drop tracing as you spotted/requested yesterday. The
> commit below completely broke XDP support on i40e (if you do not run
> with a zero-copy AF_XDP socket because that path still works). I am
> working on a fix that does not just revert the patch, but fixes the
> original problem without breaking XDP. Will post it and the tracing
> fixes as soon as I can.
>
> commit 12738ac4754ec92a6a45bf3677d8da780a1412b3
> Author: Arkadiusz Kubalewski <arkadiusz.kubalewski@xxxxxxxxx>
> Date:   Fri Mar 26 19:43:40 2021 +0100
>
>     i40e: Fix sparse errors in i40e_txrx.c
>
>     Remove error handling through pointers. Instead use plain int
>     to return value from i40e_run_xdp(...).
>
>     Previously:
>     - sparse errors were produced during compilation:
>     i40e_txrx.c:2338 i40e_run_xdp() error: (-2147483647) too low for ERR_PTR
>     i40e_txrx.c:2558 i40e_clean_rx_irq() error: 'skb' dereferencing possible ERR_PTR()
>
>     - sk_buff* was used to return value, but it has never had valid
>     pointer to sk_buff. Returned value was always int handled as
>     a pointer.
>
>     Fixes: 0c8493d90b6b ("i40e: add XDP support for pass and drop actions")
>     Fixes: 2e6893123830 ("i40e: split XDP_TX tail and XDP_REDIRECT map flushing")
>     Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@xxxxxxxxx>
>     Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@xxxxxxxxx>
>     Tested-by: Dave Switzer <david.switzer@xxxxxxxxx>
>     Signed-off-by: Tony Nguyen <anthony.l.nguyen@xxxxxxxxx>

Yeah, this patch would horribly break things, especially in the
multi-buffer case. The idea behind using the skb pointer to indicate
the error is that it persists until we hit the EOP descriptor. With
that removed, you end up mangling the entire list of frames, since the
Rx cleanup loop will start trying to process the next frame in the
middle of a packet.
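
To make the point about persistence concrete, here is a minimal
user-space sketch. It is not the i40e code and does not reproduce the
patch above. Every name in it (struct buf_desc, struct pkt, run_xdp(),
err_ptr(), is_err(), and the two frame_starts_*() functions) is a
hypothetical stand-in, and the model assumes the XDP program consumes
every frame. It only counts how often a buffer gets treated as the
first buffer of a frame.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins, not the real i40e or kernel definitions. */
struct buf_desc { bool eop; };   /* one Rx descriptor, i.e. one buffer */
struct pkt { int nr_frags; };    /* placeholder for struct sk_buff */

#define XDP_CONSUMED 1
#define MAX_ERRNO 4095

/* User-space imitations of the kernel's ERR_PTR()/IS_ERR() idea. */
static void *err_ptr(long err) { return (void *)(intptr_t)err; }
static bool is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Assumption: the XDP program consumes every frame it is shown. */
static int run_xdp(void) { return XDP_CONSUMED; }

/* Verdict stored in the skb slot, so it survives until EOP. */
size_t frame_starts_persistent(const struct buf_desc *ring, size_t n)
{
	struct pkt *skb = NULL;  /* would be saved in the ring between polls */
	size_t starts = 0;

	for (size_t i = 0; i < n; i++) {
		if (!skb) {                      /* first buffer of a frame */
			starts++;
			skb = err_ptr(-run_xdp());   /* remember the verdict */
		}
		assert(is_err(skb));             /* slot holds a verdict, not an skb */
		if (!ring[i].eop)
			continue;                    /* verdict carries to next buffer */
		skb = NULL;                      /* EOP: next buffer starts a frame */
	}
	return starts;
}

/* Verdict kept in a per-iteration int, so nothing survives to EOP. */
size_t frame_starts_per_buffer(const struct buf_desc *ring, size_t n)
{
	struct pkt *skb = NULL;
	size_t starts = 0;

	for (size_t i = 0; i < n; i++) {
		int xdp_res = 0;                 /* reset for every descriptor */

		if (!skb) {                      /* always true: skb is never set */
			starts++;
			xdp_res = run_xdp();
		}
		(void)xdp_res;                   /* frame consumed, skb stays NULL */
		if (!ring[i].eop)
			continue;                    /* next buffer looks like a new frame */
		skb = NULL;
	}
	return starts;
}
```

Feeding both functions a ring holding one three-buffer frame followed
by one single-buffer frame, the persistent variant sees two frame
starts while the per-buffer variant sees four, which is the "next
frame in the middle of a packet" problem in miniature.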

>
> > Renamed subj as this is without this patchset applied.
> >
> > > $ uname -a
> > > Linux broadwell 5.12.0-rc7-net-next+ #600 SMP PREEMPT Thu Apr 22 15:13:15 CEST 2021 x86_64 x86_64 x86_64 GNU/Linux
> > >
> > > When I load any XDP prog almost no packets are let through:
> > >
> > >  [kernel-bpf-samples]$ sudo ./xdp1 i40e2
> > >  libbpf: elf: skipping unrecognized data section(16) .eh_frame
> > >  libbpf: elf: skipping relo section(17) .rel.eh_frame for section(16) .eh_frame
> > >  proto 17:          1 pkt/s
> > >  proto 0:          0 pkt/s
> > >  proto 17:          0 pkt/s
> > >  proto 0:          0 pkt/s
> > >  proto 17:          1 pkt/s
> >
> > Trying out xdp_redirect:
> >
> >  [kernel-bpf-samples]$ sudo ./xdp_redirect i40e2 i40e2
> >  input: 7 output: 7
> >  libbpf: elf: skipping unrecognized data section(20) .eh_frame
> >  libbpf: elf: skipping relo section(21) .rel.eh_frame for section(20) .eh_frame
> >  libbpf: Kernel error message: XDP program already attached
> >  WARN: link set xdp fd failed on 7
> >  ifindex 7:       7357 pkt/s
> >  ifindex 7:       7909 pkt/s
> >  ifindex 7:       7909 pkt/s
> >  ifindex 7:       7909 pkt/s
> >  ifindex 7:       7909 pkt/s
> >  ifindex 7:       7909 pkt/s
> >  ifindex 7:       6357 pkt/s
> >
> > And then it crashes (see below) at page_frag_free+0x31 which calls
> > virt_to_head_page() with a wrong addr (I guess).  This is called by
> > i40e_clean_tx_irq+0xc9.
>
> Did not see a crash myself, just 4 Kpps. But the rings and DMA
> mappings got completely mangled by the patch above, so could be the
> same cause.

Are you running with jumbo frames enabled? I would think this change
would really blow things up in the jumbo-enabled case.