"Nelson, Shannon" <shannon.nelson@xxxxxxx> writes: > While testing new code to support XDP in the ionic driver we found that > we could panic the kernel by running a bind/unbind loop on the target > interface of an xdp_redirect action. Obviously this is a stress test > that is abusing the system, but it does point to a window of opportunity > in bq_enqueue() and bq_xmit_all(). I believe that while the validity of > the target interface has been checked in __xdp_enqueue(), the interface > can be unbound by the time either bq_enqueue() or bq_xmit_all() tries to > use the interface. There is no locking or reference taken on the > interface to hold it in place before the target’s ndo_xdp_xmit() is called. > > Below is a stack trace that our tester captured while running our test > code on a RHEL 9.2 kernel – yes, I know, unpublished driver code on a > non-upstream kernel. But if you look at the current upstream code in > kernel/bpf/devmap.c I think you can see what we ran into. > > Other than telling users to not abuse the system with a bind/unbind > loop, is there something we can do to limit the potential pain here? > Without knowing what interfaces might be targeted by the users’ XDP > programs, is there a step the originating driver can do to take > precautions? Did we simply miss a step in the driver, or is this an > actual problem in the devmap code? Sounds like a driver bug :) The XDP redirect flow guarantees that all outstanding packets are flushed within a single NAPI cycle, as documented here: https://docs.kernel.org/bpf/redirect.html So basically, the driver should be doing a two-step teardown: remove global visibility of the resource in question, wait for all concurrent users to finish, and *then* free the data structure. This corresponds to the usual RCU protection: resources should be kept alive until all concurrent RCU critical sections have exited on all CPUs. So if your driver is removing an interface's data structure without waiting for concurrent NAPI cycles to finish, that's a bug in the driver. This kind of thing is what the synchronize_net() function is for; for a usage example, see veth_napi_del_range(). My guess would be that you're missing this as part of your driver teardown flow? Another source of a bug like this could be that your driver does not in fact call xdp_do_flush() before exiting its NAPI cycle, so that there will be packets from the previous cycle in the bq queue, in which case the assumption mentioned in the linked document obviously breaks down. But that would also be a driver bug :) -Toke