"Nelson, Shannon" <shannon.nelson@xxxxxxx> writes: > On 11/7/2023 7:31 AM, Toke Høiland-Jørgensen wrote: >> >> "Nelson, Shannon" <shannon.nelson@xxxxxxx> writes: >> >>> While testing new code to support XDP in the ionic driver we found that >>> we could panic the kernel by running a bind/unbind loop on the target >>> interface of an xdp_redirect action. Obviously this is a stress test >>> that is abusing the system, but it does point to a window of opportunity >>> in bq_enqueue() and bq_xmit_all(). I believe that while the validity of >>> the target interface has been checked in __xdp_enqueue(), the interface >>> can be unbound by the time either bq_enqueue() or bq_xmit_all() tries to >>> use the interface. There is no locking or reference taken on the >>> interface to hold it in place before the target’s ndo_xdp_xmit() is called. >>> >>> Below is a stack trace that our tester captured while running our test >>> code on a RHEL 9.2 kernel – yes, I know, unpublished driver code on a >>> non-upstream kernel. But if you look at the current upstream code in >>> kernel/bpf/devmap.c I think you can see what we ran into. >>> >>> Other than telling users to not abuse the system with a bind/unbind >>> loop, is there something we can do to limit the potential pain here? >>> Without knowing what interfaces might be targeted by the users’ XDP >>> programs, is there a step the originating driver can do to take >>> precautions? Did we simply miss a step in the driver, or is this an >>> actual problem in the devmap code? >> >> Sounds like a driver bug :) > > Entirely possible, wouldn't be our first ... :-) > >> >> The XDP redirect flow guarantees that all outstanding packets are >> flushed within a single NAPI cycle, as documented here: >> https://docs.kernel.org/bpf/redirect.html >> >> So basically, the driver should be doing a two-step teardown: remove >> global visibility of the resource in question, wait for all concurrent >> users to finish, and *then* free the data structure. This corresponds to >> the usual RCU protection: resources should be kept alive until all >> concurrent RCU critical sections have exited on all CPUs. So if your >> driver is removing an interface's data structure without waiting for >> concurrent NAPI cycles to finish, that's a bug in the driver. >> >> This kind of thing is what the synchronize_net() function is for; for a >> usage example, see veth_napi_del_range(). My guess would be that you're >> missing this as part of your driver teardown flow? > > Essentially, the first thing we do in the remove function is to call > unregister_netdev(), which has synchronize_net() in the path, so I don't > think this is missing from our scenario, but thanks for the hint, I'll > keep this in mind. I do see there are a couple of net drivers that are > more aggressive about calling it directly in some other parts of the > logic - I don't think that has a bearing on this issue, but I'll keep it > in mind. Hmm, right, in fact unregister_netdev() has two such synchronize_net() calls. The XDP queue is only guaranteed to be flushed after the second one of those, though, and there's an 'ndo_uninit()' callback in-between them. So I don't suppose your driver implements that ndo and does something there that could cause the crash you're seeing? Otherwise, the one thing I can think of is that maybe it can be related to the fact that synchronize_net() turns into a synchronize_rcu_expedited() if the rtnl lock is held (which it is in this case if you're calling the parameter-less unregister_netdev()). 
I'm not quite sure I grok the expedited wait thing, but it should be
pretty easy to check if this is the cause by making a change like the
one below and seeing if the issue goes away.

-Toke

diff --git a/net/core/dev.c b/net/core/dev.c
index e28a18e7069b..1a035a5f0b0e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -10932,7 +10932,7 @@ void synchronize_net(void)
 {
 	might_sleep();
 	if (rtnl_is_locked())
-		synchronize_rcu_expedited();
+		synchronize_rcu();
 	else
 		synchronize_rcu();
 }
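
For completeness, here's a rough sketch of the two-step teardown
ordering I described above. This is not taken from any real driver:
the example_priv / example_stop_queues() / example_free_queues() names
are made up purely for illustration, the only real API used is
synchronize_net().

/* Hypothetical driver teardown; only the ordering matters here. */
static void example_teardown(struct example_priv *priv)
{
	/* Step 1: remove global visibility, so no new NAPI poll or
	 * XDP redirect can start using these queues.
	 */
	example_stop_queues(priv);

	/* Step 2: wait for all CPUs to leave their current RCU
	 * read-side sections, i.e. for any NAPI cycle that may still
	 * be flushing a devmap bulk queue into our ndo_xdp_xmit().
	 */
	synchronize_net();

	/* Step 3: only now free the rings and buffers that
	 * ndo_xdp_xmit() dereferences.
	 */
	example_free_queues(priv);
}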