On Thu, Nov 17, 2022 at 7:58 AM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote: > > Hello Eric, > > On Wed, Nov 16, 2022 at 07:44:41PM -0800, Eric Dumazet wrote: > > On Wed, Nov 16, 2022 at 7:16 PM Joel Fernandes (Google) > > <joel@xxxxxxxxxxxxxxxxx> wrote: > > > > > > In a networking test on ChromeOS, we find that using the new CONFIG_RCU_LAZY > > > causes a networking test to fail in the teardown phase. > > > > > > The failure happens during: ip netns del <name> > > > > And ? What happens then next ? > > The test is doing the 'ip netns del <name>' and then polling for the > disappearance of a network interface name for upto 5 seconds. I believe it is > using netlink to get a table of interfaces. That polling is timing out. > > Here is some more details from the test's owner (copy pasting from another > bug report): > In the cleanup, we remove the netns, and thus will cause the veth pair being > removed automatically, so we use a poll to check that if the veth in the root > netns still exists to know whether the cleanup is done. > > Here is a public link to the code that is failing (its in golang): > https://source.chromium.org/chromiumos/chromiumos/codesearch/+/main:src/platform/tast-tests/src/chromiumos/tast/local/network/virtualnet/env/env.go;drc=6c2841d6cc3eadd23e07912ec331943ee33d7de8;l=161 > > Here is a public link to the line of code in the actual test leading up to the above > path (this is the test that is run: > network.RoutingFallthrough.ipv4_only_primary) : > https://source.chromium.org/chromiumos/chromiumos/codesearch/+/main:src/platform/tast-tests/src/chromiumos/tast/local/bundles/cros/network/routing_fallthrough.go;drc=8fbf2c53960bc8917a6a01fda5405cad7c17201e;l=52 > > > > Using ftrace, I found the callbacks it was queuing which this series fixes. Use > > > call_rcu_flush() to revert to the old behavior. With that, the test passes. > > > > What is this test about ? What barrier was used to make it not flaky ? > > I provided the links above, let me know if you have any questions. > > > Was it depending on some undocumented RCU behavior ? > > This is a new RCU feature posted here for significant power-savings on > battery-powered devices: > https://lore.kernel.org/rcu/20221017140726.GG5600@paulmck-ThinkPad-P17-Gen-1/T/#m7a54809b8903b41538850194d67eb34f203c752a > > There is also an LPC presentation about the same, I can dig the link if you > are interested. > > > Maybe adding a sysctl to force the flush would be better for functional tests ? > > > > I would rather change the test(s), than adding call_rcu_flush(), > > adding merge conflicts to future backports. > > I am not too sure about that, I think a user might expect the network > interface to disappear from the networking tables quickly enough without > dealing with barriers or kernel iternals. However, I added the authors of the > test to this email in the hopes he can provide is point of views as well. > > The general approach we are taking with this sort of thing is to use > call_rcu_flush() which is basically the same as call_rcu() for systems with > CALL_RCU_LAZY=n. You can see some examples of that in the patch series link > above. Just to note, CALL_RCU_LAZY depends on CONFIG_RCU_NOCB_CPU so its only > Android and ChromeOS that are using it. I am adding Jie to share any input, > he is from the networking team and knows this test well. > > I do not know what is this RCU_LAZY thing, but IMO this should be opt-in For instance, only kfree_rcu() should use it. We can not review hundreds of call_rcu() call sites and decide if adding arbitrary delays cou hurt .