On Tue, Aug 22, 2023 at 01:40:45PM +0300, Nikolay Aleksandrov wrote:
> Thank you for testing, but we really need to understand what is going
> on and why the device isn't getting deleted for so long. Currently I
> don't have the time to debug it properly (I'll be able to next week at
> the earliest). We can't apply the patch based only on tests without
> understanding the underlying issue. I'd look into what the reproducer
> is doing exactly and also check the system state while the deadlock
> has happened. Also you can list the currently held locks (if
> CONFIG_LOCKDEP is enabled) via magic sysrq + d for example. See which
> process is holding them, what are their priorities and so on.
> Try to build some theory of how a deadlock might happen and then go
> about proving it. Does the 8021q module have the same problem? It uses
> similar code to set its hook.

Hi Nik,

Thank you so much for the instructions! I was able to obtain a decoded
stack trace showing the reproducer's behavior in my QEMU VM running
kernel 6.5-rc4, in case that gives us more context for pinpointing the
problem. Here's a link to the output:

https://pastecat.io/?p=IlKZlflN9j2Z2mspjKe7

Basically, after the reproducer had run (line 1854) for about 180
seconds, the unregister_netdevice warning appeared (line 1856). After
another 50 seconds, the kernel detected that some tasks had been
stalled for more than 143 seconds (line 1866), so it panicked on the
blocked tasks (line 2116). Before the panic, we did get to see all the
locks held in the system (line 2068), and the dump showed that many of
the processes created by the reproducer were contending for
br_ioctl_mutex.

I'm now starting to wonder whether this is really a deadlock, or
simply some tasks failing to grab the lock because so many processes
are trying to acquire it at once.

Let me know what you think about the situation shown in the above log,
and let's keep in touch for any future debugging. Thank you again for
guiding me through the problem!

Best regards,
Ziqi
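
P.S. In case it helps anyone else reproducing this, below is roughly
how I triggered the lock dump in my VM. Treat it as a sketch assuming
sysrq is enabled in the running kernel and lockdep was built in:

    # requires CONFIG_LOCKDEP=y; enable all sysrq functions first
    echo 1 > /proc/sys/kernel/sysrq
    # "d" asks lockdep to dump all locks currently held
    echo d > /proc/sysrq-trigger
    # the dump goes to the kernel log
    dmesg | tail -n 100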