Hi Valentin,

On Thu, Sep 12, 2024 at 12:58:46PM +0200, Valentin Kleibel wrote:
> > Then Nicolai Stange found more places in aoe have potential use-after-free
> > problem with tx(). e.g. revalidate(), aoecmd_ata_rw(), resend(), probe()
> > and aoecmd_cfg_rsp(). Those functions also use aoenet_xmit() to push
> > packet to tx queue. So they should also use dev_hold() to increase the
> > refcnt of skb->dev.
>
> We've tested your patch on our servers and ran into an issue.
> With heavy I/O load the aoe device had stale I/Os (e.g. rsync waiting
> indefinetly on one core) that can be "fixed" by running aoe-revalidate on
> that device.
>
> Additionally when trying to shut down the system we see the message:
> unregister_netdevice: waiting for XXX to become free. Usage Count = XXXXX
> on aoe devices with a usage count somewhere in the millions.
> This has been the same as without the patch, i assume the fix is still
> incomplete.
>

For debugging the reference count, I have sent a patch series here:

[RFC PATCH 0/2] tracking the references of net_device in aoe
https://lore.kernel.org/lkml/20241002040616.25193-1-jlee@xxxxxxxx/T/#t

Based on my testing, the numbers of dev_hold(nd) and dev_put(nd) calls
are balanced in aoe after the 'aoe: fix the potential use-after-free
problem in more places' patch is applied on the v6.11 kernel. I tested
adding, modifying and deleting files on the remote target over aoe. My
testing was not a heavy I/O test, but the result was balanced.

Could you please try the above debug patch series and look at the
refcnt value in aoe on your side?

Thanks a lot!
Joey Lee
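
P.S. To make "balanced" concrete, here is a minimal userspace sketch of the
accounting. It is only a model, not the aoe driver code and not the debug
patch series: struct model_netdev and all model_* names are hypothetical,
and the counters stand in for the dev_hold()/dev_put() calls on a
net_device. A transmit path that takes a reference but never drops it
leaves a nonzero outstanding count, which is the kind of leak that would
show up as a large usage count at unregister time.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical userspace stand-in for a net_device; only the counters matter. */
struct model_netdev {
	const char *name;
	long holds;	/* how many times a reference was taken */
	long puts;	/* how many times a reference was dropped */
};

/* Models dev_hold(): take one reference. */
static void model_dev_hold(struct model_netdev *nd)
{
	nd->holds++;
}

/* Models dev_put(): drop one reference; dropping more than was taken is a bug. */
static void model_dev_put(struct model_netdev *nd)
{
	assert(nd->puts < nd->holds);
	nd->puts++;
}

/* References still outstanding; 0 means holds and puts are balanced. */
static long model_outstanding(const struct model_netdev *nd)
{
	return nd->holds - nd->puts;
}

/*
 * Models a transmit path: hold the device before the packet is queued
 * and, if put_after_tx is nonzero, drop the reference once tx is done.
 * Passing 0 simulates a path that forgets the matching put.
 */
static void model_xmit(struct model_netdev *nd, int put_after_tx)
{
	model_dev_hold(nd);
	/* the packet would be handed to the device here */
	if (put_after_tx)
		model_dev_put(nd);
}

/* Print the counters in a form that is easy to compare between runs. */
static void model_report(const struct model_netdev *nd)
{
	printf("%s: holds=%ld puts=%ld outstanding=%ld\n",
	       nd->name, nd->holds, nd->puts, model_outstanding(nd));
}
```

Calling model_xmit(&nd, 1) in a loop and then model_report(&nd) should show
outstanding=0; a single model_xmit(&nd, 0) leaves outstanding=1.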