Hi Joey,
We've tested your patch on our servers and ran into an issue. Under heavy I/O load the aoe device had stale I/Os (e.g. rsync waiting indefinitely on one core), which can be "fixed" by running aoe-revalidate on that device.
[...]
> For the reference count debugging, I have sent a patch series here:
>
> [RFC PATCH 0/2] tracking the references of net_device in aoe
> https://lore.kernel.org/lkml/20241002040616.25193-1-jlee@xxxxxxxx/T/#t
>
> Base on my testing, the number of dev_hold(nd) and dev_put(nd) are
> balance in aoe after the this 'aoe: fix the potential use-after-free
> problem in more places' patch be applied on v6.11 kernel. I have tested
> add/modify/delete files in remote target by aoe. My testing is not a
> heavy I/O testing. But the result is balance. Could you please help to
> try the above debug patch series for looking at the refcnt value in aoe
> in your side?
Thanks for your work. I can confirm that the refcnt value is balanced and that issue is fixed now.
However, the I/O waiting issue reported before is still there, and it occurs more often now. The problem started with the first patch for CVE-2023-6270, applied in commit f98364e92662. It only happens under heavy I/O on our "older" storage systems with spinning disks. Unfortunately, we do not know how to debug this. Do you have any hints on what we could try?
Thanks,
Valentin

PS: Sorry for the delay, I'm now back from a long vacation.