Hi Valentin,

Sorry for the delay!

On Mon, Nov 04, 2024 at 02:38:20PM +0100, Valentin Kleibel wrote:
> Hi Joey,
>
> > > We've tested your patch on our servers and ran into an issue.
> > > With heavy I/O load the aoe device had stale I/Os (e.g. rsync waiting
> > > indefinitely on one core) that can be "fixed" by running aoe-revalidate on
> > > that device.
> [...]
> > For the reference count debugging, I have sent a patch series here:
> >
> > [RFC PATCH 0/2] tracking the references of net_device in aoe
> > https://lore.kernel.org/lkml/20241002040616.25193-1-jlee@xxxxxxxx/T/#t
> >
> > Based on my testing, the numbers of dev_hold(nd) and dev_put(nd) calls are
> > balanced in aoe after this 'aoe: fix the potential use-after-free problem in
> > more places' patch is applied on the v6.11 kernel. I have tested
> > adding/modifying/deleting files on a remote target via aoe. My testing is
> > not heavy I/O testing, but the result is balanced.
> >
> > Could you please help by trying the above debug patch series to look at the
> > refcnt value in aoe on your side?
>
> Thanks for your work, I can confirm the refcnt value is balanced and the issue
> is fixed now.
>

Great! Thanks for your testing!

> However, the I/O waiting issue reported before is still there, and occurs
> more often now.
> This problem started with the first patch CVE-2023-6270 applied in commit
> f98364e92662.
> This only happens with heavy I/O on our "older" storage systems with
> spinning disks. Unfortunately we do not know how we could debug this, have
> you got any hints what we could do?

OK, spinning disks are good information. Could you please share more
about your environment, e.g. the number of CPUs, the size of the storage
shared over aoe, and how heavy your I/O load is?

If the situation can be reproduced, then I think perf can be used to
analyze the bottleneck.

Regards
Joey Lee