January 28, 2025 at 1:33 AM, "David Howells" <dhowells@xxxxxxxxxx> wrote:

> Here's an additional patch to allow stats on the number of retries to be
> obtained. This isn't a fix per se.
>
> David
>
> [...]

Hi David, Marc.

I regret to report that this patch didn't fix the hanging in
v9fs_evict_inode when running selftests/bpf.

Here is what I tried:

* Checked out the latest bpf-next source tree (0fc5dddb9409)
* Applied the patch:
  https://lore.kernel.org/netfs/3173328.1738024385@xxxxxxxxxxxxxxxxxxxxxx/
* Applied the retry stats patch:
  https://lore.kernel.org/netfs/3187377.1738056789@xxxxxxxxxxxxxxxxxxxxxx/
* Modified tools/testing/selftests/bpf/config to enable /proc/fs/netfs/stats
* Modified the CI scripts to collect the stats
* Ran the shell script reproducing the CI testing pipeline

Bash snippet starting a background process that collects /proc/fs/netfs/stats:

function tail_netfs {
    echo -n > /mnt/vmtest/netfs-stats.log
    while true; do
        echo >> /mnt/vmtest/netfs-stats.log
        cat /proc/fs/netfs/stats >> /mnt/vmtest/netfs-stats.log
        sleep 1
    done
}
export -f tail_netfs
nohup bash -c 'tail_netfs' & disown

Last recorded /proc/fs/netfs/stats (note the 0 retries):

Reads  : DR=0 RA=15184 RF=5 RS=0 WB=0 WBZ=0
Writes : BW=488 WT=0 DW=0 WP=488 2C=0
ZeroOps: ZR=7964 sh=0 sk=0
DownOps: DL=15189 ds=15189 df=0 di=0
CaRdOps: RD=0 rs=0 rf=0
UpldOps: UL=488 us=488 uf=0
CaWrOps: WR=0 ws=0 wf=0
Retries: rq=0 rs=0 wq=0 ws=0
Objs   : rr=2 sr=1 foq=1 wsc=0
WbLock : skip=0 wait=0
-- FS-Cache statistics --
Cookies: n=0 v=0 vcol=0 voom=0
Acquire: n=0 ok=0 oom=0
LRU    : n=0 exp=0 rmv=0 drp=0 at=0
Invals : n=0
Updates: n=0 rsz=0 rsn=0
Relinqs: n=0 rtr=0 drop=0
NoSpace: nwr=0 ncr=0 cull=0
IO     : rd=0 wr=0 mis=0

The stack of the hung task hasn't changed:

[  184.375149] INFO: task modprobe:2759 blocked for more than 20 seconds.
[  184.376149]       Tainted: G           OE      6.13.0-gbb67a65a921c-dirty #3
[  184.376593] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  184.377119] task:modprobe state:D stack:0 pid:2759 tgid:2759 ppid:455 flags:0x00004002
[  184.377701] Call Trace:
[  184.377886]  <TASK>
[  184.378039]  __schedule+0xa91/0xe80
[  184.378282]  schedule+0x41/0xb0
[  184.378490]  v9fs_evict_inode+0xfe/0x170
[  184.378754]  ? __pfx_var_wake_function+0x10/0x10
[  184.379070]  evict+0x1ef/0x360
[  184.379288]  __dentry_kill+0xb0/0x220
[  184.379528]  ? dput+0x3a/0x1d0
[  184.379736]  dput+0x114/0x1d0
[  184.379946]  __fput+0x136/0x2b0
[  184.380158]  task_work_run+0x89/0xc0
[  184.380396]  do_exit+0x2c6/0x9c0
[  184.380617]  do_group_exit+0xa4/0xb0
[  184.380870]  __x64_sys_exit_group+0x17/0x20
[  184.381137]  x64_sys_call+0x21a0/0x21a0
[  184.381386]  do_syscall_64+0x79/0x120
[  184.381630]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  184.381969] RIP: 0033:0x7f817bf7c21d
[  184.382202] RSP: 002b:00007fff92c8d148 EFLAGS: 00000206 ORIG_RAX: 00000000000000e7
[  184.382676] RAX: ffffffffffffffda RBX: 00007f817c092fa8 RCX: 00007f817bf7c21d
[  184.383138] RDX: 00000000000000e7 RSI: ffffffffffffff88 RDI: 0000000000000001
[  184.383582] RBP: 00007fff92c8d1a0 R08: 00007fff92c8d0e8 R09: 0000000000000000
[  184.384042] R10: 00007fff92c8d05f R11: 0000000000000206 R12: 0000000000000001
[  184.384486] R13: 0000000000000000 R14: 0000000000000001 R15: 00007f817c092fc0
[  184.384963]  </TASK>
[  184.385112]
[  184.385112] Showing all locks held in the system:
[  184.385499] 1 lock held by khungtaskd/32:
[  184.385793]  #0: ffffffff9d195d90 (rcu_read_lock){....}-{1:3}, at: debug_show_all_locks+0x2e/0x180
[  184.386366] 2 locks held by kworker/u8:10/455:
[  184.386649]  #0: ffffa1a240104d48 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_scheduled_works+0x23a/0x600
[  184.387357]  #1: ffffb06380a23e20 ((work_completion)(&sub_info->work)){+.+.}-{0:0}, at: process_scheduled_works+0x25a/0x600
[  184.388076]
[  184.388183] =============================================

I pushed the full logs to GitHub:

https://github.com/kernel-patches/bpf/commit/88c0d0e1692b04c0d54b7c1941003758d23e0d6a

I
recommend trying to reproduce this with the steps I shared in my initial
report:

https://lore.kernel.org/bpf/a7x33d4dnMdGTtRivptq6S1i8btK70SNBP2XyX_xwDAhLvgQoPox6FVBOkifq4eBinfFfbZlIkMZBe3QarlWTxoEtHZwJCZbNKtaqrR7PvI=@pm.me/

I know it may not be very convenient due to all the CI tooling, but you
should be able to use it to iterate on the kernel source locally and
narrow down the problem. I have everything set up, so you might also
share some debugging code for me to run, if you prefer.

Thanks.

---

Not directly related, but it took me a while to figure out how to
collect the netfs stats. I first added:

CONFIG_NETFS_DEBUG=y
CONFIG_NETFS_STATS=y

But that didn't work, because /proc/fs/netfs/stats is created only when
CONFIG_FSCACHE_STATS is set (fs/netfs/main.c):

#ifdef CONFIG_FSCACHE_STATS
	if (!proc_create_single("fs/netfs/stats",
				S_IFREG | 0444, NULL,
				netfs_stats_show))
		goto error_procfile;
#endif

And that depends on CONFIG_FSCACHE=y, so I ended up with:

CONFIG_FSCACHE=y
CONFIG_FSCACHE_STATS=y
CONFIG_NETFS_SUPPORT=y
CONFIG_NETFS_DEBUG=y
CONFIG_NETFS_STATS=y
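In case it helps anyone else hitting the same snag, the dependency check
above is easy to script. A small sketch (the function name and the idea of
passing the .config path as an argument are mine, not from any existing
tool):

```shell
# check_netfs_stats_config: verify that all options needed for
# /proc/fs/netfs/stats to appear are enabled in a kernel config.
# Pass the path to your kernel's .config as the first argument.
check_netfs_stats_config() {
    config="$1"
    rc=0
    for opt in CONFIG_FSCACHE CONFIG_FSCACHE_STATS \
               CONFIG_NETFS_SUPPORT CONFIG_NETFS_STATS; do
        if grep -q "^${opt}=y" "$config"; then
            echo "ok      $opt"
        else
            echo "missing $opt"
            rc=1
        fi
    done
    return $rc
}
```

Running it against the selftests config before building would have saved
me a kernel rebuild or two.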