On 2/10/25 2:57 AM, David Howells wrote:
> Ihor Solodrai <ihor.solodrai@xxxxxxxxx> wrote:
>
>> I recommend trying to reproduce with steps I shared in my initial report:
>> https://lore.kernel.org/bpf/a7x33d4dnMdGTtRivptq6S1i8btK70SNBP2XyX_xwDAhLvgQoPox6FVBOkifq4eBinfFfbZlIkMZBe3QarlWTxoEtHZwJCZbNKtaqrR7PvI=@pm.me/
>>
>> I know it may not be very convenient due to all the CI stuff,
>
> That's an understatement. :-)
>
>> but you should be able to use it to iterate on the kernel source locally and
>> narrow down the problem.
>
> Can you share just the reproducer without all the docker stuff?

I wrote a couple of shell scripts with a gist of what's happening on
CI: build kernel, build selftests and run. You may try them.

Pull this branch from my github:
https://github.com/theihor/bpf/tree/netfs-debug

It's the kernel source in a broken state with the scripts.
Inlining the scripts here:

## ./reproducer.sh

#!/bin/bash

set -euo pipefail

export KBUILD_OUTPUT=$(realpath kbuild-output)
mkdir -p $KBUILD_OUTPUT

cp -f repro.config $KBUILD_OUTPUT/.config
make olddefconfig
make -j$(nproc) all
make -j$(nproc) headers

# apt install lsb-release wget software-properties-common gnupg
# bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)"
export LLVM_VERSION=18

make -C tools/testing/selftests/bpf \
     CLANG=clang-${LLVM_VERSION} \
     LLC=llc-${LLVM_VERSION} \
     LLVM_STRIP=llvm-strip-${LLVM_VERSION} \
     -j$(nproc) test_progs-no_alu32

# wget https://github.com/danobi/vmtest/releases/download/v0.15.0/vmtest-x86_64
# chmod +x vmtest-x86_64
./vmtest-x86_64 -k $KBUILD_OUTPUT/$(make -s image_name) ./run-bpf-selftests.sh | tee test.log

## end of ./reproducer.sh

## ./run-bpf-selftests.sh

#!/bin/bash

/bin/mount bpffs /sys/fs/bpf -t bpf
ip link set lo up

echo 10 > /proc/sys/kernel/hung_task_timeout_secs

echo 1 >/sys/kernel/debug/tracing/events/netfs/netfs_read/enable
echo 1 >/sys/kernel/debug/tracing/events/netfs/netfs_write/enable
echo 1 >/sys/kernel/debug/tracing/events/netfs/netfs_write_iter/enable
echo 1 >/sys/kernel/debug/tracing/events/netfs/netfs_rreq/enable
echo 1 >/sys/kernel/debug/tracing/events/netfs/netfs_rreq_ref/enable
echo 1 >/sys/kernel/debug/tracing/events/netfs/netfs_sreq/enable
echo 1 >/sys/kernel/debug/tracing/events/netfs/netfs_sreq_ref/enable
echo 1 >/sys/kernel/debug/tracing/events/netfs/netfs_failure/enable

function tail_proc {
    src=$1
    dst=$2
    echo -n > $dst
    while true; do
        echo >> $dst
        cat $src >> $dst
        sleep 1
    done
}
export -f tail_proc

nohup bash -c 'tail_proc /proc/fs/netfs/stats netfs-stats.log' & disown
nohup bash -c 'tail_proc /proc/fs/netfs/requests netfs-requests.log' & disown
nohup bash -c 'trace-cmd show -p > trace-cmd.log' & disown

cd tools/testing/selftests/bpf
./test_progs-no_alu32

## end of ./run-bpf-selftests.sh

One of the reasons for suggesting docker is that all the dependencies
are pre-packaged in the image, and so the environment is pretty close
to the actual CI environment. With only shell scripts you will have to
detect and install missing dependencies on your system and hope
package versions are more or less the same and don't affect the issue.

Notable things: LLVM 18, pahole, qemu, qemu-guest-agent, vmtest tool.

> Is this one
> of those tests that requires 9p over virtio? I have a different environment
> for that.

We run the tests via vmtest tool: https://github.com/danobi/vmtest
This is essentially a qemu wrapper.

I am not familiar with its internals, but for sure it is using 9p.

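Since the plain scripts leave dependency detection to the local system, a small pre-flight check may save a failed build. The following is only a sketch: the helper name missing_tools is hypothetical, and the binary names in the usage comment (qemu-system-x86_64, qemu-ga, trace-cmd) are assumptions about how the listed packages are installed, not something the CI image is known to guarantee.

```shell
#!/bin/bash
# Hypothetical helper: print every command from the argument list
# that cannot be found in PATH, one per line.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example use before running ./reproducer.sh (binary names are assumptions):
#
#   LLVM_VERSION=18
#   missing=$(missing_tools clang-$LLVM_VERSION llc-$LLVM_VERSION \
#                           llvm-strip-$LLVM_VERSION pahole \
#                           qemu-system-x86_64 qemu-ga trace-cmd)
#   [ -z "$missing" ] || { echo "missing: $missing" >&2; exit 1; }
```

This only checks that the tools exist; it does not compare versions, so differences from the CI environment can still affect the result.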
On 2/10/25 3:12 AM, David Howells wrote:
> Ihor Solodrai <ihor.solodrai@xxxxxxxxx> wrote:
>
>> Bash piece starting a process collecting /proc/fs/netfs/stats:
>>
>>     function tail_netfs {
>>         echo -n > /mnt/vmtest/netfs-stats.log
>>         while true; do
>>             echo >> /mnt/vmtest/netfs-stats.log
>>             cat /proc/fs/netfs/stats >> /mnt/vmtest/netfs-stats.log
>>             sleep 1
>>         done
>>     }
>>     export -f tail_netfs
>>     nohup bash -c 'tail_netfs' & disown
>
> I'm afraid, intermediate snapshots of this file aren't particularly useful -
> just the last snapshot:

The reason I wrote it like this is because the test runner hangs, and
so I have to kill qemu to stop it (with no ability to run
post-processing within qemu instance; well, at least I don't know how
to do it).

>
> [...]
>
> Could you collect some tracing:
>
> echo 1 >/sys/kernel/debug/tracing/events/netfs/netfs_read/enable
> echo 1 >/sys/kernel/debug/tracing/events/netfs/netfs_write/enable
> echo 1 >/sys/kernel/debug/tracing/events/netfs/netfs_write_iter/enable
> echo 1 >/sys/kernel/debug/tracing/events/netfs/netfs_rreq/enable
> echo 1 >/sys/kernel/debug/tracing/events/netfs/netfs_rreq_ref/enable
> echo 1 >/sys/kernel/debug/tracing/events/netfs/netfs_sreq/enable
> echo 1 >/sys/kernel/debug/tracing/events/netfs/netfs_sreq_ref/enable
> echo 1 >/sys/kernel/debug/tracing/events/netfs/netfs_failure/enable
>
> and then collect the tracelog:
>
> trace-cmd show | bzip2 >some_file_somewhere.bz2
>
> And if you could collect /proc/fs/netfs/requests as well, that will show the
> debug IDs of the hanging requests. These can be used to grep the trace by
> prepending "R=". For example, if you see:
>
>       REQUEST  OR REF FL ERR  OPS COVERAGE
>       ======== == === == ==== === =========
>       00000043 WB   1 2120    0   0 @34000000 0/0
>
> then:
>
>       trace-cmd show | grep R=00000043

Done. I pushed the logs to the previously mentioned github branch:
https://github.com/kernel-patches/bpf/commit/699a3bb95e2291d877737438fb641628702fd18f

Let me know if I can help with anything else.

>
> Thanks,
> David
>
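
For anyone going through the pushed logs offline, the "R=" lookup described above could be scripted roughly as follows. This is a sketch under assumptions: the function names are hypothetical, the requests log is assumed to be a series of snapshots that each start with the "REQUEST ..." header line (as the tail_proc loop would produce), and debug IDs are assumed to be 8 lowercase hex digits in the first column, as in the quoted example.

```shell
#!/bin/bash
# Print the request debug IDs listed in the LAST snapshot of a requests log.
# Each "REQUEST ..." header starts a new snapshot, so earlier IDs are dropped.
last_snapshot_ids() {
    awk '
        $1 == "REQUEST" { n = 0; next }
        $1 ~ /^[0-9a-f]+$/ && length($1) == 8 { ids[n++] = $1 }
        END { for (i = 0; i < n; i++) print ids[i] }
    ' "$1"
}

# For every ID still present in the last snapshot, print the matching
# trace lines under a "### R=<id>" marker.
grep_hung_requests() {
    local requests_log=$1 trace_log=$2 id
    for id in $(last_snapshot_ids "$requests_log"); do
        echo "### R=$id"
        grep -F "R=$id" "$trace_log" || true   # no match is not an error
    done
}
```

With the file names used in run-bpf-selftests.sh this would be invoked as `grep_hung_requests netfs-requests.log trace-cmd.log`. Only the last snapshot is used because requests that completed earlier are not the hanging ones.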