Re: [PATCH] netfs: Add retry stat counters

"Ihor Solodrai" <ihor.solodrai@xxxxxxxxx> · Mon, 10 Feb 2025 21:54:19 +0000

On 2/10/25 2:57 AM, David Howells wrote:
> Ihor Solodrai <ihor.solodrai@xxxxxxxxx> wrote:
>
>> I recommend trying to reproduce with steps I shared in my initial report:
>> https://lore.kernel.org/bpf/a7x33d4dnMdGTtRivptq6S1i8btK70SNBP2XyX_xwDAhLvgQoPox6FVBOkifq4eBinfFfbZlIkMZBe3QarlWTxoEtHZwJCZbNKtaqrR7PvI=@pm.me/
>>
>> I know it may not be very convenient due to all the CI stuff,
>
> That's an understatement. :-)
>
>> but you should be able to use it to iterate on the kernel source locally and
>> narrow down the problem.
>
> Can you share just the reproducer without all the docker stuff?  

I wrote a couple of shell scripts with a gist of what's happening on
CI: build kernel, build selftests and run. You may try them.

Pull this branch from my github:
https://github.com/theihor/bpf/tree/netfs-debug

It's the kernel source in a broken state with the scripts.
Inlining the scripts here:

## ./reproducer.sh

#!/bin/bash

set -euo pipefail

export KBUILD_OUTPUT=$(realpath kbuild-output)
mkdir -p $KBUILD_OUTPUT

cp -f repro.config $KBUILD_OUTPUT/.config
make olddefconfig
make -j$(nproc) all
make -j$(nproc) headers

# apt install lsb-release wget software-properties-common gnupg
# bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)"
export LLVM_VERSION=18

make -C tools/testing/selftests/bpf \
     CLANG=clang-${LLVM_VERSION} \
     LLC=llc-${LLVM_VERSION} \
     LLVM_STRIP=llvm-strip-${LLVM_VERSION} \
     -j$(nproc) test_progs-no_alu32

# wget https://github.com/danobi/vmtest/releases/download/v0.15.0/vmtest-x86_64
# chmod +x vmtest-x86_64
./vmtest-x86_64 -k $KBUILD_OUTPUT/$(make -s image_name) ./run-bpf-selftests.sh | tee test.log

## end of ./reproducer.sh

## ./run-bpf-selftests.sh

#!/bin/bash

/bin/mount bpffs /sys/fs/bpf -t bpf
ip link set lo up

echo 10 > /proc/sys/kernel/hung_task_timeout_secs
echo 1 >/sys/kernel/debug/tracing/events/netfs/netfs_read/enable
echo 1 >/sys/kernel/debug/tracing/events/netfs/netfs_write/enable
echo 1 >/sys/kernel/debug/tracing/events/netfs/netfs_write_iter/enable
echo 1 >/sys/kernel/debug/tracing/events/netfs/netfs_rreq/enable
echo 1 >/sys/kernel/debug/tracing/events/netfs/netfs_rreq_ref/enable
echo 1 >/sys/kernel/debug/tracing/events/netfs/netfs_sreq/enable
echo 1 >/sys/kernel/debug/tracing/events/netfs/netfs_sreq_ref/enable
echo 1 >/sys/kernel/debug/tracing/events/netfs/netfs_failure/enable

function tail_proc {
    src=$1
    dst=$2
    echo -n > $dst
    while true; do
        echo >> $dst
        cat $src >> $dst
        sleep 1
    done
}
export -f tail_proc

nohup bash -c 'tail_proc /proc/fs/netfs/stats netfs-stats.log' & disown
nohup bash -c 'tail_proc /proc/fs/netfs/requests netfs-requests.log' & disown
nohup bash -c 'trace-cmd show -p > trace-cmd.log' & disown

cd tools/testing/selftests/bpf
./test_progs-no_alu32

## end of ./run-bpf-selftests.sh

One of the reasons for suggesting docker is that all the dependencies
are pre-packaged in the image, and so the environment is pretty close
to the actual CI environment. With only shell scripts you will have to
detect and install missing dependencies on your system and hope
package versions are more or less the same and don't affect the issue.

Notable things: LLVM 18, pahole, qemu, qemu-guest-agent, vmtest tool.

> Is this one
> of those tests that requires 9p over virtio?  I have a different environment
> for that.

We run the tests via vmtest tool: https://github.com/danobi/vmtest
This is essentially a qemu wrapper.

I am not familiar with its internals, but for sure it is using 9p.

On 2/10/25 3:12 AM, David Howells wrote:
> Ihor Solodrai <ihor.solodrai@xxxxxxxxx> wrote:
>
>> Bash piece starting a process collecting /proc/fs/netfs/stats:
>>
>>     function tail_netfs {
>>         echo -n > /mnt/vmtest/netfs-stats.log
>>         while true; do
>>             echo >> /mnt/vmtest/netfs-stats.log
>>             cat /proc/fs/netfs/stats >> /mnt/vmtest/netfs-stats.log
>>             sleep 1
>>         done
>>     }
>>     export -f tail_netfs
>>     nohup bash -c 'tail_netfs' & disown
>
> I'm afraid, intermediate snapshots of this file aren't particularly useful -
> just the last snapshot:

The reason I wrote it like this is because the test runner hangs, and
so I have to kill qemu to stop it (with no ability to run
post-processing within qemu instance; well, at least I don't know how
to do it).

>
> [...]
>
> Could you collect some tracing:
>
> echo 1 >/sys/kernel/debug/tracing/events/netfs/netfs_read/enable
> echo 1 >/sys/kernel/debug/tracing/events/netfs/netfs_write/enable
> echo 1 >/sys/kernel/debug/tracing/events/netfs/netfs_write_iter/enable
> echo 1 >/sys/kernel/debug/tracing/events/netfs/netfs_rreq/enable
> echo 1 >/sys/kernel/debug/tracing/events/netfs/netfs_rreq_ref/enable
> echo 1 >/sys/kernel/debug/tracing/events/netfs/netfs_sreq/enable
> echo 1 >/sys/kernel/debug/tracing/events/netfs/netfs_sreq_ref/enable
> echo 1 >/sys/kernel/debug/tracing/events/netfs/netfs_failure/enable
>
> and then collect the tracelog:
>
> trace-cmd show | bzip2 >some_file_somewhere.bz2
>
> And if you could collect /proc/fs/netfs/requests as well, that will show the
> debug IDs of the hanging requests.  These can be used to grep the trace by
> prepending "R=".  For example, if you see:
>
> 	REQUEST  OR REF FL ERR  OPS COVERAGE
> 	======== == === == ==== === =========
> 	00000043 WB   1 2120    0   0 @34000000 0/0
>
> then:
>
> 	trace-cmd show | grep R=00000043

Done. I pushed the logs to the previously mentioned github branch:
https://github.com/kernel-patches/bpf/commit/699a3bb95e2291d877737438fb641628702fd18f

Let me know if I can help with anything else.

>
> Thanks,
> David
>