----- Original Message -----
> Hi Jan,
>
> Jan Stancek <jstancek@xxxxxxxxxx> writes:
> > ----- Original Message -----
> >>
> >> Hello,
> >>
> >> We ran automated tests on a recent commit from this kernel tree:
> >>
> >>        Kernel repo:
> >>        git://git.kernel.org/pub/scm/linux/kernel/git/stable/stable-queue.git
> >>             Commit: 3b5f97139acc - KVM: PPC: Book3S HV: Flush link stack on
> >>             guest exit to host kernel
>
> I can't find this commit, I assume it's roughly the same as:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git/commit/?h=linux-5.3.y&id=0815f75f90178bc7e1933cf0d0c818b5f3f5a20c

Hi,

yes, that looks like the same one:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/stable-queue.git/commit/?h=3b5f97139acc

Looking at CKI reports for the past 2 weeks, there were 3 (unexplained)
SIGBUS related failures:

5.3.13-3b5f971.cki@upstream-stable LTP genpower Bus error
5.4.0-rc8-4b17a56.cki@upstream-stable LTP genatan Bus error
5.3.11-200.fc30 xfstests
  +/var/lib/xfstests/tests/generic/248: line 38: 161943 Bus error (core dumped) $TEST_PROG $TESTFILE

All 3 are from ppc64le, all power9 systems.

>
> >> The results of these automated tests are provided below.
> >>
> >>     Overall result: FAILED (see details below)
> >>              Merge: OK
> >>            Compile: OK
> >>              Tests: FAILED
> >>
> >> All kernel binaries, config files, and logs are available for download
> >> here:
> >>
> >>   https://artifacts.cki-project.org/pipelines/314344
> >>
> >> One or more kernel tests failed:
> >>
> >>     ppc64le:
> >>      ❌ LTP
> >
> > I suspect kernel bug.
>
> Looks that way, but I can't reproduce it on a machine here.
>
> I have the same CPU revision and am booting the exact kernel binary &
> modules linked above.

I can semi-reliably reproduce it with:
(where LTP is installed to /mnt/testarea/ltp)

while [ True ]; do
        echo 3 > /proc/sys/vm/drop_caches
        rm -f /mnt/testarea/ltp/results/RUNTEST.log /mnt/testarea/ltp/output/RUNTEST.run.log
        ./runltp -p -d results -l RUNTEST.log -o RUNTEST.run.log -f math
        grep FAIL /mnt/testarea/ltp/results/RUNTEST.log && exit 1
done

with some stress activity in another terminal (e.g. kernel build).
Sometimes it fails in minutes, sometimes in hours.

I did try a couple of older kernels and could reproduce it with v4.19 and
v5.0 as well. v4.18 ran OK for 2 hours. Assuming that one is good, it could
be related to xfs switching to iomap in 4.19-rc1.

Tracing so far led me to filemap_fault(), where it reached this -EIO,
before returning SIGBUS.

page_not_uptodate:
        /*
         * Umm, take care of errors if the page isn't up-to-date.
         * Try to re-read it _once_. We do this synchronously,
         * because there really aren't any performance issues here
         * and we need to check for errors.
         */
        ClearPageError(page);
        fpin = maybe_unlock_mmap_for_io(vmf, fpin);
        error = mapping->a_ops->readpage(file, page);
        if (!error) {
                wait_on_page_locked(page);
                if (!PageUptodate(page))
                        error = -EIO;
        }
...
        return VM_FAULT_SIGBUS;

>
> > There were couple of 'math' runtest related failures in recent couple days.
> > In all cases, some data file used by test was missing. Presumably because
> > binary that generates it crashed.
> >
> > I managed to reproduce one failure with this CKI build, which I believe
> > is the same problem.
> >
> > We crash early during load, before any LTP code runs:
> >
> > (gdb) r
> > Starting program: /mnt/testarea/ltp/testcases/bin/genasin
>
> What is this /mnt/testarea? Looks like it's setup by some of the beaker
> scripts or something?

Correct, it's where the beaker script installs LTP.
It's not a real mount, just a directory on /. In my case it's xfs.
It should match the default Fedora-31 Server ppc64le installation.

>
> I'm running LTP out of /home, which is ext4 directly on disk.
>
> I tried getting the tests-beaker stuff working on my machine, but I
> couldn't find all the libraries and so on it requires.
>
> >
> > Program received signal SIGBUS, Bus error.
> > dl_main (phdr=0x10000040, phnum=<optimized out>, user_entry=0x7fffffffe760,
> > auxv=<optimized out>) at rtld.c:1362
> > 1362            switch (ph->p_type)
> > (gdb) bt
> > #0  dl_main (phdr=0x10000040, phnum=<optimized out>,
> > user_entry=0x7fffffffe760, auxv=<optimized out>) at rtld.c:1362
> > #1  0x00007ffff7fcf3c8 in _dl_sysdep_start (start_argptr=<optimized out>,
> > dl_main=0x7ffff7fb37b0 <dl_main>) at ../elf/dl-sysdep.c:253
> > #2  0x00007ffff7fb1d1c in _dl_start_final (arg=arg@entry=0x7fffffffee20,
> > info=info@entry=0x7fffffffe870) at rtld.c:445
> > #3  0x00007ffff7fb2f5c in _dl_start (arg=0x7fffffffee20) at rtld.c:537
> > #4  0x00007ffff7fb14d8 in _start () from /lib64/ld64.so.2
> > (gdb) f 0
> > #0  dl_main (phdr=0x10000040, phnum=<optimized out>,
> > user_entry=0x7fffffffe760, auxv=<optimized out>) at rtld.c:1362
> > 1362            switch (ph->p_type)
> > (gdb) l
> > 1357      /* And it was opened directly.  */
> > 1358      ++main_map->l_direct_opencount;
> > 1359
> > 1360      /* Scan the program header table for the dynamic section.  */
> > 1361      for (ph = phdr; ph < &phdr[phnum]; ++ph)
> > 1362        switch (ph->p_type)
> > 1363          {
> > 1364          case PT_PHDR:
> > 1365            /* Find out the load address.  */
> > 1366            main_map->l_addr = (ElfW(Addr)) phdr - ph->p_vaddr;
> >
> > (gdb) p ph
> > $1 = (const Elf64_Phdr *) 0x10000040
> >
> > (gdb) p *ph
> > Cannot access memory at address 0x10000040
> >
> > (gdb) info proc map
> > process 1110670
> > Mapped address spaces:
> >
> >           Start Addr           End Addr       Size     Offset objfile
> >           0x10000000         0x10010000    0x10000        0x0
> >           /mnt/testarea/ltp/testcases/bin/genasin
> >           0x10010000         0x10030000    0x20000        0x0
> >           /mnt/testarea/ltp/testcases/bin/genasin
> >       0x7ffff7f90000     0x7ffff7fb0000    0x20000        0x0 [vdso]
> >       0x7ffff7fb0000     0x7ffff7fe0000    0x30000        0x0
> >       /usr/lib64/ld-2.30.so
> >       0x7ffff7fe0000     0x7ffff8000000    0x20000    0x20000
> >       /usr/lib64/ld-2.30.so
> >       0x7ffffffd0000     0x800000000000    0x30000        0x0 [stack]
> >
> > (gdb) x/1x 0x10000040
> > 0x10000040:     Cannot access memory at address 0x10000040
>
> Yeah that's weird.
>
> > # /mnt/testarea/ltp/testcases/bin/genasin
> > Bus error (core dumped)
> >
> > However, as soon as I copy that binary somewhere else, it works fine:
> >
> > # cp /mnt/testarea/ltp/testcases/bin/genasin /tmp
> > # /tmp/genasin
> > # echo $?
> > 0
>
> Is /tmp a real disk or tmpfs?

tmpfs

Filesystem                           Type     1K-blocks     Used Available Use% Mounted on
devtmpfs                             devtmpfs 254530176        0 254530176   0% /dev
tmpfs                                tmpfs    267992768        0 267992768   0% /dev/shm
tmpfs                                tmpfs    267992768     9152 267983616   1% /run
/dev/mapper/fedora_ibm--p9b--03-root xfs       15718400 13029284   2689116  83% /
tmpfs                                tmpfs    267992768        0 267992768   0% /tmp
/dev/sda1                            xfs        1038336   944588     93748  91% /boot
tmpfs                                tmpfs     53598528        0  53598528   0% /run/user/0
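
The in-place vs. copied comparison above (Bus error from the xfs path, exit
status 0 from the copy on tmpfs) could be scripted roughly as below. This is
only a sketch: the helper names died_with_sigbus and compare_in_place_vs_copy
are made up for illustration, and it assumes the shell reports a
signal-killed child as exit status 128 + signal number and that mktemp puts
the scratch directory on a different filesystem (here /tmp, which is tmpfs).

```shell
#!/bin/bash
# Hypothetical helpers, not part of LTP or the CKI tooling.

# Succeeds (returns 0) only if running "$@" ended with SIGBUS.
died_with_sigbus() {
    "$@" >/dev/null 2>&1
    local status=$?
    # A status above 128 means the child was killed by a signal;
    # "kill -l" maps that status back to the signal name.
    [ "$status" -gt 128 ] && [ "$(kill -l "$status" 2>/dev/null)" = "BUS" ]
}

# Runs BINARY where it lives, then runs a copy from a scratch directory,
# and prints one result line for each.
compare_in_place_vs_copy() {
    local bin=$1 scratch
    scratch=$(mktemp -d) || return 2
    cp "$bin" "$scratch/" || { rm -rf "$scratch"; return 2; }

    if died_with_sigbus "$bin"; then
        echo "in-place: SIGBUS"
    else
        echo "in-place: no SIGBUS"
    fi

    if died_with_sigbus "$scratch/$(basename "$bin")"; then
        echo "copy: SIGBUS"
    else
        echo "copy: no SIGBUS"
    fi

    rm -rf "$scratch"
}
```

For the case in this thread it would be called as
compare_in_place_vs_copy /mnt/testarea/ltp/testcases/bin/genasin.
Note that "no SIGBUS" only says the run was not killed by SIGBUS, not that
the test itself passed.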