> On 13 Apr 2022, at 00:45, Sam James <sam@xxxxxxxxxx> wrote:
>
>> On 12 Apr 2022, at 14:20, John David Anglin <dave.anglin@xxxxxxxx> wrote:
>>
>> On 2022-04-12 8:27 a.m., John David Anglin wrote:
>>> On 2022-04-12 1:18 a.m., Sam James wrote:
>>>>>> Mar 22 04:19:55 muta.hppa.dev.gentoo.org kernel: INEQUIVALENT ALIASES 0x411ee000 and 0x428c9000 in file bash
>>>>>> ```
>>>>> It seems all these messages result from a single call to flush_dcache_page. Note the sequential behavior of old_addr
>>>>> and addr, and message times.
>>>> FWIW, from Helge's config on 5.10.108 (config changes on my end: just disabling unneeded devices to speed up build), I have the same
>>>> horrible wall:
>>> This change might help:
>>> https://lore.kernel.org/linux-parisc/YlNw8jzP9OQRKvlV@mx3210.localdomain/T/#u
>>>
>>> It applies on top of Helge's current for-next tree which is based on 5.18.0-rc1+.
>>>
>>> The messages will no longer appear with this patch on c8000/rp34xx. However, the loop corruption
>>> might still occur. If that happens, I think the stall detector will trigger, or maybe some other crash.
>>>
>>> The loop is changed to flush all mount points on machines with PA8800 or PA8900 processors as I
>>> believe these CPUs don't support equivalent aliases.
>>
>> Thousands of messages aren't useful. I would suggest adding a BUG_ON statement in the loop that
>> triggers on the first message. That might help find the circumstances that cause the problem.
>>
>
> Your change *seems* to have prevented the "bad wall"! But now we get some silent runtime corruption
> and binaries crashing (5.18.0_rc2 + for-next + your patch).
>
> So this seems like a good improvement given those crashes happened previously too, although maybe
> less often.
>
> Not sure how to get more debugging info yet, there is nothing helpful in dmesg (no messages at all when
> it happens). Suggestions (given it is not hitting that loop)?

Spoke slightly too soon: processes dying / corruption happened with v1, and we maybe got a bit longer
out of v2, but then issues started again (processes dying, nothing in dmesg).

Once the "bad state" happens, the system is generally unreliable. I tried to upgrade man-db and then
I got a gcc ICE:
```
/bin/sh ../../libtool --tag=CC --mode=compile hppa2.0-unknown-linux-gnu-gcc -DHAVE_CONFIG_H -I. -I../.. -DDEFAULT_TEXT_DOMAIN=\"man-db-gnulib\" -Wno-cast-qual -Wno-conversion -Wno-float-equal -Wno-sign-compare -Wno-undef -Wno-unused-function -Wno-unused-parameter -Wno-float-conversion -Wimplicit-fallthrough -Wno-pedantic -Wno-sign-conversion -Wno-type-limits -Wno-unsuffixed-float-constants -O2 -pipe -march=2.0 -Wall -c -o glthread/libgnu_la-threadlib.lo `test -f 'glthread/threadlib.c' || echo './'`glthread/threadlib.c
/bin/sh ../../libtool --tag=CC --mode=compile hppa2.0-unknown-linux-gnu-gcc -DHAVE_CONFIG_H -I. -I../.. -DDEFAULT_TEXT_DOMAIN=\"man-db-gnulib\" -Wno-cast-qual -Wno-conversion -Wno-float-equal -Wno-sign-compare -Wno-undef -Wno-unused-function -Wno-unused-parameter -Wno-float-conversion -Wimplicit-fallthrough -Wno-pedantic -Wno-sign-conversion -Wno-type-limits -Wno-unsuffixed-float-constants -O2 -pipe -march=2.0 -Wall -c -o libgnu_la-timespec.lo `test -f 'timespec.c' || echo './'`timespec.c
during RTL pass: reload
In file included from regex.c:74:
regcomp.c: In function ‘parse_expression’:
regcomp.c:2421:1: internal compiler error: Segmentation fault
 2421 | }
      | ^
libtool: compile: hppa2.0-unknown-linux-gnu-gcc -DHAVE_CONFIG_H -I. -I../.. -DDEFAULT_TEXT_DOMAIN=\"man-db-gnulib\" -Wno-cast-qual -Wno-conversion -Wno-float-equal -Wno-sign-compare -Wno-undef -Wno-unused-function -Wno-unused-parameter -Wno-float-conversion -Wimplicit-fallthrough -Wno-pedantic -Wno-sign-conversion -Wno-type-limits -Wno-unsuffixed-float-constants -O2 -pipe -march=2.0 -Wall -c glthread/threadlib.c -fPIC -DPIC -o glthread/.libs/libgnu_la-threadlib.o
libtool: compile: hppa2.0-unknown-linux-gnu-gcc -DHAVE_CONFIG_H -I. -I../.. -DDEFAULT_TEXT_DOMAIN=\"man-db-gnulib\" -Wno-cast-qual -Wno-conversion -Wno-float-equal -Wno-sign-compare -Wno-undef -Wno-unused-function -Wno-unused-parameter -Wno-float-conversion -Wimplicit-fallthrough -Wno-pedantic -Wno-sign-conversion -Wno-type-limits -Wno-unsuffixed-float-constants -O2 -pipe -march=2.0 -Wall -c timespec.c -fPIC -DPIC -o .libs/libgnu_la-timespec.o
/bin/sh ../../libtool --tag=CC --mode=compile hppa2.0-unknown-linux-gnu-gcc -DHAVE_CONFIG_H -I. -I../.. -DDEFAULT_TEXT_DOMAIN=\"man-db-gnulib\" -Wno-cast-qual -Wno-conversion -Wno-float-equal -Wno-sign-compare -Wno-undef -Wno-unused-function -Wno-unused-parameter -Wno-float-conversion -Wimplicit-fallthrough -Wno-pedantic -Wno-sign-conversion -Wno-type-limits -Wno-unsuffixed-float-constants -O2 -pipe -march=2.0 -Wall -c -o libgnu_la-unistd.lo `test -f 'unistd.c' || echo './'`unistd.c
/bin/sh ../../libtool --tag=CC --mode=compile hppa2.0-unknown-linux-gnu-gcc -DHAVE_CONFIG_H -I. -I../.. -DDEFAULT_TEXT_DOMAIN=\"man-db-gnulib\" -Wno-cast-qual -Wno-conversion -Wno-float-equal -Wno-sign-compare -Wno-undef -Wno-unused-function -Wno-unused-parameter -Wno-float-conversion -Wimplicit-fallthrough -Wno-pedantic -Wno-sign-conversion -Wno-type-limits -Wno-unsuffixed-float-constants -O2 -pipe -march=2.0 -Wall -c -o libgnu_la-dup-safer.lo `test -f 'dup-safer.c' || echo './'`dup-safer.c
0xf61f7313 __libc_start_call_main
        ../sysdeps/nptl/libc_start_call_main.h:58
0xf61f746f __libc_start_main_impl
        /var/tmp/portage/sys-libs/glibc-2.34-r11/work/glibc-2.34/csu/libc-start.c:409
Please submit a full bug report, with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <https://bugs.gentoo.org/> for instructions.
make[4]: *** [Makefile:3664: libgnu_la-regex.lo] Error 1
make[4]: *** Waiting for unfinished jobs....
```

(I don't anticipate this being a genuine ICE, as it only happens when the system becomes "tainted",
and is not reproducible after reboots during normal activity.)

Best,
sam