> On 22 Mar 2022, at 18:14, Helge Deller <deller@xxxxxx> wrote:
>
> Hi Sam,
>

Hi Helge!

> On 3/22/22 18:52, Sam James wrote:
>> In Gentoo, we've just got our hands on an RP3440 (PA8800) which seems to quite easily hit inequivalent aliasing issues.
>>
>> We've found that under some workloads, the machine copes fine, none of that appears in dmesg, and all is well - even for
>> over a week. But as soon as we start other workloads (the problematic one is building "stages" -- release media for Gentoo),
>> within 30m or so, the machine is in a broken state, with these messages flooding dmesg:
>> ```
>> Mar 22 04:19:55 muta.hppa.dev.gentoo.org kernel: INEQUIVALENT ALIASES 0x42994000 and 0x426e1000 in file bash
>> Mar 22 04:19:55 muta.hppa.dev.gentoo.org kernel: INEQUIVALENT ALIASES 0x426e1000 and 0x41b56000 in file bash
>> Mar 22 04:19:55 muta.hppa.dev.gentoo.org kernel: INEQUIVALENT ALIASES 0x41b56000 and 0x41aae000 in file bash
>> Mar 22 04:19:55 muta.hppa.dev.gentoo.org kernel: INEQUIVALENT ALIASES 0x41aae000 and 0x42774000 in file bash
>> Mar 22 04:19:55 muta.hppa.dev.gentoo.org kernel: INEQUIVALENT ALIASES 0x42774000 and 0x41202000 in file bash
>> Mar 22 04:19:55 muta.hppa.dev.gentoo.org kernel: INEQUIVALENT ALIASES 0x41202000 and 0x428dd000 in file bash
>> Mar 22 04:19:55 muta.hppa.dev.gentoo.org kernel: INEQUIVALENT ALIASES 0x41e2c000 and 0x418f6000 in file bash
>> Mar 22 04:19:55 muta.hppa.dev.gentoo.org kernel: INEQUIVALENT ALIASES 0x418f6000 and 0x42980000 in file bash
>> Mar 22 04:19:55 muta.hppa.dev.gentoo.org kernel: INEQUIVALENT ALIASES 0x42980000 and 0x426cd000 in file bash
>> Mar 22 04:19:55 muta.hppa.dev.gentoo.org kernel: INEQUIVALENT ALIASES 0x426cd000 and 0x41b42000 in file bash
>> Mar 22 04:19:55 muta.hppa.dev.gentoo.org kernel: INEQUIVALENT ALIASES 0x41b42000 and 0x41a9a000 in file bash
>> Mar 22 04:19:55 muta.hppa.dev.gentoo.org kernel: INEQUIVALENT ALIASES 0x41a9a000 and 0x42760000 in file bash
>> Mar 22 04:19:55 muta.hppa.dev.gentoo.org kernel: INEQUIVALENT ALIASES 0x42760000 and 0x411ee000 in file bash
>> Mar 22 04:19:55 muta.hppa.dev.gentoo.org kernel: INEQUIVALENT ALIASES 0x411ee000 and 0x428c9000 in file bash
>> ```
>>
>> When it's in this state, GCC ends up ICEing at some point and other userland command fails too (e.g. last night
>> I tried unpacking a kernel and 'xz' failed the first time, but worked the second). It might be of note that I think
>> the failures end up happening during a HPPA 1.1 build.
>>
>> I appreciate this isn't really enough information to solve the problem, but I'm not sure what I need to obtain:
>> any suggestions for how to debug this further & get more information to better receive assistance would be most welcome.
>>
>> The machine is currently running 5.17.0 along with Helge's tree up to (and including) Linus's pull for 5.18.0
>> (https://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux.git/commit/?h=for-next&id=a04b1bf574e1f4875ea91f5c62ca051666443200).
>
> The INEQUIVALENT ALIASES messages are most likely not related to the instability
> of your machine. I see them randomly on the debian buildd servers as well.
>

Understood. The weird thing is, they only happen when the "bad things" start, but I understand it may be
unrelated or just a side-effect of the real issue.

When we see a handful of them, things are usually fine, but when the "issues start", there's 100s of lines
of this in dmesg for all sorts of processes (but mainly bash).

> Instead of using the latest (development) kernels, I'd suggest that you
> first try with a "stable" kernel.
> On the debian buildd servers I'm currently running Kernel 5.10.106+, which is pretty stable.
> I think Dave is running 5.16.x quite ok.

I've been trying 5.10.x (5.10.88, 5.10.93), 5.16.x (5.16.5), 5.17.x (rc2, rc7, rc8), and some others.

All of them are OK until we start doing more work on the machine. :( Then all break in the same way.

>
>> We're also using GCC 11.2 (but a snapshot from their stable 11 branch), glibc 2.34 (with latest patches), and latest
>> Binutils 2.37 (with patches from upstream again).
>>
>> I've also attached the running kernel config in case any suggestions can be made there to either aid debugging or
>> reduce the chances of this issue occurring.
>>
>> TL:DR: Lots of inequivalent aliases issues when running certain intensive workloads (but not others?), system ends up
>> in a bad state and needs a reboot to function correctly (otherwise userland may misbehave/crash), need more help
>> with how to debug/get more information out of it/narrow it down.
>>
>> Of course, if needed, we can provide access to the machine for kernel maintainers and show them how to induce a broken
>> State (or do it for them repeatedly) if we can't find a smaller test case.
>
> Is there any other output in dmesg which is not INEQUIVALENT ALIASES?
> E.g. "stuck processes" messages?
>

Just checked: nothing else :(

Here it is though (I only grepped out sshd to avoid showing user IPs):
https://dev.gentoo.org/~sam/bugs/linux-parisc/2022-03-22-dmesg-muta.txt

Best,
sam