> On 5 Apr 2022, at 22:13, John David Anglin <dave.anglin@xxxxxxxx> wrote:
>
> On 2022-03-22 1:52 p.m., Sam James wrote:
>> In Gentoo, we've just got our hands on an RP3440 (PA8800) which seems to quite easily hit inequivalent aliasing issues.
>>
>> We've found that under some workloads, the machine copes fine, none of that appears in dmesg, and all is well - even for
>> over a week. But as soon as we start other workloads (the problematic one is building "stages" -- release media for Gentoo),
>> within 30m or so, the machine is in a broken state, with these messages flooding dmesg:
>> ```
>> Mar 22 04:19:55 muta.hppa.dev.gentoo.org kernel: INEQUIVALENT ALIASES 0x42994000 and 0x426e1000 in file bash
>> Mar 22 04:19:55 muta.hppa.dev.gentoo.org kernel: INEQUIVALENT ALIASES 0x426e1000 and 0x41b56000 in file bash
>> Mar 22 04:19:55 muta.hppa.dev.gentoo.org kernel: INEQUIVALENT ALIASES 0x41b56000 and 0x41aae000 in file bash
>> Mar 22 04:19:55 muta.hppa.dev.gentoo.org kernel: INEQUIVALENT ALIASES 0x41aae000 and 0x42774000 in file bash
>> Mar 22 04:19:55 muta.hppa.dev.gentoo.org kernel: INEQUIVALENT ALIASES 0x42774000 and 0x41202000 in file bash
>> Mar 22 04:19:55 muta.hppa.dev.gentoo.org kernel: INEQUIVALENT ALIASES 0x41202000 and 0x428dd000 in file bash
>> Mar 22 04:19:55 muta.hppa.dev.gentoo.org kernel: INEQUIVALENT ALIASES 0x41e2c000 and 0x418f6000 in file bash
>> Mar 22 04:19:55 muta.hppa.dev.gentoo.org kernel: INEQUIVALENT ALIASES 0x418f6000 and 0x42980000 in file bash
>> Mar 22 04:19:55 muta.hppa.dev.gentoo.org kernel: INEQUIVALENT ALIASES 0x42980000 and 0x426cd000 in file bash
>> Mar 22 04:19:55 muta.hppa.dev.gentoo.org kernel: INEQUIVALENT ALIASES 0x426cd000 and 0x41b42000 in file bash
>> Mar 22 04:19:55 muta.hppa.dev.gentoo.org kernel: INEQUIVALENT ALIASES 0x41b42000 and 0x41a9a000 in file bash
>> Mar 22 04:19:55 muta.hppa.dev.gentoo.org kernel: INEQUIVALENT ALIASES 0x41a9a000 and 0x42760000 in file bash
>> Mar 22 04:19:55 muta.hppa.dev.gentoo.org kernel: INEQUIVALENT ALIASES 0x42760000 and 0x411ee000 in file bash
>> Mar 22 04:19:55 muta.hppa.dev.gentoo.org kernel: INEQUIVALENT ALIASES 0x411ee000 and 0x428c9000 in file bash
>> ```
> It seems all these messages result from a single call to flush_dcache_page. Note the sequential behavior of old_addr
> and addr, and message times.

FWIW, from Helge's config on 5.10.108 (config changes on my end: just disabling unneeded devices to speed up build), I have the same horrible wall:

[...]
[909453.077034] INEQUIVALENT ALIASES 0x41e26000 and 0x42ae2000 in file gmake
[909453.079829] INEQUIVALENT ALIASES 0x428a7000 and 0x41971000 in file python3.9
[909453.084697] INEQUIVALENT ALIASES 0x41d11000 and 0x41add000 in file gmake
[909453.084934] INEQUIVALENT ALIASES 0x41add000 and 0x418f4000 in file gmake
[909453.085426] INEQUIVALENT ALIASES 0x418f4000 and 0x41ded000 in file gmake
[909453.085658] INEQUIVALENT ALIASES 0x41ded000 and 0x42aa9000 in file gmake
[909453.093396] INEQUIVALENT ALIASES 0x41d2c000 and 0x41af8000 in file gmake
[909453.093630] INEQUIVALENT ALIASES 0x41af8000 and 0x4190f000 in file gmake
[909453.094390] INEQUIVALENT ALIASES 0x4190f000 and 0x41e08000 in file gmake
[909453.094621] INEQUIVALENT ALIASES 0x41e08000 and 0x42ac4000 in file gmake
[909453.096778] INEQUIVALENT ALIASES 0x41d2d000 and 0x41af9000 in file gmake
[909453.098128] INEQUIVALENT ALIASES 0x41af9000 and 0x41910000 in file gmake
[909453.098361] INEQUIVALENT ALIASES 0x41910000 and 0x41e09000 in file gmake
[909453.099649] INEQUIVALENT ALIASES 0x41e09000 and 0x42ac5000 in file gmake
[909453.099897] INEQUIVALENT ALIASES 0x41d2a000 and 0x41af6000 in file gmake
[909453.102098] INEQUIVALENT ALIASES 0x41af6000 and 0x4190d000 in file gmake
[909453.103649] INEQUIVALENT ALIASES 0x4190d000 and 0x41e06000 in file gmake
[909453.103649] INEQUIVALENT ALIASES 0x41e06000 and 0x42ac2000 in file gmake
[909453.176099] INEQUIVALENT ALIASES 0x41d26000 and 0x41af2000 in file gmake
[909453.176332] INEQUIVALENT ALIASES 0x41af2000 and 0x41909000 in file gmake
[909453.176781] INEQUIVALENT ALIASES 0x41909000 and 0x41e02000 in file gmake
[909453.177011] INEQUIVALENT ALIASES 0x41e02000 and 0x42abe000 in file gmake
[909453.179720] INEQUIVALENT ALIASES 0x41d4c000 and 0x41b18000 in file gmake
[909453.182175] INEQUIVALENT ALIASES 0x41b18000 and 0x4192f000 in file gmake
[...]
[a while later]
[965092.169806] do_page_fault() command='conftest' type=15 address=0x00000000 in libc-2.33.so[f8418000+17b000]
trap #15: Data TLB miss fault
[965092.170490] CPU: 0 PID: 1786 Comm: conftest Tainted: G E 5.10.108 #1
[965092.170498] Hardware name: 9000/800/rp3440
[965092.170514] YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
[965092.170524] PSW: 00000000000001001111111100001111 Tainted: G E
[965092.170535] r00-03 000000ff0004ff0f 00000000f8597400 00000000f84a9e73 0000000000000003
[965092.170567] r04-07 00000000f8597c00 000000004249cd10 0000000000000010 0000000000000038
[965092.170577] r08-11 00000000f90e3be0 00000000f90e3bc8 000000004249cd10 0000000000000002
[965092.170588] r12-15 00000000f90e39c8 000000004249c5e0 0000000000000007 00000000f90e3a2c
[965092.170599] r16-19 0000000000000002 00000000f90e3a24 0000000000000000 00000000f8597c00
[965092.170610] r20-23 0000000000000003 0000000000000000 00000000f90e3be0 000000004249cd10
[965092.170621] r24-27 0000000000000004 0000000000000018 000000004249cd10 0000000041d34000
[965092.170632] r28-31 000000004249cd10 0000000000000000 00000000f90e3d80 0000000000000000
[965092.170642] sr00-03 000000000ed70800 0000000000000000 0000000000000000 000000000ed70800
[965092.170653] sr04-07 000000000ed70800 000000000ed70800 000000000ed70800 000000000ed70800
[965092.170668] VZOUICununcqcqcqcqcqcrmunTDVZOUI
[965092.170677] FPSR: 00000000000000000000000000000000
[965092.170685] FPER1: 00000000
[965092.170696] fr00-03 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[965092.170707] fr04-07 bff0000000000000 41d88c07fdc00000 3fe999999959554e 4001c28f5c28f5c3
[965092.170718] fr08-11 0000000000000000 8000000000000000 bfe5555555555560 bfe5555555555560
[965092.170729] fr12-15 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[965092.170740] fr16-19 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[965092.170751] fr20-23 0000000000000000 0000000000000000 0000000000000038 00002c6300000032
[965092.170762] fr24-27 0000008000000000 0000000000000000 3fd56217fdb2473a 3fdffffffffffdbd
[965092.170773] fr28-31 bfe0000000000001 3d2ef35793c76730 3fd555555551305b 8000000000000000
[965092.170790] IASQ: 000000000ed70800 000000000ed70800 IAOQ: 00000000f84aa023 00000000f84aa027
[965092.170799] IIR: 0ea810b4 ISR: 000000000ed70800 IOR: 0000000000000000
[965092.170809] CPU: 0 CR30: 000000017655c000 CR31: ffffffffdbfbffff
[965092.170817] ORIG_R28: 0000000000000000
[965092.170825] IAOQ[0]: 00000000f84aa023
[965092.170834] IAOQ[1]: 00000000f84aa027
[965092.170842] RP(r2): 00000000f84a9e73
[967521.206030] INEQUIVALENT ALIASES 0x4124f000 and 0x41f24000 in file ccache
[967521.206325] INEQUIVALENT ALIASES 0x4124e000 and 0x41f23000 in file ccache
[967553.003301] conftest(27686): unaligned access to 0x00000000f9bc7755 at ip=0x000000004243b7fb
[967553.003639] conftest(27686): unaligned access to 0x00000000f9bc7756 at ip=0x000000004243b807
[967830.349792] INEQUIVALENT ALIASES 0x84000 and 0xf6783000 in file cc1
[967830.350074] INEQUIVALENT ALIASES 0x83000 and 0xf6782000 in file cc1
[967830.365661] INEQUIVALENT ALIASES 0x10d000 and 0xf680c000 in file cc1
[967830.366026] INEQUIVALENT ALIASES 0x10c000 and 0xf680b000 in file cc1
[967830.366538] INEQUIVALENT ALIASES 0xfd000 and 0xf67fc000 in file cc1
[...]

I see the TLB miss faults occasionally but not always with the big ALIASES wall.

>
> Possibly, the VMA interval tree is corrupt, so the loop doesn't terminate properly.
>
>         vma_interval_tree_foreach(mpnt, &mapping->i_mmap, pgoff, pgoff) {
>                 offset = (pgoff - mpnt->vm_pgoff) << PAGE_SHIFT;
>                 addr = mpnt->vm_start + offset;
>
>                 /* The TLB is the engine of coherence on parisc: The
>                  * CPU is entitled to speculate any page with a TLB
>                  * mapping, so here we kill the mapping then flush the
>                  * page along a special flush only alias mapping.
>                  * This guarantees that the page is no-longer in the
>                  * cache for any process and nor may it be
>                  * speculatively read in (until the user or kernel
>                  * specifically accesses it, of course) */
>
>                 flush_tlb_page(mpnt, addr);
>                 if (old_addr == 0 || (old_addr & (SHM_COLOUR - 1))
>                                       != (addr & (SHM_COLOUR - 1))) {
>                         __flush_cache_page(mpnt, addr, page_to_phys(page));
>                         if (parisc_requires_coherency() && old_addr)
>                                 printk(KERN_ERR "INEQUIVALENT ALIASES 0x%lx and 0x%lx in file %pD\n", old_addr, addr, mpnt->vm_file);
>                         old_addr = addr;
>                 }
>         }
>
> I see arm skips some VMAs:
>
>         vma_interval_tree_foreach(mpnt, &mapping->i_mmap, pgoff, pgoff) {
>                 unsigned long offset;
>
>                 /*
>                  * If this VMA is not in our MM, we can ignore it.
>                  */
>                 if (mpnt->vm_mm != mm)
>                         continue;
>                 if (!(mpnt->vm_flags & VM_MAYSHARE))
>                         continue;
>                 offset = (pgoff - mpnt->vm_pgoff) << PAGE_SHIFT;
>                 flush_cache_page(mpnt, mpnt->vm_start + offset, page_to_pfn(page));
>         }
>

Is there anything I can do to confirm your suspicion?

> Dave
>
> --
> John David Anglin  dave.anglin@xxxxxxxx

best,
sam
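
For anyone who wants to check the logged addresses by hand, here is a rough userspace sketch of the colour test and the old_addr bookkeeping from the parisc loop quoted above. It is only an illustration, not kernel code: SHM_COLOUR is assumed to be 0x00400000 (the value I believe parisc uses), same_colour() and count_alias_reports() are hypothetical helper names, and the flushes and the parisc_requires_coherency() gate are left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumption: 4 MiB, believed to be the parisc value of SHM_COLOUR. */
#define SHM_COLOUR 0x00400000UL

/* Two user addresses are equivalent aliases if they agree modulo SHM_COLOUR. */
static bool same_colour(unsigned long a, unsigned long b)
{
	return (a & (SHM_COLOUR - 1)) == (b & (SHM_COLOUR - 1));
}

/*
 * Hypothetical helper: replays the old_addr/addr bookkeeping of the quoted
 * loop over a plain array of mapping addresses and returns how many
 * "INEQUIVALENT ALIASES" lines it would log. The cache/TLB flushes and the
 * parisc_requires_coherency() check are deliberately omitted.
 */
static size_t count_alias_reports(const unsigned long *addrs, size_t n)
{
	unsigned long old_addr = 0;
	size_t reports = 0;

	for (size_t i = 0; i < n; i++) {
		unsigned long addr = addrs[i];

		if (old_addr == 0 || !same_colour(old_addr, addr)) {
			if (old_addr)
				reports++;	/* where the kernel would printk */
			old_addr = addr;	/* next report starts from this address */
		}
	}
	return reports;
}
```

Feeding it the first four addresses of the bash chain (0x42994000, 0x426e1000, 0x41b56000, 0x41aae000) gives three reports, each starting with the address the previous one ended on, which is the same chaining visible in the logs.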