Re: ❌ FAIL: Test report for kernel 5.13.0-rc7 (arm-next, 8ab9b1a9)

Catalin Marinas <catalin.marinas@xxxxxxx> · Tue, 29 Jun 2021 17:35:43 +0100

On Tue, Jun 29, 2021 at 04:14:55PM +0100, Robin Murphy wrote:
> On 2021-06-29 15:44, Lorenzo Pieralisi wrote:
> > On Tue, Jun 29, 2021 at 12:48:14PM +0100, Robin Murphy wrote:
> > > [ +ACPI audience ]
> > > 
> > > On 2021-06-25 12:15, Robin Murphy wrote:
> > > > On 2021-06-25 12:09, Catalin Marinas wrote:
> > > > > On Fri, Jun 25, 2021 at 12:02:52PM +0100, Robin Murphy wrote:
> > > > > > On 2021-06-25 10:52, Veronika Kabatova wrote:
> > > > > > [...]
> > > > > > > > >           ❌ stress: stress-ng
> > > > > > > > 
> > > > > > > > Oh no, this looks like another alignment fault in memcpy:
> > > > > > > > 
> > > > > > > > [13330.651903] Unable to handle kernel paging request at
> > > > > > > > virtual address ffff8000534705ff
[...]
> > > > > > > > [13330.652218] Call trace:
> > > > > > > > [13330.652221]  __memcpy+0x168/0x250
> > > > > > > > [13330.652225]  acpi_data_show+0x5c/0x8c
> > > > > > > > [13330.652232]  sysfs_kf_bin_read+0x78/0xa0
> > > > > > > > [13330.652238]  kernfs_file_read_iter+0x9c/0x1a4
> > > > > > > > [13330.652241]  kernfs_fop_read_iter+0x34/0x50
> > > > > > > > [13330.652244]  new_sync_read+0xdc/0x154
> > > > > > > > [13330.652253]  vfs_read+0x158/0x1e4
> > > > > > > > [13330.652260]  ksys_read+0x64/0xec
> > > > > > > > [13330.652266]  __arm64_sys_read+0x28/0x34
> > > > > > > > [13330.652273]  invoke_syscall+0x50/0x120
> > > > > > > > [13330.652280]  el0_svc_common.constprop.0+0x4c/0xd4
> > > > > > > > [13330.652284]  do_el0_svc+0x30/0x9c
> > > > > > > > [13330.652286]  el0_svc+0x2c/0x54
> > > > > > > > [13330.652294]  el0t_64_sync_handler+0x1a4/0x1b0
> > > > > > > > [13330.652296]  el0t_64_sync+0x19c/0x1a0
> > > > > > > > [13330.652303] Code: a984346c a9c4342c f1010042 54fffee8 (a97c3c8e)
> > > > > > > > [13330.652307] ---[ end trace 227d4380f57145d4 ]---
> > > > > > > > 
> > > > > > > > So maybe this issue isn't limited to weird modules, after all...
> > > > > > > 
> > > > > > > It ran on the machine from the same set that we were able to reproduce
> > > > > > > it on previously. If you or anyone else have an idea on how
> > > > > > > to stabilize the reproducibility or have a debug patch we'll be happy to try it.
> > > > > > 
> > > > > > Possibly it depends on the individual machines' firmware exactly how the
> > > > > > relevant bits of their ACPI tables are aligned in memory?
> > > > > > 
> > > > > > I've started digging into that callstack - it may not be a "weird module"
> > > > > > but it's definitely crusty ACPI code... a238317ce818 ("ACPI: Clean up
> > > > > > acpi_os_map/unmap_memory() to eliminate __iomem.") looks frankly a bit
> > > > > > questionable in its decision to blithely cast away __iomem, but then the
> > > > > > rationale in aafc65c731fe ("ACPI: add arm64 to the platforms that use
> > > > > > ioremap") seems particularly dubious on top of that (especially
> > > > > > given this end result).
[...]
> > > After picking through the UEFI spec I think I've now got a clearer picture
> > > of what's happening, but I'm not sure where it goes from here...
> > > 
> > > The spec implies that it *is* legitimate for runtime-loaded ACPI tables to
> > > lie outside the EFI memory map, and that case they must be assumed to be
> > > uncached, so the behaviour of acpi_os_ioremap() is correct.
> > 
> > I'd agree with the reasoning, it would be good to pinpoint whether
> > that's what actually triggers the issue.
> > 
> > I'd like to replicate it if possible (it is TX2 HW but firmware
> > config is likely to differ from the HW I have at hand), the
> > test command line that triggers the fault would be useful as
> > a starting point.
> > 
> > Furthermore, is this a v5.13-rc* regression ? If so it would be
> > good to bisect it - I can't recollect arm64 changes that could
> > have introduced this regression in the last cycle but I may have
> > missed something.
> 
> The actual change which has brought this to light is the update to arm64's
> memcpy() routine for 5.13 - the new version is more aggressive at making
> unaligned loads from the source buffer, so now triggers alignment faults
> more readily when (wrongly) used on iomem mappings in places that were
> getting away with it by chance under the previous implementation (see also
> [1], for example).

I wouldn't revert any of the memcpy() stuff as it just uncovered an
existing bug in how the ACPI tables are handled. Could we actually hit
a similar issue with C code parsing the ACPI tables?

Is there a way to map the ACPI tables as Normal Noncacheable
(ioremap_wc)? Presumably no-one sane would place ACPI tables in memory
that's sensitive to the access size.

-- 
Catalin