Re: Crash in acpi_ns_validate_handle triggered by soundwire on Linux 5.10

Marcin Ślusarz <marcin.slusarz@xxxxxxxxx> · Thu, 4 Feb 2021 13:11:21 +0100

On Mon, 1 Feb 2021 at 13:16, Marcin Ślusarz <marcin.slusarz@xxxxxxxxx> wrote:
>
> On Mon, 1 Feb 2021 at 12:43, Rafael J. Wysocki <rafael@xxxxxxxxxx> wrote:
> >
> > On Fri, Jan 29, 2021 at 9:03 PM Marcin Ślusarz <marcin.slusarz@xxxxxxxxx> wrote:
> > >
> > > On Fri, 29 Jan 2021 at 19:59, Marcin Ślusarz <marcin.slusarz@xxxxxxxxx> wrote:
> > > >
> > > > On Thu, 28 Jan 2021 at 15:32, Marcin Ślusarz <marcin.slusarz@xxxxxxxxx> wrote:
> > > > >
> > > > > On Thu, 28 Jan 2021 at 13:39, Rafael J. Wysocki <rafael@xxxxxxxxxx> wrote:
> > > > > > The only explanation for that I can think about (and which does not
> > > > > > involve supernatural intervention so to speak) is a stack corruption
> > > > > > occurring between these two calls in sdw_intel_acpi_cb().  IOW,
> > > > > > something scribbles on the handle in the meantime, but ATM I have no
> > > > > > idea what that can be.
> > > > >
> > > > > I tried KASAN but it didn't find anything and kernel actually booted
> > > > > successfully.
> > > >
> > > > I investigated this and it looks like a compiler bug (or something nastier),
> > > > but I can't find where exactly registers get corrupted because if I add printks
> > > > the corruption seems on the printk side, but if I don't add them it seems
> > > > the value gets corrupted earlier.
> > > (...)
> > > > I'm using gcc 10.2.1 from Debian testing.
> > >
> > > Someone on IRC, after hearing only that "gcc miscompiles the kernel",
> > > suggested disabling CONFIG_STACKPROTECTOR_STRONG.
> > > It helped indeed and it matches my observations, so it's quite likely it
> > > is the culprit.
> > >
> > > What do we do now?
> >
> > Figure out why the stack protection kicks in, I suppose.
> >
> > The target object is not on the stack, so if the pointer to it is
> > valid (we need to verify somehow that it is indeed), dereferencing it
> > shouldn't cause the stack protection to trigger.
>
> Well, the problem is not that stack protector finds something, but
> the feature itself corrupts some registers.

I retract this statement.

Originally I based it on this piece of code:
   0xffffffff815781f0 <+35>:    mov    %r12,%rdx
   0xffffffff815781f3 <+38>:    mov    $0xffffffff81eca4c0,%rsi
   0xffffffff815781fa <+45>:    mov    $0xffffffff82146d46,%rdi
   0xffffffff81578201 <+52>:    call   0xffffffff818909f1 <printk>
   0xffffffff81578206 <+57>:    cmpb   $0xf,0x8(%r12)
where the crash is on the last line and I supposedly could see the message
printed by printk with the correct value of %r12.
However, after attaching kgdb+kgdboe (it's so much pain...) to the kernel
I discovered that something corrupts memory so badly that the format
string becomes "", which means that I don't actually see the output of printk.
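
For reference, here is a minimal userspace sketch of what I believe the
faulting instruction (cmpb $0xf,0x8(%r12)) corresponds to: a check that the
byte right after the first pointer-sized field of the object holds the
"named node" descriptor type. The names desc_header, DESC_TYPE_NAMED and
validate_handle are made up for illustration, and the layout is my
assumption based on the disassembly, not a copy of the ACPICA code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the header shared by ACPI objects. */
struct desc_header {
    void   *common_pointer;   /* offset 0 */
    uint8_t descriptor_type;  /* directly after the pointer (0x8 on x86-64) */
};

/* Assumed type tag; matches the $0xf immediate in the disassembly. */
#define DESC_TYPE_NAMED 0x0F

/* The type byte must sit right after the pointer for the sketch to match. */
static_assert(offsetof(struct desc_header, descriptor_type) == sizeof(void *),
              "descriptor_type must follow the pointer field");

/* Return the handle if it looks like a named node, NULL otherwise. */
static struct desc_header *validate_handle(void *handle)
{
    struct desc_header *d = handle;

    if (!d)
        return NULL;

    /* This load is where a corrupted (non-NULL) handle would fault. */
    if (d->descriptor_type != DESC_TYPE_NAMED)
        return NULL;

    return d;
}
```

The point is that the check only reads one byte through the handle, so a
fault here means the handle value itself (in %r12) is already bad by the
time the comparison runs.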

So stack corruption from printk is rather unlikely and something else
must be going on.

Before I started messing with kgdb, I tried to bisect this issue - it pointed at
279c3393e2c113365c999f16cd096bcf3d34319e "mm: kmem: move
memcg_kmem_bypass() calls to get_mem/obj_cgroup_from_current()",
which is odd, because it's totally unrelated and doesn't even trigger
recompilation of anything else. I can consistently reproduce the crash
on this commit and can't on the commit before it. Reverting it on 5.10.11
is not possible, because it conflicts with changes that went in after it.

acpi_ns_validate_handle is called hundreds (if not thousands) of times
before it crashes, so I think it's unlikely that it is compiled incorrectly
(and I spent many hours reading the assembly, comparing to what
gcc 9 generates, diving into printk, etc).
Something before it must be corrupting memory.

Another thing that I noticed is that when I set breakpoints in kgdb
on two functions (do_init_module and local_pci_probe) and just hit
"continue" the kernel doesn't crash!

I discovered it because I wanted to trace sdw_intel_acpi_scan /
sdw_intel_acpi_cb to see where the memory is corrupted, but I can't
set breakpoints on code in modules with kgdb :(, so when I tried
to step into this code from module loading the crash disappeared.

The first code I could trace where I see memory corruption is
acpi_bus_get_device, which is called from sdw_intel_scan_controller.
I suspect that sdw_intel_acpi_scan is doing this (which means that
sdw_intel_acpi_cb -> acpi_evaluate_integer is likely to blame),
but I don't have proof.

This issue is driving me mad ;). Please help.

Marcin