On Mon, 1 Feb 2021 at 13:16, Marcin Ślusarz <marcin.slusarz@xxxxxxxxx> wrote:
>
> On Mon, 1 Feb 2021 at 12:43, Rafael J. Wysocki <rafael@xxxxxxxxxx> wrote:
> >
> > On Fri, Jan 29, 2021 at 9:03 PM Marcin Ślusarz <marcin.slusarz@xxxxxxxxx> wrote:
> > >
> > > On Fri, 29 Jan 2021 at 19:59, Marcin Ślusarz <marcin.slusarz@xxxxxxxxx> wrote:
> > > >
> > > > On Thu, 28 Jan 2021 at 15:32, Marcin Ślusarz <marcin.slusarz@xxxxxxxxx> wrote:
> > > > >
> > > > > On Thu, 28 Jan 2021 at 13:39, Rafael J. Wysocki <rafael@xxxxxxxxxx> wrote:
> > > > > > The only explanation for that I can think about (and which does not
> > > > > > involve supernatural intervention so to speak) is a stack corruption
> > > > > > occurring between these two calls in sdw_intel_acpi_cb(). IOW,
> > > > > > something scribbles on the handle in the meantime, but ATM I have no
> > > > > > idea what that can be.
> > > > > >
> > > > > I tried KASAN but it didn't find anything and kernel actually booted
> > > > > successfully.
> > > >
> > > > I investigated this and it looks like a compiler bug (or something nastier),
> > > > but I can't find where exactly registers get corrupted because if I add printks
> > > > the corruption seems on the printk side, but if I don't add them it seems
> > > > the value gets corrupted earlier.
> > > (...)
> > > > I'm using gcc 10.2.1 from Debian testing.
> > >
> > > Someone on IRC, after hearing only that "gcc miscompiles the kernel",
> > > suggested disabling CONFIG_STACKPROTECTOR_STRONG.
> > > It helped indeed and it matches my observations, so it's quite likely it
> > > is the culprit.
> > >
> > > What do we do now?
> >
> > Figure out why the stack protection kicks in, I suppose.
> >
> > The target object is not on the stack, so if the pointer to it is
> > valid (we need to verify somehow that it is indeed), dereferencing it
> > shouldn't cause the stack protection to trigger.
>
> Well, the problem is not that stack protector finds something, but
> the feature itself corrupts some registers.

I retract this statement. Originally I based it on this piece of code:

   0xffffffff815781f0 <+35>:    mov    %r12,%rdx
   0xffffffff815781f3 <+38>:    mov    $0xffffffff81eca4c0,%rsi
   0xffffffff815781fa <+45>:    mov    $0xffffffff82146d46,%rdi
   0xffffffff81578201 <+52>:    call   0xffffffff818909f1 <printk>
   0xffffffff81578206 <+57>:    cmpb   $0xf,0x8(%r12)

where the crash is on the last line and I supposedly could see the
message printed by printk with the correct value of %r12.

However, after attaching kgdb+kgdboe (it's such a pain...) to the
kernel I discovered that something corrupts memory so badly that the
format string becomes "", which means that I don't actually see the
output of printk. So stack corruption from printk is rather unlikely
and something else must be going on.

Before I started messing with kgdb, I tried to bisect this issue. It
pointed at 279c3393e2c113365c999f16cd096bcf3d34319e
"mm: kmem: move memcg_kmem_bypass() calls to get_mem/obj_cgroup_from_current()",
which is odd, because it's totally unrelated and doesn't even trigger
recompilation of anything else. I can consistently reproduce the crash
on this commit and can't on the commit before it. Reverting it on
5.10.11 is not possible, because it conflicts with changes that went in
after this one.

acpi_ns_validate_handle is called hundreds (if not thousands) of times
before it crashes, so I think it's unlikely that it is compiled
incorrectly (and I spent many hours reading the assembly, comparing it
to what gcc 9 generates, diving into printk, etc). Something before it
must be corrupting memory.

Another thing that I noticed is that when I set breakpoints in kgdb on
two functions (do_init_module and local_pci_probe) and just hit
"continue", the kernel doesn't crash!
I discovered it because I wanted to trace sdw_intel_acpi_scan /
sdw_intel_acpi_cb to see where the memory is corrupted, but I can't set
breakpoints on code in modules with kgdb :(, so when I tried to step
into this code from module loading the crash disappeared.

The first code I could trace where I see memory corruption is
acpi_bus_get_device, which is called from sdw_intel_scan_controller.
I suspect that sdw_intel_acpi_scan is what corrupts the memory (which
means that sdw_intel_acpi_cb -> acpi_evaluate_integer is likely to
blame), but I don't have proof.

This issue is driving me mad ;). Please help.

Marcin