Hi,

On 1/15/21 8:01 PM, Rafael J. Wysocki wrote:
> On Friday, January 15, 2021 5:41:57 PM CET Pierre-Louis Bossart wrote:
>>
>>>>> [    0.516336] ACPI: \_SB_.PCI0.BRCM: Dependencies found
>>>>
>>>> Ah, that is enlightening, that is not supposed to happen, that device
>>>> has both an _ADR and an _HID method which is not allowed according
>>>> to the spec.
>>
>> it's not an uncommon issue for audio codecs, here's three examples:
>>
>>     Device (RTK1)
>>     {
>>         Name (_ADR, Zero)  // _ADR: Address
>>         Name (_HID, "10EC5670")  // _HID: Hardware ID
>>         Name (_CID, "10EC5670")  // _CID: Compatible ID
>>         Name (_DDN, "ALC5642")  // _DDN: DOS Device Name
>>
>>     Device (MAXM)
>>     {
>>         Name (_ADR, Zero)  // _ADR: Address
>>         Name (_HID, "193C9890")  // _HID: Hardware ID
>>         Name (_CID, "193C9890")  // _CID: Compatible ID
>>         Name (_DDN, "Maxim 98090 Codec ")  // _DDN: DOS Device Name
>>
>>     Device (TISW)
>>     {
>>         Name (_ADR, Zero)  // _ADR: Address
>>         Name (_HID, "104C227E")  // _HID: Hardware ID
>>         Name (_CID, "104C227E")  // _CID: Compatible ID
>>
>> It's been that way for years...
>>
>>>> Can you try a clean 5.11 kernel (so none of the previous
>>>> debug patches) with the following change added:
>>>>
>>>> diff --git a/drivers/acpi/scan.c b/drivers/acpi/scan.c
>>>> index 1f27f74cc83c..93954ac3bfcc 100644
>>>> --- a/drivers/acpi/scan.c
>>>> +++ b/drivers/acpi/scan.c
>>>> @@ -1854,7 +1854,8 @@ static u32 acpi_scan_check_dep(acpi_handle handle)
>>>>       * 2. ACPI nodes describing USB ports.
>>>>       * Still, checking for _HID catches more then just these cases ...
>>>>       */
>>>> -    if (!acpi_has_method(handle, "_DEP") || !acpi_has_method(handle, "_HID"))
>>>> +    if (!acpi_has_method(handle, "_DEP") || !acpi_has_method(handle, "_HID") ||
>>>> +        acpi_has_method(handle, "_ADR"))
>>>>          return 0;
>>>>
>>>>      status = acpi_evaluate_reference(handle, "_DEP", NULL, &dep_devices);
>>>>
>>>>
>>>>> [    0.517490] ACPI: \_SB_.PCI0.LNPW: Dependencies found
>>>>
>>>> And idem. for this one.
>>>>
>>>> That might very well fix this.
>>
>> Nope, no luck with this patch. boot still stuck.
>
> OK, thanks!
>
> Now, there is a theory to test and some more debug work to do.
>
> First, the kernel should not crash outright if some ACPI device objects are
> missing which evidently happens here. There may be some problems resulting
> from that, but the crash indicates a code bug in the kernel.
>
> Apparently, something expects the device objects to be there so badly, that it
> crashes right away when they aren't there. One of the issues that may cause
> that to happen are mistakes around the acpi_bus_get_device() usage and I found
> two of them, so below is a patch to test.
>
> Please apply to plain 5.11-rc3 (or -rc4 when it is out) and see if that makes
> any difference.
>
> ---
>  drivers/acpi/scan.c         | 3 +--
>  drivers/usb/core/usb-acpi.c | 3 +--
>  2 files changed, 2 insertions(+), 4 deletions(-)
>
> Index: linux-pm/drivers/acpi/scan.c
> ===================================================================
> --- linux-pm.orig/drivers/acpi/scan.c
> +++ linux-pm/drivers/acpi/scan.c
> @@ -2120,8 +2120,7 @@ void acpi_walk_dep_device_list(acpi_hand
>      mutex_lock(&acpi_dep_list_lock);
>      list_for_each_entry_safe(dep, tmp, &acpi_dep_list, node) {
>          if (dep->supplier == handle) {
> -            acpi_bus_get_device(dep->consumer, &adev);
> -            if (!adev)
> +            if (acpi_bus_get_device(dep->consumer, &adev))
>                  continue;
>
>              adev->dep_unmet--;

Oh, OOOhh, good catch. I've been staring at these exact lines multiple times,
my "spidey sense" telling me that the problem was likely something like this:

1. The addition of the acpi_device gets deferred because of the _DEP list;
   this means that there are now entries for it on the acpi_dep_list.
2. Later during the first pass, or before the handle is checked again during
   the second pass, acpi_walk_dep_device_list() gets called because the _DEP
   is now resolved.
3. My theory was this would lead to doing driver attach twice or some such,
   but that is not possible...

But instead we are following whatever pointer value happens to be in the
stack memory used by the local variable:

    struct acpi_device *adev;

and it seems, at least with my compiler / kernel .config, that the stack
layout is such that this stack memory does contain a valid pointer (from some
previous function's stack frame), and then whatever that points to gets used
as an acpi_device and likely gets mangled a bit. Which explains the
memory-corruption like behavior which I've been seeing.

Specifically I think that this is happening when the MMC controller addition
gets deferred to the second step:

[    0.426722] ACPI: \_SB_.PCI0.SDHB: Dependencies found
[    0.427927] ACPI: \_SB_.PCI0.SDHB.BRCM: Dependencies found
[    0.431863] ACPI: \_SB_.PCI0.SDHC: Dependencies found
[    0.433128] ACPI: \_SB_.PCI0.SHC1: Dependencies found

I'll verify that this fixes both my reproducers (5.10 + backport on
one device, 5.11-rc3 on another dev) by seeing if I can now boot
10 times in a row successfully. But I'm pretty hopeful that this will
fix them. I'm also hopeful that this will fix Pierre-Louis' case too.

> Index: linux-pm/drivers/usb/core/usb-acpi.c
> ===================================================================
> --- linux-pm.orig/drivers/usb/core/usb-acpi.c
> +++ linux-pm/drivers/usb/core/usb-acpi.c
> @@ -163,10 +163,9 @@ usb_acpi_get_companion_for_port(struct u
>      } else {
>          parent_handle = usb_get_hub_port_acpi_handle(udev->parent,
>              udev->portnum);
> -        if (!parent_handle)
> +        if (!parent_handle || acpi_bus_get_device(parent_handle, &adev))
>              return NULL;
>
> -        acpi_bus_get_device(parent_handle, &adev);
>          port1 = port_dev->portnum;
>      }
>

I'm pretty sure that we are not actually hitting this one, but we should
definitely still fix it.

Regards,

Hans
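
For illustration only, here is a minimal userspace sketch of the bug pattern discussed above. It is not kernel code: `fake_device`, `fake_get_device()`, `resolve_old()` and `resolve_fixed()` are hypothetical stand-ins, and the sketch assumes (as the analysis above implies) that the lookup leaves its output pointer untouched on failure. Leftover stack contents are modeled by an explicit `stale` argument so the behavior stays deterministic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct acpi_device; only the field used here. */
struct fake_device {
	int dep_unmet;
};

/*
 * Hypothetical stand-in for a lookup like acpi_bus_get_device():
 * returns 0 and writes *out on success; returns nonzero and leaves
 * *out untouched on failure (assumption made for this sketch).
 */
static int fake_get_device(struct fake_device *attached,
			   struct fake_device **out)
{
	if (!attached)
		return -1;	/* nothing attached: *out is NOT written */
	*out = attached;
	return 0;
}

/* Old pattern: ignore the return value and test the pointer instead. */
static int resolve_old(struct fake_device *attached, struct fake_device *stale)
{
	struct fake_device *adev = stale;	/* models leftover stack data */

	fake_get_device(attached, &adev);
	if (!adev)
		return 0;	/* only skipped if the stale value was NULL */
	adev->dep_unmet--;	/* may modify an unrelated object */
	return 1;
}

/* Fixed pattern: trust only the return value of the lookup. */
static int resolve_fixed(struct fake_device *attached, struct fake_device *stale)
{
	struct fake_device *adev = stale;	/* models leftover stack data */

	if (fake_get_device(attached, &adev))
		return 0;	/* lookup failed: adev is never dereferenced */
	adev->dep_unmet--;
	return 1;
}

/* Small self-check exercising both patterns with a failing lookup. */
static void self_check(void)
{
	struct fake_device victim = { .dep_unmet = 5 };

	assert(resolve_old(NULL, &victim) == 1);	/* stale pointer used */
	assert(victim.dep_unmet == 4);			/* unrelated object changed */
	assert(resolve_fixed(NULL, &victim) == 0);	/* failure detected */
	assert(victim.dep_unmet == 4);			/* victim left alone */
}
```

With a failing lookup, the old pattern decrements the counter of whatever object the stale pointer refers to, while the fixed pattern skips the entry. That matches the memory-corruption-like symptoms described above, under the stated assumption about the lookup's failure behavior.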