Re: ACPI scan regression -> Boot fail on Cherrytrail w/ 5.11-rc3

Hans de Goede <hdegoede@xxxxxxxxxx> · Fri, 15 Jan 2021 00:34:46 +0100

Hi,

On 1/14/21 10:55 PM, Pierre-Louis Bossart wrote:
> Hi,
> My primary test device for SOF on Cherrytrail no longer boots with v5.11-rc3 and the sof-dev branch, nothing happens after the 'loading initial ramdisk'. It's a 'Zotac' headless device derived from the Cherrytrail FFD design, so likely there are other devices hit by this problem.
> 
> A long bisect points to the commit 71da201f38dfb ('ACPI: scan: Defer enumeration of devices with _DEP lists').
> 
> Reverting the two commits below solves the boot issue.
> 
> I have absolutely no idea what these two patches do, but they sure have a large impact. Please let me know what sort of information or tests might help root-cause this problem.

Heh, I was just about to answer your other (off-list) email about your
CHT test device not booting, with a suggestion that you try reverting that
exact commit, as it is the only commit I'm aware of that went into 5.11
and might cause this...

So I just booted 5.11-rc3 on an Acer Aspire Switch 10E SW3-016 (x5-Z8300 CHT
based) myself and that booted fine.

Next I tried a MINIX NEO Z83-4 (x5-Z8300), which is a mini PC and as such
probably the closest thing I have at hand to the Zotac box you are using,
and I can somewhat reproduce it there.

It seems that the new code somehow causes us to hit a race somewhere, so
the NEO Z83-4 boots most of the time but not always. In the failing case
it got past the loading initrd phase for me, then threw the following
error, and after that the boot hung (waiting for the rootfs to show up):

platform device 80860F14: Resources present before probing

As I already told Rafael in a previous email, I did see something
similar when my personal tree was still 5.10 based, with the ACPI
scan rework patches cherry-picked for testing. In that case I got
a backtrace (followed by a hang) during boot about a kernel NULL
pointer deref triggered by sysfs_seq_file_read or some such. But
this problem went away with 5.11-rc1, so I stopped looking into
it. I do have a tag of my broken 5.10 + cherry-picks tree, so
I should be able to reproduce that issue.

So I see 2 possible theories here:

1. We have 2 probes of the same device racing somehow.
2. The struct device memory is getting corrupted somehow.

Pierre-Louis, can you see if the following hack helps?

--- a/drivers/acpi/scan.c
+++ b/drivers/acpi/scan.c
@@ -1939,7 +1939,6 @@ static acpi_status acpi_bus_check_add(acpi_handle handle, bool check_dep,
 		/* Bail out if the number of recorded dependencies is not 0. */
 		if (count > 0) {
 			acpi_bus_scan_second_pass = true;
-			return AE_CTRL_DEPTH;
 		}
 	}
 
@@ -1948,8 +1947,7 @@ static acpi_status acpi_bus_check_add(acpi_handle handle, bool check_dep,
 		return AE_CTRL_DEPTH;
 
 	acpi_scan_init_hotplug(device);
-	if (!check_dep)
-		acpi_scan_dep_init(device);
+	acpi_scan_dep_init(device);
 
 out:
 	if (!*adev_p)
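
To make clear what the hack above changes, here is a stand-alone toy model
of the dependency check. This is only a sketch and not kernel code: struct
toy_dev, toy_scan_pass() and toy_count_enumerated() are made-up names, and
it assumes nothing more than what the hunk itself shows, namely that with
check_dep set a device with a non-zero dependency count requests a second
pass and is skipped, unless the early return is dropped.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an ACPI device node; not a kernel type. */
struct toy_dev {
	const char *hid;
	unsigned int dep_count;	/* number of entries in its _DEP list */
	bool enumerated;
};

/*
 * Toy version of one scan pass. Returns true if a second pass is wanted.
 * check_dep: skip devices that have dependencies (first pass behaviour).
 * hack:      model the hack, i.e. request the second pass but do not
 *            bail out early.
 */
static bool toy_scan_pass(struct toy_dev *devs, size_t n,
			  bool check_dep, bool hack)
{
	bool second_pass = false;
	size_t i;

	for (i = 0; i < n; i++) {
		if (devs[i].enumerated)
			continue;	/* already added in an earlier pass */

		if (check_dep && devs[i].dep_count > 0) {
			second_pass = true;
			if (!hack)
				continue;	/* deferred to the second pass */
		}

		devs[i].enumerated = true;
	}

	return second_pass;
}

/* Count how many devices have been enumerated so far. */
static size_t toy_count_enumerated(const struct toy_dev *devs, size_t n)
{
	size_t i, count = 0;

	for (i = 0; i < n; i++)
		if (devs[i].enumerated)
			count++;

	return count;
}
```

In this model, a first pass without the hack leaves every device with
dependencies for the second pass, while with the hack everything is
enumerated in the first pass. If the hang goes away with the hack, that
would point at the deferral itself rather than at the rest of the rework.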

And can you collect an acpidump from the device and either send it to me and Rafael
off-list, or upload it somewhere and send us a link?

Regards,

Hans