Re: [PATCH] ACPI: don't show an error when we're not in charge of PCIe hotplug.

Andy Lutomirski <luto@xxxxxxxxxxxxxx> · Wed, 22 Jun 2016 13:53:19 -0700

On Wed, Jun 22, 2016 at 12:43 PM,  <Mario_Limonciello@xxxxxxxx> wrote:
>> -----Original Message-----
>> From: rjwysocki@xxxxxxxxx [mailto:rjwysocki@xxxxxxxxx] On Behalf Of
>> Rafael J. Wysocki
>> Sent: Tuesday, June 21, 2016 5:51 PM
>> To: Limonciello, Mario <Mario_Limonciello@xxxxxxxx>
>> Cc: Peter Jones <pjones@xxxxxxxxxx>; Rafael J. Wysocki
>> <rafael@xxxxxxxxxx>; ACPI Devel Maling List <linux-acpi@xxxxxxxxxxxxxxx>;
>> Linux Kernel Mailing List <linux-kernel@xxxxxxxxxxxxxxx>; Len Brown
>> <lenb@xxxxxxxxxx>; Rafael J. Wysocki <rjw@xxxxxxxxxxxxx>; Andy Lutomirski
>> <luto@xxxxxxxxxxxxxx>
>> Subject: Re: [PATCH] ACPI: don't show an error when we're not in charge of
>> PCIe hotplug.
>>
>> On Tue, Jun 21, 2016 at 8:01 PM,  <Mario_Limonciello@xxxxxxxx> wrote:
>> >> -----Original Message-----
>> >> From: Peter Jones [mailto:pjones@xxxxxxxxxx]
>> >> Sent: Tuesday, June 21, 2016 10:19 AM
>> >> To: Rafael J. Wysocki <rafael@xxxxxxxxxx>
>> >> Cc: ACPI Devel Maling List <linux-acpi@xxxxxxxxxxxxxxx>; Limonciello,
>> Mario
>> >> <Mario_Limonciello@xxxxxxxx>; Linux Kernel Mailing List <linux-
>> >> kernel@xxxxxxxxxxxxxxx>; Len Brown <lenb@xxxxxxxxxx>; Rafael J .
>> Wysocki
>> >> <rjw@xxxxxxxxxxxxx>; Andy Lutomirski <luto@xxxxxxxxxxxxxx>
>> >> Subject: Re: [PATCH] ACPI: don't show an error when we're not in charge
>> of
>> >> PCIe hotplug.
>> >>
>> >> (Sorry for the slow response - it's deadline time over here.)
>> >>
>> >> On Thu, Jun 16, 2016 at 04:56:57PM +0200, Rafael J. Wysocki wrote:
>> >> > On Thu, Jun 16, 2016 at 2:12 AM, Rafael J. Wysocki <rafael@xxxxxxxxxx>
>> >> wrote:
>> >> > > On Thu, Jun 16, 2016 at 12:15 AM, Peter Jones <pjones@xxxxxxxxxx>
>> >> wrote:
>> >> > >> Right now when booting, on many laptops the firmware manages
>> the
>> >> PCIe
>> >> > >> bus.  As a result, when we call the _OSC ACPI method, it returns an
>> >> > >> error code.  Unfortunately the errors are not very articulate.
>> >> > >
>> >> > > What exactly do you mean here?
>> >> > >
>> >> > >>  As a result, we show:
>> >> > >>
>> >> > >> ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-fe])
>> >> > >> acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM
>> ClockPM
>> >> Segments MSI]
>> >> > >> \_SB_.PCI0 (33DB4D5B-1FF7-401C-9657-7441C03DD766): _OSC invalid
>> >> UUID
>> >> > >> _OSC request data: 1 1f 0
>> >> > >
>> >> > > So _OSC told us that the UUID was invalid, didn't it?
>> >> >
>> >> > BTW, the above messages are KERN_DEBUG, so at least in theory they
>> >> > shouldn't be visible in production runs.
>> >> >
>> >> > Maybe the bug to fix is that they show up when they aren't supposed
>> to?
>> >>
>> >> No - the workflow that I am really trying to remedy is this:
>> >>
>> >>  1) S3 resume sometimes isn't working on some laptop you've got.
>> >>  2) start looking at debug messages
>> >>  3) this shows an error, so it looks like it's probably the problem
>> >>  4) go fishing for red herring
>> >>  5) if you happen to know who maintains the DSDT for the platform in
>> >>     question, eventually work out that this is working as intended and
>> >>     the bug is someplace else.
>> >> 5b) if you don't know that person, eventually work out that it /might/
>> >>      be someplace else...
>> >>
>> >> So the idea was to make it look more like an indication of status, and
>> >> less like an error that's causing unrelated problems.
>> >>
>> >> When I talked to Mario at Dell (Cc'd), it wasn't clear to us that
>> >> there's a way to distinguish the between the UUID being
>> >> invalid/malformed, being merely unsupported, or being supported in
>> some
>> >> configurations but not the current one.  In this particular DSDT, the
>> >> machine doesn't support the OS controlling any of this if USB-C /
>> >> thunderbolt are enabled.  The DSDT is clearly written with the belief
>> >> that you have to completely disable the handling for that UUID in this
>> >> case, and googling for this looks like it's not the only one written
>> >> with that belief.
>> >>
>> >> Reading the spec (v6.1, sections 6.2.11.3 and 6.2.11.4), it seems
>> >> plausible that you can express this instead by handling the UUID but
>> >> choosing each individual query/status bit in the way that accomplishes
>> >> the OS doing nothing with the response.  So it may well be that that's
>> >> just more code that vendors have thought wasn't necessary (or wasn't
>> >> correct for some reason.)
>> >>
>> >> Mario, want to jump in on your thinking here?
>> >>
>> >> --
>> >>   Peter
>> >
>> > After talking to the team, I was told this particular implementation to not
>> let
>> > OS take control when acting on that specific UUID based upon a variable
>> > (NEXP in this case) came from Intel RC code.
>> >
>> > That's probably why this is all across a lot of platforms, including non-Dell.
>> >
>> > At least in the context of the laptop Peter noticed this on (Dell XPS 13 9350)
>> > NEXP is set in GNVS based upon Thunderbolt capability.
>> >
>> > As for why they return unrecognized UUID instead of just masking all the
>> > capabilities bits?  It's the same net functional result.  If the vendor provided
>> > RC code doesn't caused WCHK problems or functional problems it's hard to
>> > make a case for why it needs to be changed by the OEM.
>> >
>> > I think that Peter's patch is appropriate to message this is specifically
>> > what's going on.
>>
>> No, it may hide real (ie. non-intentional) bugs in _OSC, so it is not
>> appropriate.
>>
>> Debug-level messages really should not hurt anyone (and should never
>> show up in production anyway).
>>
>> We can slightly tone down the "_OSC failed (%s); disabling ASPM\n"
>> message in negotiate_os_control() in drivers/acpi/pci_root.c if you
>> think it's too strong and that's it.
>>
>> Thanks,
>> Rafael
>
> I think changing that would help communicate what's going on here and at
> least let the user know the result will be that the firmware is still controlling
> ASPM due to the _OSC failure.
>
> Something else that I think Andy recommended a while back was at that
> time try to evaluate NEXP and display its value and an associated message
> in debug logs when _OSC fails.  Would you be amenable to a change like that?

That seems dangerous if NEXP is anything other than a SystemMemory
variable.  I don't know if there's a clean way to check that before
evaluating it.  (i.e. we don't want to hit some other thing called
NEXP that has side effects.)

-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html