[Bug 204807] Hardware monitoring sensor nct6798d doesn't work unless acpi_enforce_resources=lax is enabled

bugzilla-daemon@xxxxxxxxxxxxxxxxxxx · Fri, 19 Mar 2021 19:13:59 +0000

https://bugzilla.kernel.org/show_bug.cgi?id=204807

Matthew Garrett (mjg59-kernel@xxxxxxxxxxxxx) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |RESOLVED
         Resolution|---                         |INVALID

--- Comment #37 from Matthew Garrett (mjg59-kernel@xxxxxxxxxxxxx) ---
Here's the situation. Your ACPI tables declare that your system firmware may
access the addresses associated with your IO sensors. We have no idea what your
firmware may do here - it may do nothing (in which case accessing the addresses
is completely safe), or it may use them for its own internal monitoring. Sensor
hardware frequently uses indexed addressing, which means that accessing a
sensor requires something like the following:

1) Write the desired sensor to the index register
2) Read the sensor value from the data register

These can't occur simultaneously, so if both the OS and the firmware are
accessing it you risk ending up with something like:

1) Write sensor A to the index register (from the OS)
2) Write sensor B to the index register (from the firmware)
3) Read the sensor value from the data register (returns the value of sensor B
to the firmware)
4) Read the sensor value from the data register (returns the value of sensor B
to the OS)

The OS asked for the value of sensor A, but received the value of sensor B.
From the OS side this is probably not a big deal (you get a weird value in your
graphing), but if it happens the other way around the firmware may decide that
the system is running out of spec and shut it down to avoid damage. This is not
a good user experience.
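
The interleaving above can be replayed with a toy model. This is only a sketch: the IndexedSensorChip class, the racy_reads helper, the sensor names and the readings are all made up for illustration and don't correspond to any real chip's register layout.

```python
class IndexedSensorChip:
    """Toy model of a chip with one index register and one data register."""

    def __init__(self, values):
        self._values = dict(values)  # sensor name -> current reading
        self._index = None           # last sensor written to the index register

    def write_index(self, sensor):
        self._index = sensor

    def read_data(self):
        # The data register reflects whichever sensor was selected last.
        return self._values[self._index]


def racy_reads(chip):
    """Replay the four-step interleaving; returns (firmware_value, os_value)."""
    chip.write_index("A")              # 1) OS selects sensor A
    chip.write_index("B")              # 2) firmware selects sensor B
    firmware_value = chip.read_data()  # 3) firmware reads and gets B, as intended
    os_value = chip.read_data()        # 4) OS reads and also gets B, not A
    return firmware_value, os_value


chip = IndexedSensorChip({"A": 40, "B": 95})  # hypothetical readings
firmware_value, os_value = racy_reads(chip)
print(firmware_value, os_value)  # prints "95 95"
```

Swapping who writes the index first in steps 1 and 2 hands the wrong value to the firmware instead, which is the dangerous case.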

Why does Windows not have the same problem? Well, in the general case there's
nothing stopping it from doing so. Vendor tooling usually takes one of two
approaches:

1) They don't use the hardware sensors directly, they use firmware interfaces
to them. This is alluded to in comment #31 - on Asus systems, the sensors are
available via a WMI interface. Using a firmware interface ensures that the
firmware knows what the state of the hardware is, and avoids any race
conditions. Your board may well support an alternative firmware interface and
Linux simply lacks driver support for it. If so, I'm afraid that the correct
solution is to add that driver support. Given that this bug has ended up
covering boards from multiple vendors, it's no longer the correct place to
handle that, though.
2) The vendor knows that the firmware makes no policy decisions based on the
sensor values, so it's safe to access the resources even though the firmware
declares that it uses them. The problem with this approach is that *we* have no
way of knowing that it's safe, and the consequences of it being unsafe include
data loss. Given the choice between users being able to look at system
temperatures and users not losing data, we choose to prioritise users not
losing data.

Looking at your ACPI tables, we see the following:

    Name (IOHW, 0x0290)

    OperationRegion (SHWM, SystemIO, IOHW, 0x0A)
    Field (SHWM, ByteAcc, NoLock, Preserve)
    {
        Offset (0x05), 
        HIDX,   8, 
        HDAT,   8
    }

This means that there's a region of IO ports starting at address 0x290 and 0x0a
addresses long. This is the same region of port IO that your sensor chip uses.
Within that address range, we declare that 0x295 is called HIDX, and 0x296 is
called HDAT. This is consistent with an index and data register as described
above, which means that having the OS access this space directly is likely to
race with the firmware (ie, it's dangerous).
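
As a quick check of that arithmetic, here is a small sketch that derives the port numbers from the declaration. The field_ports helper is hypothetical and only handles byte-aligned 8-bit fields like the ones in this table; it is not a general ASL Field parser.

```python
IOHW = 0x0290      # region base, from Name (IOHW, 0x0290)
REGION_LEN = 0x0A  # region length, from OperationRegion (SHWM, SystemIO, IOHW, 0x0A)


def field_ports(base, start_offset, fields):
    """Map field names to I/O ports, starting start_offset bytes into the region."""
    ports = {}
    bit_offset = start_offset * 8
    for name, width_bits in fields:
        if width_bits != 8 or bit_offset % 8:
            raise ValueError("this sketch only handles byte-aligned 8-bit fields")
        ports[name] = base + bit_offset // 8
        bit_offset += width_bits
    return ports


# Offset (0x05) followed by two 8-bit fields, as in the Field declaration.
ports = field_ports(IOHW, 0x05, [("HIDX", 8), ("HDAT", 8)])
region = range(IOHW, IOHW + REGION_LEN)
for name, port in ports.items():
    print(f"{name}: {port:#06x} (inside region: {port in region})")
```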

Near here are two methods called RHWM and WHWM. At a guess, that's "Read
Hardware Monitoring" and "Write Hardware Monitoring". These not only access the
sensors via the registers described above, they do some additional hardware
access around it. This is further evidence to support there being some
handshaking involved to avoid race conditions - the firmware takes a mutex and
appears to hit some other register that may also be used to guard against
racing against system management mode. We really, *really* want to be using the
firmware methods here rather than touching the sensor chip directly. At this
point, direct access isn't so much walking past a sign saying "Danger, keep
out", it's a sign saying "Proceed no further or you will die slowly and it will
hurt the entire time".

RHWM is referenced from the WMBD method if the first argument to it is RHWM,
and WHWM is referenced if the argument is WHWM. WMBD is the WMI dispatcher for
the WMI function with identifier "BD" - looking at your _WDG object, which
describes the available WMI interfaces, we have the following:

            Name (_WDG, Buffer (0x50)
            {
                /* 0000 */  0xD0, 0x5E, 0x84, 0x97, 0x6D, 0x4E, 0xDE, 0x11,  // .^..mN..
                /* 0008 */  0x8A, 0x39, 0x08, 0x00, 0x20, 0x0C, 0x9A, 0x66,  // .9.. ..f
                /* 0010 */  0x42, 0x43, 0x01, 0x02, 0xA0, 0x47, 0x67, 0x46,  // BC...GgF
                /* 0018 */  0xEC, 0x70, 0xDE, 0x11, 0x8A, 0x39, 0x08, 0x00,  // .p...9..
                /* 0020 */  0x20, 0x0C, 0x9A, 0x66, 0x42, 0x44, 0x01, 0x02,  //  ..fBD..
                /* 0028 */  0x72, 0x0F, 0xBC, 0xAB, 0xA1, 0x8E, 0xD1, 0x11,  // r.......
                /* 0030 */  0x00, 0xA0, 0xC9, 0x06, 0x29, 0x10, 0x00, 0x00,  // ....)...
                /* 0038 */  0xD2, 0x00, 0x01, 0x08, 0x21, 0x12, 0x90, 0x05,  // ....!...
                /* 0040 */  0x66, 0xD5, 0xD1, 0x11, 0xB2, 0xF0, 0x00, 0xA0,  // f.......
                /* 0048 */  0xC9, 0x06, 0x29, 0x10, 0x4D, 0x4F, 0x01, 0x00   // ..).MO..
            })

The format of _WDG is 16 bytes of GUID, 2 bytes of ID or notification data, 1
byte of instance count and 1 byte of flags. The GUID used by asus-wmi
corresponds to the first GUID in this file,
97845ED0-4E6D-11DE-8A39-0800200C9A66. That has an ID of 0x4243, or BC - ie,
it's not the GUID we're looking for. The next GUID, however,
(466747a0-70ec-11de-8a39-0800200c9a66) has an identifier of 0x4244, or BD. So
this is the GUID we're looking for. Unfortunately asus-wmi doesn't handle this
GUID, so new code will need to be written.
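
The decoding described above can be sketched in a few lines. The bytes are copied from the _WDG buffer quoted earlier; the WdgEntry and parse_wdg names are made up for this example. GUIDs in the buffer are stored in the mixed-endian layout that Python's uuid.UUID(bytes_le=...) understands, which is why the first three groups appear byte-swapped in the raw dump.

```python
import uuid
from typing import NamedTuple

# The 0x50-byte _WDG buffer, one dump row per line.
WDG = bytes.fromhex(
    "D0 5E 84 97 6D 4E DE 11 "
    "8A 39 08 00 20 0C 9A 66 "
    "42 43 01 02 A0 47 67 46 "
    "EC 70 DE 11 8A 39 08 00 "
    "20 0C 9A 66 42 44 01 02 "
    "72 0F BC AB A1 8E D1 11 "
    "00 A0 C9 06 29 10 00 00 "
    "D2 00 01 08 21 12 90 05 "
    "66 D5 D1 11 B2 F0 00 A0 "
    "C9 06 29 10 4D 4F 01 00 "
)


class WdgEntry(NamedTuple):
    guid: uuid.UUID
    ident: bytes    # 2 bytes of ID or notification data
    instances: int  # 1 byte of instance count
    flags: int      # 1 byte of flags


def parse_wdg(buf):
    """Split a _WDG buffer into 20-byte entries."""
    if len(buf) % 20:
        raise ValueError("_WDG length must be a multiple of 20 bytes")
    entries = []
    for off in range(0, len(buf), 20):
        raw = buf[off:off + 20]
        guid = uuid.UUID(bytes_le=raw[:16])  # mixed-endian GUID
        entries.append(WdgEntry(guid, raw[16:18], raw[18], raw[19]))
    return entries


entries = parse_wdg(WDG)
for entry in entries:
    print(entry.guid, entry.ident, entry.instances, hex(entry.flags))
```

The second entry printed is 466747a0-70ec-11de-8a39-0800200c9a66 with ID bytes b'BD', matching the reading above.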

I'm going to close this bug again because it's turned into a generic bug
covering different motherboard vendors, and there's no one size fits all
solution. For your case the correct way to handle it is for someone to write a
driver that uses the 466747a0-70ec-11de-8a39-0800200c9a66 interface to expose
the sensor data. I'm afraid I don't have relevant hardware so can't do this
myself, but please do open another bug for that.

tl;dr - the kernel message you're seeing is correct. Avoiding it requires a new
driver to be written. If you *personally* feel safe in ignoring the risks, you
can pass the acpi_enforce_resources=lax option, but that can't be the default
because it's unsafe in the general case, and so it isn't the solution to the
wider problem.
