Re: [PATCH v4 2/2] cxl: avoid duplicated report from MCE & device

Shiyang Ruan <ruansy.fnst@xxxxxxxxxxx> · Mon, 2 Sep 2024 22:19:25 +0800

On 2024/8/27 23:52, Jonathan Cameron wrote:
On Thu,  8 Aug 2024 23:13:28 +0800
Shiyang Ruan <ruansy.fnst@xxxxxxxxxxx> wrote:

Since a CXL device is a memory device, a CPU consuming a poisoned page
of a CXL device always triggers an MCE (via interrupt #18) and calls
memory_failure() to handle the POISON page, no matter which "-First"
path (FW-First or OS-First) is configured.  The CXL device can also find
and report the POISON itself; the kernel now not only traces that report
but also calls memory_failure() to handle it, which is marked as "NEW"
in the figure below.
```
1.  MCE (interrupt #18, while CPU consuming POISON)
      -> do_machine_check()
        -> mce_log()
          -> notify chain (x86_mce_decoder_chain)
            -> memory_failure() <---------------------------- EXISTS
2.a FW-First (optional, CXL device proactively find&report)
      -> CXL device -> Firmware
        -> OS: ACPI->APEI->GHES->CPER -> CXL driver -> trace
                                                   \-> memory_failure()
                                                       ^----- NEW
2.b OS-First (optional, CXL device proactively find&report)
      -> CXL device -> MSI
        -> OS: CXL driver -> trace
                         \-> memory_failure()
                             ^------------------------------- NEW
```

But in this way, memory_failure() could be called twice, or even at the
same time, as shown in the figure above: (1.) and (2.a or 2.b), before
the POISON page is cleared.  memory_failure() has its own mutex lock, so
the two calls won't actually run at the same time, and the later call
can be skipped because the HWPoison bit has already been set.  However,
consider this scenario: "CXL device reports POISON error" triggers the
1st call; the user sees it in the log and tries to clear the poison by
executing the `cxl clear-poison` command; at the same time, a process
tries to access this POISON page, which triggers an MCE (the 2nd call).

Attempting to clear poison in a page that is online seems unwise.
Does that ever make sense today?

To be honest, I am not sure about this.  Even if the error from the CXL
device is recoverable, would we never reuse it?

  Since there
is no lock between the 2nd call and the clear-poison operation, a race
condition may happen, which may leave the HWPoison bit of the page in an
unknown state.

As long as that state is only ever wrong in the sense that we think it's
poisoned when it isn't, we don't care.

The 2nd memory_failure() needs this state to determine whether to
continue processing or return.
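
The dependency on that state can be illustrated with a small user-space
sketch.  This is not kernel code and not memory_failure() itself:
`struct fake_page`, `report_poison()` and `clear_poison()` are
hypothetical names, and a C11 atomic flag stands in for the HWPoison
page flag.  It only models a test-and-set pattern, under the assumption
that whichever report sets the flag first handles the page and a later
report returns early unless the flag was cleared in between.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; only the poison flag is modelled. */
struct fake_page {
    atomic_bool hwpoison;
};

/*
 * Model of one poison report (MCE path or device path).
 * Returns true if this caller should go on to handle the page,
 * false if an earlier report already marked it as poisoned.
 */
static bool report_poison(struct fake_page *p)
{
    assert(p != NULL);
    /* atomic_exchange() returns the previous value: a test-and-set. */
    return !atomic_exchange(&p->hwpoison, true);
}

/*
 * Model of a clear-poison path that resets the flag without
 * coordinating with any reporter (the unlocked case discussed above).
 */
static void clear_poison(struct fake_page *p)
{
    assert(p != NULL);
    atomic_store(&p->hwpoison, false);
}
```

In this model a first report returns true, a repeated report returns
false, and a report that arrives after clear_poison() returns true
again, so the outcome of the 2nd call depends on how it interleaves
with the clear operation.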

Thus, we have to avoid the 2nd call. This patch[2] introduces a new
notifier_block into `x86_mce_decoder_chain` and a POISON cache list, to
stop the 2nd call of memory_failure(). It checks whether the current
poison page has already been reported; if so, it stops the notifier
chain so that the following memory_failure() is not called to report
the page again.
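
The idea of a POISON cache consulted from a notifier can be sketched in
user-space C as follows.  This is only an illustration under stated
assumptions, not the code from the patch: `struct poison_cache`,
`poison_cache_add()`, `mce_notifier_sketch()` and the SKETCH_NOTIFY_*
values are hypothetical names, the fixed-size array is an arbitrary
choice, and the locking a real kernel implementation would need is
omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define POISON_CACHE_SLOTS 64   /* arbitrary size for this sketch */

/* Hypothetical result codes, loosely modelled on notifier return values. */
enum notify_result { SKETCH_NOTIFY_OK, SKETCH_NOTIFY_STOP };

/* Hypothetical cache of page frame numbers already reported by the device. */
struct poison_cache {
    uint64_t pfn[POISON_CACHE_SLOTS];
    size_t count;
};

static bool poison_cache_contains(const struct poison_cache *c, uint64_t pfn)
{
    assert(c != NULL);
    for (size_t i = 0; i < c->count; i++)
        if (c->pfn[i] == pfn)
            return true;
    return false;
}

/* Device-report path: remember the PFN.  Returns false if the cache is full. */
static bool poison_cache_add(struct poison_cache *c, uint64_t pfn)
{
    assert(c != NULL);
    if (poison_cache_contains(c, pfn))
        return true;                 /* already recorded, nothing to do */
    if (c->count == POISON_CACHE_SLOTS)
        return false;
    c->pfn[c->count++] = pfn;
    return true;
}

/*
 * MCE path: if the device already reported this PFN, ask the chain to
 * stop so that later callbacks (the one leading to memory_failure() in
 * the description above) are skipped; otherwise let the chain continue.
 */
static enum notify_result mce_notifier_sketch(const struct poison_cache *c,
                                              uint64_t pfn)
{
    return poison_cache_contains(c, pfn) ? SKETCH_NOTIFY_STOP
                                         : SKETCH_NOTIFY_OK;
}
```

The sketch leaves open when an entry should be removed from the cache
(for example once the poison is cleared).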

If we do want to do this, it belongs in the generic code, not the
arch-specific part. Can we do something similar in the memory failure
code?

Yes, I saw the build error.  Will fix this.

To RAS reviewers, this isn't a new problem unique to CXL. Does a solution
like this make sense in practice, or are we fine to always let two reports
for the same error get handled?

Jonathan

--
Thanks,
Ruan.