Re: [RFC PATCH] cxl: avoid duplicating report from MCE & device

Shiyang Ruan <ruansy.fnst@xxxxxxxxxxx> · Fri, 19 Jul 2024 14:24:13 +0800

在 2024/6/19 0:53, Shiyang Ruan 写道:

This patch adds a new notifier_block and MCE_PRIO_CXL, for CXL memdev
to check whether the current poison page has been reported (if yes,
stop the notifier chain, won't call the following memory_failure()
to report), into `x86_mce_decoder_chain`.  In this way, if the poison
page already handled(recorded and reported) in (1) or (2), the other one
won't duplicate the report.  The record could be clear when
cxl_clear_poison() is called.

Hi guys,

I'd like to sort out the work I am currently carrying forward, to make 
sure I'm not going in the wrong direction. Please correct me if anything 
is wrong.

As is known to us, CXL spec defines POISON feature to notify its status 
when CXL memory device got a broken page.  Basically, there are two 
major paths for the notification.

1. CPU handling error
  When a process is accessing this broken page, CXL device returns data
  with POISON.  When CPU consumes the POISON, it raises a kind of error
  notification.
  To be precise, "how CPU should behave when it consumes POISON" is
  architecture dependent.  In my understanding, x86-64 raises Machine
  Check Exception(MCE) via interrupt #18 in this case.
2. CXL device reporting error
  When CXL device detects the broken page by itself and sends memory
  error signal to kernel in two optional paths.
  2.a. FW-First
    CXL device sends error via VDM to CXL Host, then CXL Host sends it
    to System Firmware via interrupt, finally kernel handles the error.
  2.b. OS-First
    CXL device directly sends error via MSI/MSI-X to kernel.

Note: Since I'm now focusing on x86_64, basically I'll describe about 
x86-64 only.

The following diagram should describe the 2 major paths and 2 optional 
sub-paths above.
```
1.  MCE (interrupt #18, while CPU consuming POISON)
    -> do_machine_check()
      -> mce_log()
        -> notify chain (x86_mce_decoder_chain)
          -> memory_failure()
2.a FW-First (optional, CXL device proactively find&report)
    -> CXL device -> Firmware
      -> OS: ACPI->APEI->GHES->CPER -> CXL driver -> trace
2.b OS-First (optional, CXL device proactively find&report)
    -> CXL device -> MSI
      -> OS: CXL driver -> trace
```

For "1. CPU handling error" path, the current code seems to work fine. 
When I used error injection feature on QEMU emulation, the code path is 
executed certainly.  Then, if the CPU certainly raises a MCE when it 
consumes the POISON, this path has no problem.

So, I'm working on making for 2.a and 2.b path, which is CXL device 
reported POISON error could be handled by kernel.  This path has two 
advantages.

- Proactively find&report memory problems

  Even if a process does not read data yet, kernel/drivers can prevent
  the process from using corrupted data proactively.  AFAIK, the current
  kernel only traces POISON error event from FW-First/OS-First path, but
  it doesn't handle them, neither notify processes who are using the
  POISON page like MCE does.  User space tools like rasdaemon reads the
  trace and log it, but as well, it doesn't handle the POISON page.  As
  a result, user has to read the error log from rasdaemon, distinguish
  whether the POISON error is from CXL memory or DDR memory, find out
  which applications are effected.  That is not an easy work and cannot
  be handled in time.  Thus, I'd like to add a feature to make the work
  done automatically and quickly. Once CXL device reports the POISON
  error (via FW-First/OS-First), kernel handles it immediately, similar
  to the flow when a MCE is triggered.  This is my first motivation.

- Architecture independent

  As the mentioned above, "1. CPU handling error" path is architecture
  dependent.  On the other hand, this route can be architecture
  independent code.  If there is a CPU which does not have similar
  feature like MCE of x86-64, my work will be essential.  (To be honest,
  I did not notice this advantage at first as mentioned later, but I
  think this is also important.)

Here is the timeline of my development of it.

Two series of patches have been sent so far:
- PATCH: cxl/core: add poison creation event handler [1]
- PATCH: cxl: avoid duplicating report from MCE & device [2]
[1] 
https://lore.kernel.org/linux-cxl/20240417075053.3273543-1-ruansy.fnst@xxxxxxxxxxx/
[2] 
https://lore.kernel.org/linux-cxl/20240618165310.877974-1-ruansy.fnst@xxxxxxxxxxx/

The 1st patch[1] added POISON error handler in "2. CXL device reporting 
error" path.

My first version was constructing a MCE data from POISON address and 
calling mce_log() to handle the POISON.  But I was told that 
constructing MCE data is architecture dependent while CXL is not.  So, 
in later version, just call memory_failure_queue() in CXL to handle the 
POISON error to avoid the arch-dependent problem.

After many discussions, a new problem was found: as Dan said[3], added 
POISON handling will cause the "duplicate report" problem:
> So, I think all CXL poison notification events should trigger an
> action optional memory_failure(). I expect this needs to make sure
> that duplicates re not a problem. I.e. in the case of CPU consumption
> of CXL poison, that causes a synchronous MF_ACTION_REQUIRED event via
> the MCE path *and* it may trigger the device to send an error record
> for the same page. As far as I can see, duplicate reports (MCE + CXL
> device) are unavoidable.

[3] 
https://lore.kernel.org/linux-cxl/664d948fb86f0_e8be294f8@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.notmuch/

To solve this problem, I made the 2nd patch[2].  Allow me to describe 
the background again:

Since CXL device is a memory device, while CPU is consuming a poison 
page of CXL device, it always triggers a MCE (via interrupt #18) and 
calls memory_failure() to handle POISON page, no matter which-First path 
is configured.

My patch added memory_failure() in FW-First/OS-First path: if device 
finds and reports the POISON, kernel not only traces but also calls 
memory_failure() to handle it, marked as "ADD" in the figure blow.
```
1.  MCE (interrupt #18, while CPU consuming POISON)
    -> do_machine_check()
      -> mce_log()
        -> notify chain (x86_mce_decoder_chain)
          -> memory_failure() <---------------------------- EXISTS
2.a FW-First (optional, CXL device proactively find&report)
    -> CXL device -> Firmware
      -> OS: ACPI->APEI->GHES->CPER -> CXL driver -> trace
                                                 \-> memory_failure()
                                                     ^----- ADD
2.b OS-First (optional, CXL device proactively find&report)
    -> CXL device -> MSI
      -> OS: CXL driver -> trace
                       \-> memory_failure()
                           ^------------------------------- ADD
```

But in this way, the memory_failure() could be called twice or even at 
same time, as is shown in the figure above: (1.) and (2.a or 2.b), 
before the POISON page is cleared.  memory_failure() has it own mutex 
lock so it actually won't be called at same time and the later call 
could be avoided because HWPoison bit has been set.  However, assume 
such a scenario, "CXL device reports POISON error" triggers 1st call, 
user see it from log and want to clear the poison by executing `cxl 
clear-poison` command, and at the same time, a process tries to access 
this POISON page, which triggers MCE (it's the 2nd call).  Since there 
is no lock between the 2nd call with clearing poison operation, race 
condition may happen, which may cause HWPoison bit of the page in an 
unknown state.

Thus, we have to avoid the 2nd call. This patch[2] introduces a new 
notifier_block into `x86_mce_decoder_chain` and a POISON cache list, to 
stop the 2nd call of memory_failure(). It checks whether the current 
poison page has been reported (if yes, stop the notifier chain, don't 
call the following memory_failure() to report again).

Looking forward to your comments!

--
Thanks,
Ruan.