On Thu, 8 Aug 2024 23:13:28 +0800
Shiyang Ruan <ruansy.fnst@xxxxxxxxxxx> wrote:

> Since CXL device is a memory device, while CPU is consuming a poison
> page of CXL device, it always triggers a MCE (via interrupt #18) and
> calls memory_failure() to handle POISON page, no matter which-First path
> is configured. CXL device could also find and report the POISON, kernel
> now not only traces but also calls memory_failure() to handle it, which
> is marked as "NEW" in the figure below.
> ```
> 1.  MCE (interrupt #18, while CPU consuming POISON)
>      -> do_machine_check()
>        -> mce_log()
>          -> notify chain (x86_mce_decoder_chain)
>            -> memory_failure() <---------------------------- EXISTS
> 2.a FW-First (optional, CXL device proactively find&report)
>      -> CXL device -> Firmware
>        -> OS: ACPI->APEI->GHES->CPER -> CXL driver -> trace
>                                                   \-> memory_failure()
>                                                       ^----- NEW
> 2.b OS-First (optional, CXL device proactively find&report)
>      -> CXL device -> MSI
>        -> OS: CXL driver -> trace
>                         \-> memory_failure()
>                             ^------------------------------- NEW
> ```
>
> But in this way, the memory_failure() could be called twice or even at
> same time, as is shown in the figure above: (1.) and (2.a or 2.b),
> before the POISON page is cleared. memory_failure() has its own mutex
> lock so it actually won't be called at same time and the later call
> could be avoided because HWPoison bit has been set. However, assume
> such a scenario, "CXL device reports POISON error" triggers 1st call,
> user see it from log and want to clear the poison by executing `cxl
> clear-poison` command, and at the same time, a process tries to access
> this POISON page, which triggers MCE (it's the 2nd call).

Attempting to clear poison in a page that is online seems unwise.
Does that ever make sense today?

> Since there is no lock between the 2nd call with clearing poison
> operation, race condition may happen, which may cause HWPoison bit of
> the page in an unknown state.

As long as that state is only ever wrong in the sense that we think it's
poisoned when it isn't, we don't care.

>
> Thus, we have to avoid the 2nd call. This patch[2] introduces a new
> notifier_block into `x86_mce_decoder_chain` and a POISON cache list, to
> stop the 2nd call of memory_failure(). It checks whether the current
> poison page has been reported (if yes, stop the notifier chain, don't
> call the following memory_failure() to report again).
>

If we do want to do this, it belongs in the generic code, not the
arch-specific part. Can we do something similar in the memory failure code?

To RAS reviewers, this isn't a new problem unique to CXL. Does a
solution like this make sense in practice, or are we fine to always let
two reports for the same error get handled?

Jonathan
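
For illustration only, here is a rough userspace model of what doing the
duplicate check in generic code, rather than in an x86 notifier, might
look like. It is a sketch under assumptions, not a proposal for the real
implementation: every name in it (model_memory_failure(),
model_clear_poison(), reported_pfns, poison_lock) is made up, and the
single lock shared between reporting and clearing is an assumption about
how the race described in the quoted text could be closed.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only: every name below is hypothetical, not a kernel API. */
#define MAX_REPORTED 64

enum report_result { REPORT_HANDLED, REPORT_DUPLICATE };

static unsigned long reported_pfns[MAX_REPORTED];	/* "POISON cache list" */
static size_t nr_reported;
int handled_count;			/* times the real handling ran */
static pthread_mutex_t poison_lock = PTHREAD_MUTEX_INITIALIZER;

/* Caller must hold poison_lock. Returns the index of pfn, or -1 if absent. */
static long find_reported(unsigned long pfn)
{
	for (size_t i = 0; i < nr_reported; i++)
		if (reported_pfns[i] == pfn)
			return (long)i;
	return -1;
}

/* Generic entry point: every reporter (MCE, FW-First, OS-First) lands here. */
enum report_result model_memory_failure(unsigned long pfn)
{
	enum report_result ret = REPORT_DUPLICATE;

	pthread_mutex_lock(&poison_lock);
	if (find_reported(pfn) < 0) {
		/* If the cache is full we still handle; we just cannot dedup later. */
		if (nr_reported < MAX_REPORTED)
			reported_pfns[nr_reported++] = pfn;
		handled_count++;	/* stand-in for the real poison handling */
		ret = REPORT_HANDLED;
	}
	pthread_mutex_unlock(&poison_lock);
	return ret;
}

/* Clearing takes the same lock, so it cannot interleave with a report. */
bool model_clear_poison(unsigned long pfn)
{
	bool found;
	long idx;

	pthread_mutex_lock(&poison_lock);
	idx = find_reported(pfn);
	found = idx >= 0;
	if (found)
		reported_pfns[idx] = reported_pfns[--nr_reported]; /* swap-remove */
	pthread_mutex_unlock(&poison_lock);
	return found;
}
```

In this model the second report for a pfn returns REPORT_DUPLICATE no
matter which path delivered it, and a report arriving after a clear is
handled again as a fresh error. Whether the real code wants a separate
cache at all, given the existing HWPoison bit check, is a question the
model does not answer.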