Re: [PATCH v9 0/2] ACPI: APEI: handle synchronous errors in task work with proper si_code

Borislav Petkov <bp@xxxxxxxxx> · Thu, 23 Nov 2023 16:07:10 +0100

On Sat, Oct 07, 2023 at 03:28:16PM +0800, Shuai Xue wrote:
> However, this trick is not always be effective

So far so good.

What's missing here is why "this trick" is not always effective.

Basically to explain what exactly the problem is.

> For example, hwpoison-aware user-space processes use the si_code:
> BUS_MCEERR_AO for 'action optional' early notifications, and BUS_MCEERR_AR
> for 'action required' synchronous/late notifications. Specifically, when a
> signal with SIGBUS_MCEERR_AR is delivered to QEMU, it will inject a vSEA to
> Guest kernel. In contrast, a signal with SIGBUS_MCEERR_AO will be ignored
> by QEMU.[1]
> 
> Fix it by seting memory failure flags as MF_ACTION_REQUIRED on synchronous events. (PATCH 1)

So you're fixing qemu by "fixing" the kernel?

This doesn't make any sense.

Make errors which are ACPI_HEST_NOTIFY_SEA type return
MF_ACTION_REQUIRED so that it *happens* to fix your use case.

Sounds like a lot of nonsense to me.

What is the issue here you're trying to solve?

> 2. Handle memory_failure() abnormal fails to avoid a unnecessary reboot
> 
> If process mapping fault page, but memory_failure() abnormal return before
> try_to_unmap(), for example, the fault page process mapping is KSM page.
> In this case, arm64 cannot use the page fault process to terminate the
> synchronous exception loop.[4]
> 
> This loop can potentially exceed the platform firmware threshold or even trigger
> a kernel hard lockup, leading to a system reboot. However, kernel has the
> capability to recover from this error.
> 
> Fix it by performing a force kill when memory_failure() abnormal fails or when
> other abnormal synchronous errors occur.

Just like that?

Without giving the process the opportunity to even save its other data?

So this all is still very confusing, patches definitely need splitting
and this whole thing needs restraint.

You go and do this: you split *each* issue you're addressing into
a separate patch and explain it like this:

---
1. Prepare the context for the explanation briefly.

2. Explain the problem at hand.

3. "It happens because of <...>"

4. "Fix it by doing X"

5. "(Potentially do Y)."
---

and each patch explains *exactly* *one* issue, what happens, why it
happens and just the fix for it and *why* it is needed.

Otherwise, this is unreviewable.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette