Re: [PATCH v4 0/4] Userspace controls soft-offline pages

Thanks for your comment, Andi.

On Thu, Jun 20, 2024 at 3:53 PM Andi Kleen <ak@xxxxxxxxxxxxxxx> wrote:
>
> Jiaqi Yan <jiaqiyan@xxxxxxxxxx> writes:
>
> > Correctable memory errors are very common on servers with large
> > amount of memory, and are corrected by ECC, but with two
> > pain points to users:
> > 1. Correction usually happens on the fly and adds latency overhead
> > 2. Not-fully-proved theory states excessive correctable memory
> >    errors can develop into uncorrectable memory error.
>
> This patchkit is amusing (or maybe sad) because it basically tries to
> reconstruct the original soft offline design using a user space daemon
> instead of doing policy badly in the kernel.

Some clarifications: I don't intend to reconstruct the original design.
This patchset can also be read as "patch the remaining places where the
kernel soft offlines pages behind the back of the userspace daemon".
I agree with you (IIUC) that the policy for corrected memory errors
should live in userspace. But some code paths in the kernel don't
respect that today (either they have a reason not to, or they simply
forgot to). enable_soft_offline is basically the big button that lets
userspace block these kernel violators.
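
To make the "big button" concrete, here is a minimal sketch of how a
userspace daemon might flip it. It assumes the knob is exposed as a
0/1 sysctl file at /proc/sys/vm/enable_soft_offline; that path and the
helper names are illustrative assumptions, not something stated above.

```python
from pathlib import Path

# Assumed location of the knob; the text above only names "enable_soft_offline".
SYSCTL_PATH = Path("/proc/sys/vm/enable_soft_offline")


def read_soft_offline(path: Path = SYSCTL_PATH) -> bool:
    """Return True if the file currently holds "1" (soft offline allowed)."""
    return path.read_text().strip() == "1"


def set_soft_offline(enabled: bool, path: Path = SYSCTL_PATH) -> None:
    """Write 1 to allow kernel-initiated soft offline, 0 to block it."""
    path.write_text("1\n" if enabled else "0\n")
```

A daemon that wants to own the policy would call set_soft_offline(False)
once at startup (root is needed to write under /proc/sys).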

>
> You can still have it by enabling CONFIG_X86_MCELOG_LEGACY and
> use http://www.mcelog.org or an equivalent daemon of your choosing
> that listens to /dev/mcelog.

Unless I missed something important in
https://github.com/andikleen/mcelog and
arch/x86/kernel/cpu/mce/dev-mcelog.c, I don't think /dev/mcelog works
on ARM platforms, where CPER is used to convey hardware errors from the
platform to the OS.

In addition, again taking an ARM platform as an example, I don't think
any userspace daemon has a way to stop the GHES driver from soft
offlining memory pages:
https://github.com/torvalds/linux/blob/master/drivers/acpi/apei/ghes.c#L521.
Of course, that is not a problem if userspace always wants soft
offline to happen.

>
> -Andi
>
>




