RE: [PATCH 3/4] mce/copyin: fix to not SIGBUS when copying from user hits poison

"Luck, Tony" <tony.luck@xxxxxxxxx> · Tue, 13 Apr 2021 16:13:03 +0000

> So what I'm missing with all this fun is, yeah, sure, we have this
> facility out there but who's using it? Is anyone even using it at all?

Even if no applications ever do anything with it, it is still useful to avoid
crashing the whole system and just terminate one application/guest.

> If so, does it even make sense, does it need improvements, etc?

There's one more item on my long-term TODO list: add fixups so that
copy_to_user() from poison in the page cache doesn't crash, but instead
checks whether the page was clean ... in which case re-read from the
filesystem into a different physical page and retire the old page ... the
read can now succeed. If the page is dirty, then fail the read (and retire
the page ... need to make sure the filesystem knows the data for the page
was lost so subsequent reads return -EIO or something).
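
Roughly, the decision I have in mind looks like the sketch below. This is
a userspace model only, not kernel code: struct model_page,
struct model_read_result and model_read_page() are made-up names for
illustration, and the "spare" page stands in for whatever page a re-read
from the filesystem would actually fill. It also leaves out the part where
the filesystem remembers that the dirty data was lost.

```c
/* Userspace model only: none of these names are kernel APIs. */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct model_page {
	bool dirty;	/* page holds data not yet written back */
	bool poisoned;	/* hardware reported an uncorrectable error */
	bool retired;	/* page has been taken out of service */
};

struct model_read_result {
	int err;			/* 0 on success, -EIO if data was lost */
	struct model_page *page;	/* page that served the read, or NULL */
};

/*
 * Decide what a read does when it touches a page-cache page.
 * 'spare' is a hypothetical stand-in for a freshly allocated page that
 * a re-read from the filesystem would fill.
 */
struct model_read_result
model_read_page(struct model_page *pg, struct model_page *spare)
{
	struct model_read_result res = { 0, pg };

	assert(pg != NULL && spare != NULL);

	if (!pg->poisoned)
		return res;		/* normal case: nothing to recover */

	pg->retired = true;		/* the old page is retired either way */

	if (pg->dirty) {
		/* The only copy of the data is gone: fail the read. */
		res.err = -EIO;
		res.page = NULL;
		return res;
	}

	/* Clean page: the filesystem still has the data, so re-read it. */
	spare->dirty = false;
	spare->poisoned = false;
	spare->retired = false;
	res.page = spare;
	return res;
}
```

The point of the split is that a clean page is recoverable without anyone
noticing, while a dirty page can only be turned from a machine crash into
an error for the one reader.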

Page cache occupies enough memory that it is a significant source
of system crashes that could be avoided. I'm not sure
if there are any other obvious cases after this ... it all gets into
diminishing returns ... not really worth it to handle a case that
only occupies 0.00002% of memory.

> Because from where I stand it all looks like we do all these fancy
> recovery things but is userspace even paying attention or using them or
> whatever...

See above. With core counts continuing to increase, the cloud service
providers really want to see fewer events that crash the whole physical
machine (taking down dozens, or hundreds, of guest VMs).

-Tony