> So what I'm missing with all this fun is, yeah, sure, we have this
> facility out there but who's using it? Is anyone even using it at all?

Even if no applications ever do anything with it, it is still useful to
avoid crashing the whole system and just terminate one application/guest.

> If so, does it even make sense, does it need improvements, etc?

There's one more item on my long term TODO list. Add fixups so that
copy_to_user() from poison in the page cache doesn't crash, but just
checks to see if the page was clean ... in which case re-read from the
filesystem into a different physical page and retire the old page ... the
read can now succeed. If the page is dirty, then fail the read (and retire
the page ... need to make sure filesystem knows the data for the page was
lost so subsequent reads return -EIO or something).

Page cache occupies enough memory that it is a big enough source of system
crashes that could be avoided. I'm not sure if there are any other obvious
cases after this ... it all gets into diminishing returns ... not really
worth it to handle a case that only occupies 0.00002% of memory.

> Because from where I stand it all looks like we do all these fancy
> recovery things but is userspace even paying attention or using them or
> whatever...

See above. With core counts continuing to increase, the cloud service
providers really want to see fewer events that crash the whole physical
machine (taking down dozens, or hundreds, of guest VMs).

-Tony
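
The clean/dirty handling described above for poison in the page cache can
be sketched as a small userspace model. This is only an illustration under
stated assumptions: sketch_page, sketch_file and sketch_read() are
hypothetical names, not kernel structures or APIs, and a real fixup would
live in the kernel's machine check and page cache paths rather than in
code like this.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 16

/* Hypothetical stand-in for a page cache page; not a kernel structure. */
struct sketch_page {
    char data[SKETCH_PAGE_SIZE];
    bool poisoned;  /* hardware reported an uncorrectable error here */
    bool dirty;     /* holds data that was never written back */
    bool retired;   /* taken out of service, must not be reused */
};

/* Hypothetical stand-in for a file: backing store plus a lost-data flag. */
struct sketch_file {
    char backing[SKETCH_PAGE_SIZE];  /* what the filesystem has on disk */
    bool data_lost;                  /* set once dirty data was lost */
    struct sketch_page *cached;      /* page currently caching the file */
};

/*
 * Copy the cached page into dst. Returns 0 on success or -EIO.
 * "spare" is a different physical page to re-read into if needed.
 */
int sketch_read(struct sketch_file *f, struct sketch_page *spare, char *dst)
{
    assert(f != NULL && f->cached != NULL && spare != NULL && dst != NULL);

    /* Data already known to be lost: subsequent reads keep failing. */
    if (f->data_lost)
        return -EIO;

    struct sketch_page *pg = f->cached;

    if (pg->poisoned) {
        /* Either way the bad page is retired. */
        pg->retired = true;

        if (pg->dirty) {
            /* Dirty: the only copy is gone, so fail and remember it. */
            f->data_lost = true;
            return -EIO;
        }

        /* Clean: re-read from the filesystem into a different page. */
        memcpy(spare->data, f->backing, SKETCH_PAGE_SIZE);
        spare->poisoned = false;
        spare->dirty = false;
        spare->retired = false;
        f->cached = spare;
        pg = spare;
    }

    /* The read can now succeed from a healthy page. */
    memcpy(dst, pg->data, SKETCH_PAGE_SIZE);
    return 0;
}
```

In this model a poisoned clean page costs one extra read from the backing
store, while a poisoned dirty page turns into a sticky -EIO for that file
instead of a crash of the whole machine.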