On Mon, May 4, 2020 at 1:05 PM Luck, Tony <tony.luck@xxxxxxxxx> wrote:
>
> > When a copy function hits a bad page and the page is not yet known to
> > be bad, what does it do?  (I.e. the page was believed to be fine but
> > the copy function gets #MC.)  Does it unmap it right away?  What does
> > it return?
>
> I suspect that we will only ever find a handful of situations where the
> kernel can recover from memory that has gone bad that are worth fixing
> (got to be some code path that touches a meaningful fraction of memory,
> otherwise we get code complexity without any meaningful payoff).
>
> I don't think we'd want different actions for the cases of "we just found out
> now that this page is bad" and "we got a notification an hour ago that this
> page had gone bad". Currently we treat those the same for application
> errors ... SIGBUS either way[1].

Oh, I agree that the end result should be the same.  I'm thinking more
about the mechanism and the internal API.  As a somewhat silly example
of why there's a difference, the first time we try to read from bad
memory, we can expect #MC (I assume, on a sensibly functioning
platform).  But, once we get the #MC, I imagine that the #MC handler
will want to unmap the page to prevent a storm of additional #MC
events on the same page -- given the awful x86 #MC design, too many
all at once is fatal.  So the next time we copy_mc_to_user() or
whatever from the memory, we'll get #PF instead.  Or maybe that #MC
will defer the unmap?

So the point of my questions is that the overall design should be at
least somewhat settled before anyone tries to review just the copy
functions.
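
To make the #MC-then-#PF sequence concrete, here is a toy userspace
model of it.  This is only a sketch, not kernel code: every name in it
(toy_page, toy_access, toy_mc_handler, toy_copy_mc) is hypothetical,
and the "return the number of bytes not copied" convention is an
assumption made for illustration, not a claim about what the real copy
functions do or should do.  It also assumes the handler unmaps
immediately, which is exactly the open question above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy model only; nothing here is a real kernel interface. */
enum fault { FAULT_NONE, FAULT_MC, FAULT_PF };

struct toy_page {
	bool poisoned;	/* a read would raise #MC */
	bool mapped;	/* still present in the (toy) mapping */
	char data[16];
};

/* Which exception would a read of this page raise? */
static enum fault toy_access(const struct toy_page *pg)
{
	if (!pg->mapped)
		return FAULT_PF;	/* unmapped: fault before memory is touched */
	if (pg->poisoned)
		return FAULT_MC;	/* mapped but bad: machine check */
	return FAULT_NONE;
}

/* Assumed policy: unmap right away so later reads cannot #MC again. */
static void toy_mc_handler(struct toy_page *pg)
{
	pg->mapped = false;
}

/*
 * Toy copy routine.  Returns the number of bytes NOT copied (assumed
 * convention) and reports which fault, if any, ended the copy.
 */
static size_t toy_copy_mc(char *dst, struct toy_page *src, size_t len,
			  enum fault *why)
{
	assert(len <= sizeof(src->data));

	*why = toy_access(src);
	if (*why == FAULT_MC)
		toy_mc_handler(src);
	if (*why != FAULT_NONE)
		return len;

	memcpy(dst, src->data, len);
	return 0;
}
```

Calling toy_copy_mc() twice on a page that is poisoned but still mapped
reports FAULT_MC the first time and FAULT_PF the second, so the caller
sees two different faults for the same underlying problem.  If
toy_mc_handler() deferred the unmap instead, the second call would
report FAULT_MC again, which is why the copy functions are hard to
review before that policy is decided.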