On Mon, May 4, 2020 at 1:26 PM Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>
> On Mon, May 4, 2020 at 1:05 PM Luck, Tony <tony.luck@xxxxxxxxx> wrote:
> >
> > > When a copy function hits a bad page and the page is not yet known to
> > > be bad, what does it do? (I.e. the page was believed to be fine but
> > > the copy function gets #MC.) Does it unmap it right away? What does
> > > it return?
> >
> > I suspect that we will only ever find a handful of situations where the
> > kernel can recover from memory that has gone bad that are worth fixing
> > (got to be some code path that touches a meaningful fraction of memory,
> > otherwise we get code complexity without any meaningful payoff).
> >
> > I don't think we'd want different actions for the cases of "we just found out
> > now that this page is bad" and "we got a notification an hour ago that this
> > page had gone bad". Currently we treat those the same for application
> > errors ... SIGBUS either way[1].
>
> Oh, I agree that the end result should be the same. I'm thinking more
> about the mechanism and the internal API. As a somewhat silly example
> of why there's a difference, the first time we try to read from bad
> memory, we can expect #MC (I assume, on a sensibly functioning
> platform). But, once we get the #MC, I imagine that the #MC handler
> will want to unmap the page to prevent a storm of additional #MC
> events on the same page -- given the awful x86 #MC design, too many
> all at once is fatal. So the next time we copy_mc_to_user() or
> whatever from the memory, we'll get #PF instead. Or maybe that #MC
> will defer the unmap?

After the consumption the PMEM driver arranges for the page to never be
mapped again via its "badblocks" list.

> So the point of my questions is that the overall design should be at
> least somewhat settled before anyone tries to review just the copy
> functions.

I would say that DAX / PMEM stretches the Linux memory error handling
model beyond what it was originally designed to handle.
The primary concepts that bend the assumptions of mm/memory-failure.c
are:

1/ DAX pages cannot be offlined via the page allocator.

2/ DAX pages (well, cachelines in those pages) can be asynchronously
   marked poisoned by a platform or device patrol scrub facility.

3/ DAX pages might be repaired by writes.

Currently 1/ and 2/ are managed by a per-block-device "badblocks" list
that is populated by scrub results and also amended when #MC is raised
(see nfit_handle_mce()). When fs/dax.c services faults it will decline
to map the page if the physical file extent intersects a bad block.

There is also support for sending SIGBUS if userspace races the
scrubber to consume the badblock. However, that uses the standard
'struct page' error model and assumes that a file-backed page is 1:1
mapped to a file. This requirement prevents filesystems from enabling
reflink. That collision and the desire to enable reflink are why we are
now investigating supplanting the mm/memory-failure.c model: when the
page is "owned" by a filesystem, invoke the filesystem to handle the
memory error across all impacted files.

The presence of 3/ means that any action error handling takes to
disable access to the page needs to be capable of being undone, which
runs counter to the mm/memory-failure.c assumption that offlining is a
one-way trip.
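To make the two properties above concrete, here is a small userspace
sketch (not the real kernel badblocks API in block/badblocks.c, and the
names bb_set/bb_intersects/bb_clear are invented for illustration) of a
badblocks list that supports both the fault-path intersection check and
the "repaired by writes" undo that a one-way page offline cannot model:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a per-device "badblocks" list; the real kernel
 * structure and API differ in detail. Extents are sectors. */
struct bad_range {
	unsigned long long start;  /* first bad sector */
	unsigned long long len;    /* number of bad sectors */
	bool cleared;              /* set when a write repairs the range */
};

struct badblocks_model {
	struct bad_range ranges[16];
	size_t count;
};

/* Record a bad extent, e.g. from a scrub result or an #MC event. */
static void bb_set(struct badblocks_model *bb,
		   unsigned long long start, unsigned long long len)
{
	if (bb->count < 16) {
		bb->ranges[bb->count].start = start;
		bb->ranges[bb->count].len = len;
		bb->ranges[bb->count].cleared = false;
		bb->count++;
	}
}

/* The fault-path decision: does the requested physical file extent
 * intersect a known bad block? If so, decline to map the page. */
static bool bb_intersects(const struct badblocks_model *bb,
			  unsigned long long start, unsigned long long len)
{
	for (size_t i = 0; i < bb->count; i++) {
		const struct bad_range *r = &bb->ranges[i];

		if (!r->cleared &&
		    start < r->start + r->len && r->start < start + len)
			return true;
	}
	return false;
}

/* Concept 3/: a write can repair the media, so the entry must be
 * removable -- the disable action has to be undoable. */
static void bb_clear(struct badblocks_model *bb,
		     unsigned long long start, unsigned long long len)
{
	for (size_t i = 0; i < bb->count; i++) {
		struct bad_range *r = &bb->ranges[i];

		if (start <= r->start && r->start + r->len <= start + len)
			r->cleared = true;
	}
}
```

The point of the sketch is only the lifecycle: an extent goes bad
(scrub or #MC), faults on it are refused while it is listed, and a
repairing write removes it again, unlike the one-way offline in
mm/memory-failure.c.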