On Sat, May 06, 2023 at 10:04:48AM -0700, Linus Torvalds wrote:
> On Sat, May 6, 2023 at 9:35 AM Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > And yes, the simplest fix for the "wrong test" would be to just add a
> > new "out_nofolio" error case after "out_retry", and use that.
> >
> > However, even that seems wrong, because the return value for that path
> > is the wrong one.
>
> Actually, my suggested patch is _also_ wrong.
>
> The problem is that we do need to return VM_FAULT_RETRY to let the
> caller know that we released the mmap_lock.
>
> And once we return VM_FAULT_RETRY, the other error bits don't even matter.
>
> So while I think the *right* thing to do is to return VM_FAULT_OOM |
> VM_FAULT_RETRY, that doesn't actually end up working, because if
> VM_FAULT_RETRY is set, the caller will know that "yes, mmap_lock was
> dropped", but the callers will also just ignore the other bits and
> unconditionally retry.
>
> How very very annoying.
>
> This was introduced several years ago by commit 6b4c9f446981
> ("filemap: drop the mmap_sem for all blocking operations").
>
> Looking at that, we have at least one other similar error case wrong
> too: the "page_not_uptodate" case carefully checks for IO errors and
> retries only if there was no error (or for the AOP_TRUNCATED_PAGE
> case).
>
> For an actual IO error on page reading, it returns VM_FAULT_SIGBUS.
>
> Except - again - for that "if (fpin) goto out_retry" case, which will
> just return VM_FAULT_RETRY and retry the fault.
>
> I do not believe that retrying the fault is the right thing to do when
> we ran out of memory, or when we had an IO error, and I do not think
> it was intentional that the error handling was changed.

This was a while ago and the code has changed quite a bit since, so
bear with me.

Originally, we only ever did a maximum of two tries: one where the
lock might be dropped to kick off IO, then a synchronous one.

IIRC the thinking at the time was that events like OOMs and IO
failures are rare enough that doing the retry anyway (even if
somewhat pointless) and reacting to the issue then (if it persisted)
was an acceptable tradeoff to keep the retry logic simple.

Since 4064b9827063 ("mm: allow VM_FAULT_RETRY for multiple times") we
don't clear FAULT_FLAG_ALLOW_RETRY anymore though, and we might see
more than one loop - at least outside the page cache. So I agree it
makes sense to look at the return value more carefully and act on
errors more promptly in the arch handler.

Draft patch below. It survives a boot and a will-it-scale smoke test,
but I haven't put it through the grinder yet.

One thing that gave me pause is this comment:

	/*
	 * If we need to retry the mmap_lock has already been released,
	 * and if there is a fatal signal pending there is no guarantee
	 * that we made any progress. Handle this case first.
	 */

I think it made sense when it was added in 26178ec11ef3 ("x86: mm:
consolidate VM_FAULT_RETRY handling"). But after 39678191cd89
("x86/mm: use helper fault_signal_pending()") it's in a misleading
location, since the signal handling is above it. So I'm removing it,
but please let me know if I missed something.
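To make the failure mode concrete, here is roughly the pre-patch
shape of the arch-side check order (a simplified sketch based on the
lines the patch removes below, not the complete x86 code). Because
VM_FAULT_RETRY is tested before the error bits, a return value like
VM_FAULT_OOM | VM_FAULT_RETRY would just loop and the error would
never be acted on:

retry:
	fault = handle_mm_fault(vma, address, flags, regs);

	/* ... fault_signal_pending() handling ... */

	/*
	 * RETRY wins over everything else: any error bits set
	 * alongside it are silently dropped here.
	 */
	if (unlikely(fault & VM_FAULT_RETRY)) {
		flags |= FAULT_FLAG_TRIED;
		goto retry;
	}

	mmap_read_unlock(mm);

	if (likely(!(fault & VM_FAULT_ERROR)))
		return;

	/* only now are the OOM/SIGBUS/SIGSEGV bits inspected */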
---
 arch/x86/mm/fault.c | 40 +++++++++++++++++++++++-----------------
 mm/filemap.c        | 18 +++++++++++-------
 2 files changed, 34 insertions(+), 24 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index e4399983c50c..f1d242be723f 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1456,20 +1456,15 @@ void do_user_addr_fault(struct pt_regs *regs,
 		return;
 
 	/*
-	 * If we need to retry the mmap_lock has already been released,
-	 * and if there is a fatal signal pending there is no guarantee
-	 * that we made any progress. Handle this case first.
+	 * If we need to retry the mmap_lock has already been released.
 	 */
-	if (unlikely(fault & VM_FAULT_RETRY)) {
-		flags |= FAULT_FLAG_TRIED;
-		goto retry;
-	}
+	if (likely(!(fault & VM_FAULT_RETRY)))
+		mmap_read_unlock(mm);
 
-	mmap_read_unlock(mm);
 #ifdef CONFIG_PER_VMA_LOCK
 done:
 #endif
-	if (likely(!(fault & VM_FAULT_ERROR)))
+	if (likely(!(fault & (VM_FAULT_ERROR|VM_FAULT_RETRY))))
 		return;
 
 	if (fatal_signal_pending(current) && !user_mode(regs)) {
@@ -1493,15 +1488,26 @@ void do_user_addr_fault(struct pt_regs *regs,
 		 * oom-killed):
 		 */
 		pagefault_out_of_memory();
-	} else {
-		if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|
-			     VM_FAULT_HWPOISON_LARGE))
-			do_sigbus(regs, error_code, address, fault);
-		else if (fault & VM_FAULT_SIGSEGV)
-			bad_area_nosemaphore(regs, error_code, address);
-		else
-			BUG();
+		return;
+	}
+
+	if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|
+		     VM_FAULT_HWPOISON_LARGE)) {
+		do_sigbus(regs, error_code, address, fault);
+		return;
 	}
+
+	if (fault & VM_FAULT_SIGSEGV) {
+		bad_area_nosemaphore(regs, error_code, address);
+		return;
+	}
+
+	if (fault & VM_FAULT_RETRY) {
+		flags |= FAULT_FLAG_TRIED;
+		goto retry;
+	}
+
+	BUG();
 }
 NOKPROBE_SYMBOL(do_user_addr_fault);
 
diff --git a/mm/filemap.c b/mm/filemap.c
index b4c9bd368b7e..f97ca5045c19 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3290,10 +3290,11 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 					  FGP_CREAT|FGP_FOR_MMAP,
 					  vmf->gfp_mask);
 		if (IS_ERR(folio)) {
+			ret = VM_FAULT_OOM;
 			if (fpin)
 				goto out_retry;
 			filemap_invalidate_unlock_shared(mapping);
-			return VM_FAULT_OOM;
+			return ret;
 		}
 	}
 
@@ -3362,15 +3363,18 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 	 */
 	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
 	error = filemap_read_folio(file, mapping->a_ops->read_folio, folio);
-	if (fpin)
-		goto out_retry;
 	folio_put(folio);
-
-	if (!error || error == AOP_TRUNCATED_PAGE)
+	folio = NULL;
+	if (!error || error == AOP_TRUNCATED_PAGE) {
+		if (fpin)
+			goto out_retry;
 		goto retry_find;
+	}
+	ret = VM_FAULT_SIGBUS;
+	if (fpin)
+		goto out_retry;
 	filemap_invalidate_unlock_shared(mapping);
-
-	return VM_FAULT_SIGBUS;
+	return ret;
 
 out_retry:
	/*
-- 
2.40.1