Re: [PATCH] filemap: Handle error return from __filemap_get_folio()

Peter Xu <peterx@xxxxxxxxxx> · Wed, 10 May 2023 13:27:31 -0700

On Tue, May 09, 2023 at 03:19:18PM -0400, Johannes Weiner wrote:
> On Sat, May 06, 2023 at 10:04:48AM -0700, Linus Torvalds wrote:
> > On Sat, May 6, 2023 at 9:35 AM Linus Torvalds
> > <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> > >
> > > And yes, the simplest fix for the "wrong test" would be to just add a
> > > new "out_nofolio" error case after "out_retry", and use that.
> > >
> > > However, even that seems wrong, because the return value for that path
> > > is the wrong one.
> > 
> > Actually, my suggested patch is _also_ wrong.
> > 
> > The problem is that we do need to return VM_FAULT_RETRY to let the
> > caller know that we released the mmap_lock.
> > 
> > And once we return VM_FAULT_RETRY, the other error bits don't even matter.
> > 
> > So while I think the *right* thing to do is to return VM_FAULT_OOM |
> > VM_FAULT_RETRY, that doesn't actually end up working, because if
> > VM_FAULT_RETRY is set, the caller will know that "yes, mmap_lock was
> > dropped", but the callers will also just ignore the other bits and
> > unconditionally retry.
> > 
> > How very very annoying.
> > 
> > This was introduced several years ago by commit 6b4c9f446981
> > ("filemap: drop the mmap_sem for all blocking operations").
> > 
> > Looking at that, we have at least one other similar error case wrong
> > too: the "page_not_uptodate" case carefully checks for IO errors and
> > retries only if there was no error (or for the AOP_TRUNCATED_PAGE)
> > case.
> > 
> > For an actual IO error on page reading, it returns VM_FAULT_SIGBUS.
> > 
> > Except - again - for that "if (fpin) goto out_retry" case, which will
> > just return VM_FAULT_RETRY and retry the fault.
> > 
> > I do not believe that retrying the fault is the right thing to do when
> > we ran out of memory, or when we had an IO error, and I do not think
> > it was intentional that the error handling was changed.
> 
> This is a while ago and the code has changed quite a bit since, so
> bear with me.
> 
> Originally, we only ever did a maximum of two tries: one where the
> lock might be dropped to kick off IO, then a synchronous one. IIRC the
> thinking at the time was that events like OOMs and IO failures are
> rare enough that doing the retry anyway (even if somewhat pointless)
> and reacting to the issue then (if it persisted) was a tradeoff to
> keep the retry logic simple.
> 
> Since 4064b9827063 ("mm: allow VM_FAULT_RETRY for multiple times") we
> don't clear FAULT_FLAG_ALLOW_RETRY anymore though, and we might see
> more than one loop. At least outside the page cache. So I agree it
> makes sense to look at the return value more carefully and act on
> errors more timely in the arch handler.
> 
> Draft patch below. It survives a boot and a will-it-scale smoke test,
> but I haven't put it through the grinder yet.
> 
> One thing that gave me pause is this comment:
> 
> 	/*
> 	 * If we need to retry the mmap_lock has already been released,
> 	 * and if there is a fatal signal pending there is no guarantee
> 	 * that we made any progress. Handle this case first.
> 	 */
> 
> I think it made sense when it was added in 26178ec11ef3 ("x86: mm:
> consolidate VM_FAULT_RETRY handling"). But after 39678191cd89
> ("x86/mm: use helper fault_signal_pending()") it's in a misleading
> location, since the signal handling is above it.
> 
> So I'm removing it, but please let me know if I missed something.
> 
> ---
>  arch/x86/mm/fault.c | 40 +++++++++++++++++++++++-----------------
>  mm/filemap.c        | 18 +++++++++++-------
>  2 files changed, 34 insertions(+), 24 deletions(-)
> 
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index e4399983c50c..f1d242be723f 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1456,20 +1456,15 @@ void do_user_addr_fault(struct pt_regs *regs,
>  		return;
>  
>  	/*
> -	 * If we need to retry the mmap_lock has already been released,
> -	 * and if there is a fatal signal pending there is no guarantee
> -	 * that we made any progress. Handle this case first.
> +	 * If we need to retry the mmap_lock has already been released.
>  	 */
> -	if (unlikely(fault & VM_FAULT_RETRY)) {
> -		flags |= FAULT_FLAG_TRIED;
> -		goto retry;
> -	}
> +	if (likely(!(fault & VM_FAULT_RETRY)))
> +		mmap_read_unlock(mm);
>  
> -	mmap_read_unlock(mm);
>  #ifdef CONFIG_PER_VMA_LOCK
>  done:
>  #endif
> -	if (likely(!(fault & VM_FAULT_ERROR)))
> +	if (likely(!(fault & (VM_FAULT_ERROR|VM_FAULT_RETRY))))
>  		return;
>  
>  	if (fatal_signal_pending(current) && !user_mode(regs)) {
> @@ -1493,15 +1488,26 @@ void do_user_addr_fault(struct pt_regs *regs,
>  		 * oom-killed):
>  		 */
>  		pagefault_out_of_memory();
> -	} else {
> -		if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|
> -			     VM_FAULT_HWPOISON_LARGE))
> -			do_sigbus(regs, error_code, address, fault);
> -		else if (fault & VM_FAULT_SIGSEGV)
> -			bad_area_nosemaphore(regs, error_code, address);
> -		else
> -			BUG();
> +		return;
> +	}
> +
> +	if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|
> +		     VM_FAULT_HWPOISON_LARGE)) {
> +		do_sigbus(regs, error_code, address, fault);
> +		return;
>  	}
> +
> +	if (fault & VM_FAULT_SIGSEGV) {
> +		bad_area_nosemaphore(regs, error_code, address);
> +		return;
> +	}
> +
> +	if (fault & VM_FAULT_RETRY) {
> +		flags |= FAULT_FLAG_TRIED;
> +		goto retry;
> +	}
> +
> +	BUG();
>  }
>  NOKPROBE_SYMBOL(do_user_addr_fault);
>  
> diff --git a/mm/filemap.c b/mm/filemap.c
> index b4c9bd368b7e..f97ca5045c19 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3290,10 +3290,11 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>  					  FGP_CREAT|FGP_FOR_MMAP,
>  					  vmf->gfp_mask);
>  		if (IS_ERR(folio)) {
> +			ret = VM_FAULT_OOM;
>  			if (fpin)
>  				goto out_retry;
>  			filemap_invalidate_unlock_shared(mapping);
> -			return VM_FAULT_OOM;
> +			return ret;
>  		}
>  	}
>  
> @@ -3362,15 +3363,18 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>  	 */
>  	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
>  	error = filemap_read_folio(file, mapping->a_ops->read_folio, folio);
> -	if (fpin)
> -		goto out_retry;
>  	folio_put(folio);
> -
> -	if (!error || error == AOP_TRUNCATED_PAGE)
> +	folio = NULL;
> +	if (!error || error == AOP_TRUNCATED_PAGE) {
> +		if (fpin)
> +			goto out_retry;
>  		goto retry_find;
> +	}
> +	ret = VM_FAULT_SIGBUS;
> +	if (fpin)
> +		goto out_retry;
>  	filemap_invalidate_unlock_shared(mapping);
> -
> -	return VM_FAULT_SIGBUS;
> +	return ret;
>  
>  out_retry:
>  	/*
> -- 
> 2.40.1
> 

The change looks all right to me.

Acked-by: Peter Xu <peterx@xxxxxxxxxx>

In the long term maybe we want to clean up the VM_FAULT_* entries a bit.
E.g., here VM_FAULT_RETRY doesn't really mean "we should retry the fault";
when any error bit is set, it is only a hint that we've released the mmap
lock.
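
To make that reading concrete, here is a rough userspace model of the order
in which the patched x86 handler looks at the bits.  The flag values and the
helper names are invented for illustration and are not the kernel's; the
fatal-signal and kernel-mode branches are left out.  It also assumes the
filemap side returns the error bit together with RETRY, as discussed earlier
in the thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values, for illustration only (not the kernel's). */
#define F_OOM     0x1u
#define F_SIGBUS  0x2u
#define F_SIGSEGV 0x4u
#define F_RETRY   0x8u

enum step { STEP_DONE, STEP_OOM, STEP_SIGBUS, STEP_SIGSEGV, STEP_RETRY };

/* In this model RETRY is the only sign that the lock was already dropped. */
bool handler_must_unlock(unsigned int fault)
{
	return !(fault & F_RETRY);
}

/* Error bits are looked at before RETRY, so OOM|RETRY does not loop. */
enum step next_step(unsigned int fault)
{
	if (fault & F_OOM)
		return STEP_OOM;
	if (fault & F_SIGBUS)
		return STEP_SIGBUS;
	if (fault & F_SIGSEGV)
		return STEP_SIGSEGV;
	if (fault & F_RETRY)
		return STEP_RETRY;
	return STEP_DONE;
}
```

With an error bit set, RETRY only changes the answer of
handler_must_unlock(); it no longer decides what happens next.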

Meanwhile it's also not the only flag that expresses that, because we now
have VM_FAULT_COMPLETED too, so it's debatable which one we should use here
going purely by the logic and definition of the retvals.

One way to make this slightly cleaner in the future is to have a flag
solely for "we have released the mmap lock".  Fundamentally we could
consider renaming COMPLETED->MM_RELEASED, then use RETRY only to hint
"whether we really want to retry" and leave "whether the mmap lock was
released" to the new flag.

It could look like:

  MM_RELEASED    (new) RETRY
  0              0             -> page fault resolved (old "retval=0")
  0              1             -> retry with mmap lock held (currently invalid, so let's ignore this)
  1              0             -> resolved and lock released (old COMPLETED)
  1              1             -> lock released, need to retry (old RETRY)

Then IIUC for error cases we can return MM_RELEASED|ERROR for whatever
error (without setting RETRY).
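
The table could be written down as a small decoder.  This is only a
userspace sketch: MM_RELEASED is the name proposed above, while the bit
values, the FAULT_* macro names and the helper functions are made up for
illustration and are not kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values, for illustration only (not the kernel's). */
#define FAULT_ERROR        0x1u  /* stands in for any VM_FAULT_ERROR bit */
#define FAULT_RETRY        0x2u  /* "please retry the fault", nothing else */
#define FAULT_MM_RELEASED  0x4u  /* "mmap lock already dropped" */

/* The caller still holds the mmap lock unless MM_RELEASED is set. */
bool caller_must_unlock(unsigned int fault)
{
	return !(fault & FAULT_MM_RELEASED);
}

/* Retry only when asked to and no error is reported. */
bool caller_should_retry(unsigned int fault)
{
	return (fault & FAULT_RETRY) && !(fault & FAULT_ERROR);
}

/* RETRY without MM_RELEASED is the row the table marks as invalid. */
bool fault_is_valid(unsigned int fault)
{
	return !(fault & FAULT_RETRY) || (fault & FAULT_MM_RELEASED);
}
```

The point of the split is that each question ("do I unlock?", "do I
retry?") is answered by one bit, so MM_RELEASED|ERROR needs no special
casing in the caller.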

Not sure whether it helps in any form.  Even if it would, it can definitely
be done on top.

Thanks,

-- 
Peter Xu