Re: [PATCH] vfs: move fdput() to right place in ksys_sync_file_range()

Al Viro <viro@xxxxxxxxxxxxxxxxxx> · Wed, 11 May 2022 21:43:46 +0000

[bpf folks Cc'd]

On Wed, May 11, 2022 at 07:01:34PM +0000, Eric Biggers wrote:
> On Wed, May 11, 2022 at 04:51:34PM +0100, Matthew Wilcox wrote:
> > On Wed, May 11, 2022 at 11:45:03AM -0400, Chengguang Xu wrote:
> > > Move fdput() to right place in ksys_sync_file_range() to
> > > avoid fdput() after failed fdget().
> > 
> > Why?  fdput() is already conditional on FDPUT_FPUT so you're ...
> > optimising the failure case?
> 
> "fdput() after failed fdget()" has confused people before, so IMO it's worth
> cleaning this up.  But the commit message should make clear that it's a cleanup,
> not a bug fix.  Also I recommend using an early return:
> 
> 	f = fdget(fd);
> 	if (!f.file)
> 		return -EBADF;
> 	ret = sync_file_range(f.file, offset, nbytes, flags);
> 	fdput(f);
> 	return ret;

FWIW, fdput() after failed fdget() is rare, but there's no fundamental reasons
why it would be wrong.  No objections against that patch, anyway.

Out of curiousity, I've just looked at the existing users.  In mainline we have
203 callers of fdput()/fdput_pos(); all but 7 never get reached with NULL ->file.

1) There's ksys_sync_file_range(), kernel_read_file_from_fd() and ksys_readahead() -
all with similar pattern.  I'm not sure that for readahead(2) "not opened for
read" should yield the same error as "bad descriptor", but since it's been a part
of userland ABI for a while...

2) two callers in perf_event_open(2) are playing silly buggers with explicit
        struct fd group = {NULL, 0};
and rely upon "fdput() is a no-op if we hadn't touched that" (note that if
we try to touch it and get NULL ->file from fdget(), we do not hit those fdput()
at all).

3) ovl_aio_put() is hard to follow (and some of the callers are poking
where they shouldn't), no idea if it's correct.  struct fd is manually
constructed there, anyway.

4) bpf generic_map_update_batch() is really asking for trouble.  The comment in
there is wrong:
        f = fdget(ufd); /* bpf_map_do_batch() guarantees ufd is valid */
*NOTHING* we'd done earlier can guarantee that.  We might have a descriptor
table shared with another thread, and it might have very well done dup2() since
the last time we'd looked things up.  IOW, this fdget() is racy - the function
assumes it refers to the same thing that gave us map back in bpf_map_do_batch(),
but it's not guaranteed at all.

I hadn't put together a reproducer, but that code is very suspicious.  As a general
rule, you should treat descriptor table as shared object, modifiable by other
threads.  It can be explicitly locked and it can be explicitly unshared, but
short of that doing a lookup for the same descriptor twice in a row can yield
different results.

What's going on there?  Do you really want the same struct file you've got back in
bpf_map_do_batch() (i.e. the one you've got the map from)?  What should happen
if the descriptor changes its meaning during (or after) the operation?