Re: [PATCH] io_uring/openclose: add support for IORING_OP_FIXED_FD_INSTALL

Jens Axboe <axboe@xxxxxxxxx> · Fri, 8 Dec 2023 14:49:37 -0700

On 12/8/23 2:14 PM, Christian Brauner wrote:
> On Thu, Dec 07, 2023 at 08:11:45PM -0700, Jens Axboe wrote:
>> io_uring can currently open/close regular files or fixed/direct
>> descriptors. Or you can instantiate a fixed descriptor from a regular
>> one, and then close the regular descriptor. But you currently can't turn
>> a purely fixed/direct descriptor into a regular file descriptor.
>>
>> IORING_OP_FIXED_FD_INSTALL adds support for installing a direct
>> descriptor into the normal file table, just like receiving a file
>> descriptor or opening a new file would do. This is all nicely abstracted
>> into receive_fd(), and hence adding support for this is truly trivial.
>>
>> Since direct descriptors are only usable within io_uring itself, it can
>> be useful to turn them into real file descriptors if they ever need to
>> be accessed via normal syscalls. This can either be a transitory thing,
>> or just a permanent transition for a given direct descriptor.
>>
>> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
>> ---
> 
> Would you mind giving me a Suggested-by?

Sure, I can do that. I'll send out a v2, added a few comments and also
changed the flags location. But no real changes otherwise.

> So this interesting. Usually we allow to receive fds if the task we're
> taking a reference to a file from:
> 
> (1) is cooperating: SCM_RIGHTS
> (2) is the same task that's taking an addition reference: dup*(),
>     seccomp notify addfd
> 
> Both cases ensure that the file->f_count == and fdtable->count == 1
> optimizations continue to work correctly in the regular system call
> fdget*() routines.
> 
> I guess that this is a variant of (2). Effectively a dup*() from a
> private file table into the regular fdtable.

Right.

> So I just want to make sure we don't break this here because we broke
> that on accident before (pidfd_getfd()):
> 
> A fixed file always has file->f_count == 1. The private io_uring fdtable
> is:

It doesn't necessarily always have f_count of 1. One way to instantiate
a direct descriptor is to open the file normally, then register it. If
the task doesn't close the normal file descriptor, then it'd have an
elevated count of 2. But normally you'd expect the normal file to be
closed, or the file opened/instantiated as a direct descriptor upfront.
In that case, it should have a ref of 1.

> * shared between all io workers?

Yep, they are just like threads in that sense and share the file table.

> * accessible to all processes that have mapped the io_uring ring?

The direct descriptors are just like a file_table, except it's private
to the ring itself. If you can invoke io_uring_enter(2) on that ring,
then you have access to that direct descriptor table.

> And fixed files can only be used from within io_uring itself.

Correct, the index only makes sense within a specific ring.

> So multiple caller might issue fixed file rquests to different io
> workers for concurrent operations. The file->f_count stays at 1 for all
> of them. Someone now requests a regular fd for that fixed file.
> 
> So now an fixed-to-fd requests comes in. The file->f_count is bumped > 1
> and that fd is installed into the fdtable of the caller through one of
> the io worker threads spawned for that caller?

No worker thread is involved for this, unless you asked for that by eg
setting IOSQE_ASYNC in the sqe before it is submitted. The default would
be that the task that does io_uring_enter(2) and submits that sqe would
be the one that installs the file X with direct descriptor Y into its
normal file table.

> Assume caller now issues a readdir() operation on that fd. file->f_count
> will be > 1 for as long as it's used as a fixed file. So regular system
> call fdget*() locking rules should work even with that new extension
> added.
> 
> I think that's sound.

Right, I don't think this changes fget/fdget rules.

> But I want to share some doubts as well. I think the concept of fixed
> files is pretty neat. But it poses a new semantic challenge to
> userspace.
> 
> io_uring_setup() gives you an fd back that you use to mmap() the
> io_uring. You make that io_uring_setup() fd a fixed file. You close the
> regular fd but now the io_uring instance stays around. You create and
> register a bunch of fixed files.
> 
> So now you've set up an io_uring instance that is completely made up of
> fixed files including the io_uring file itself.
> 
> So a process that assumes that once it has closed all fds in its fdtable
> of all threads that all resources have been released.
> 
> But that's not true now. At least for some pretty obvious cases such as
> filesystems. For example, io_uring could just pin random filesystems in
> there. Say it holds open a file in a btrfs mount and the service
> manager having closed all fds now tries unmount it and fails
> inexplicably.
> 
> So effectively, fixed files let you completely hide files and that's
> potentially a problem. So cool idea but this might cause us some trouble
> later down the road.

That is true, you'd have to explicitly or implicitly cancel (this
happens at exec time too, for example) requests and then unregister the
fixed files to release those resources. Or teardown the ring, which
would do both. Or just not close the ring fd itself even if it's
registered, then it would continue to reside in the normal file table as
well.

> Reviewed-by: Christian Brauner <brauner@xxxxxxxxxx>

Thanks!

> Two comments below.
> 
>>  include/uapi/linux/io_uring.h |  2 ++
>>  io_uring/opdef.c              |  9 +++++++++
>>  io_uring/openclose.c          | 37 +++++++++++++++++++++++++++++++++++
>>  io_uring/openclose.h          |  3 +++
>>  4 files changed, 51 insertions(+)
>>
>> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
>> index f1c16f817742..af82aab9e632 100644
>> --- a/include/uapi/linux/io_uring.h
>> +++ b/include/uapi/linux/io_uring.h
>> @@ -71,6 +71,7 @@ struct io_uring_sqe {
>>  		__u32		uring_cmd_flags;
>>  		__u32		waitid_flags;
>>  		__u32		futex_flags;
>> +		__u32		install_fd_flags;
>>  	};
>>  	__u64	user_data;	/* data to be passed back at completion time */
>>  	/* pack this to avoid bogus arm OABI complaints */
>> @@ -253,6 +254,7 @@ enum io_uring_op {
>>  	IORING_OP_FUTEX_WAIT,
>>  	IORING_OP_FUTEX_WAKE,
>>  	IORING_OP_FUTEX_WAITV,
>> +	IORING_OP_FIXED_FD_INSTALL,
>>  
>>  	/* this goes last, obviously */
>>  	IORING_OP_LAST,
>> diff --git a/io_uring/opdef.c b/io_uring/opdef.c
>> index 799db44283c7..6705634e5f52 100644
>> --- a/io_uring/opdef.c
>> +++ b/io_uring/opdef.c
>> @@ -469,6 +469,12 @@ const struct io_issue_def io_issue_defs[] = {
>>  		.prep			= io_eopnotsupp_prep,
>>  #endif
>>  	},
>> +	[IORING_OP_FIXED_FD_INSTALL] = {
>> +		.needs_file		= 1,
>> +		.audit_skip		= 1,
>> +		.prep			= io_install_fixed_fd_prep,
>> +		.issue			= io_install_fixed_fd,
>> +	},
>>  };
>>  
>>  const struct io_cold_def io_cold_defs[] = {
>> @@ -704,6 +710,9 @@ const struct io_cold_def io_cold_defs[] = {
>>  	[IORING_OP_FUTEX_WAITV] = {
>>  		.name			= "FUTEX_WAITV",
>>  	},
>> +	[IORING_OP_FIXED_FD_INSTALL] = {
>> +		.name			= "FIXED_FD_INSTALL",
>> +	},
>>  };
>>  
>>  const char *io_uring_get_opcode(u8 opcode)
>> diff --git a/io_uring/openclose.c b/io_uring/openclose.c
>> index fb73adb89067..5b8f79edef26 100644
>> --- a/io_uring/openclose.c
>> +++ b/io_uring/openclose.c
>> @@ -31,6 +31,11 @@ struct io_close {
>>  	u32				file_slot;
>>  };
>>  
>> +struct io_fixed_install {
>> +	struct file			*file;
>> +	int				flags;
>> +};
>> +
>>  static bool io_openat_force_async(struct io_open *open)
>>  {
>>  	/*
>> @@ -254,3 +259,35 @@ int io_close(struct io_kiocb *req, unsigned int issue_flags)
>>  	io_req_set_res(req, ret, 0);
>>  	return IOU_OK;
>>  }
>> +
>> +int io_install_fixed_fd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>> +{
>> +	struct io_fixed_install *ifi;
>> +
>> +	if (sqe->off || sqe->addr || sqe->len || sqe->buf_index ||
>> +	    sqe->splice_fd_in || sqe->addr3)
>> +		return -EINVAL;
>> +
>> +	/* must be a fixed file */
>> +	if (!(req->flags & REQ_F_FIXED_FILE))
>> +		return -EBADF;
>> +
>> +	ifi = io_kiocb_to_cmd(req, struct io_fixed_install);
>> +
>> +	/* really just O_CLOEXEC or not */
>> +	ifi->flags = READ_ONCE(sqe->install_fd_flags);
> 
> I'm a big big fan of having all new fd returning apis return their fds
> O_CLOEXEC by default and forcing userspace to explicitly turn this off
> via fcntl(). pidfds are cloexec by default, so are seccomp notifier fds.

io_uring fd itself is also O_CLOEXEC by default. We can certainly make
this tweak here, but it is directly configurable by the task that issues
the sqe. If you want O_CLOEXEC, then you should set it.

That said, not opposed to making this the default. But it does mean I'd
have to define a private opcode flag for this, so it can be turned off.
At least that seems saner than needing to do fcntl() after the fact.
This isn't a huge issue and we can certainly do that. Let me know what
you prefer!

>> +	return 0;
>> +}
>> +
>> +int io_install_fixed_fd(struct io_kiocb *req, unsigned int issue_flags)
>> +{
>> +	struct io_fixed_install *ifi;
>> +	int ret;
>> +
>> +	ifi = io_kiocb_to_cmd(req, struct io_fixed_install);
>> +	ret = receive_fd(req->file, ifi->flags);
> 
> After changes currently in vfs.misc the helper is now:
> 
> receive_fd(req->file, NULL, ifi->flags)
> 
> So fyi, this will cause a build failure in -next.

Ah, I should probably pull that in first then. Thanks for the headsup.

-- 
Jens Axboe