Re: [PATCH] io_uring/openclose: add support for IORING_OP_FIXED_FD_INSTALL

Jens Axboe <axboe@xxxxxxxxx> · Fri, 8 Dec 2023 15:06:25 -0700

On 12/8/23 2:56 PM, Christian Brauner wrote:
> On Fri, Dec 08, 2023 at 02:49:37PM -0700, Jens Axboe wrote:
>> On 12/8/23 2:14 PM, Christian Brauner wrote:
>>> On Thu, Dec 07, 2023 at 08:11:45PM -0700, Jens Axboe wrote:
>>>> io_uring can currently open/close regular files or fixed/direct
>>>> descriptors. Or you can instantiate a fixed descriptor from a regular
>>>> one, and then close the regular descriptor. But you currently can't turn
>>>> a purely fixed/direct descriptor into a regular file descriptor.
>>>>
>>>> IORING_OP_FIXED_FD_INSTALL adds support for installing a direct
>>>> descriptor into the normal file table, just like receiving a file
>>>> descriptor or opening a new file would do. This is all nicely abstracted
>>>> into receive_fd(), and hence adding support for this is truly trivial.
>>>>
>>>> Since direct descriptors are only usable within io_uring itself, it can
>>>> be useful to turn them into real file descriptors if they ever need to
>>>> be accessed via normal syscalls. This can either be a transitory thing,
>>>> or just a permanent transition for a given direct descriptor.
>>>>
>>>> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
>>>> ---
>>>
>>> Would you mind giving me a Suggested-by?
>>
>> Sure, I can do that. I'll send out a v2, added a few comments and also
>> changed the flags location. But no real changes otherwise.
>>
>>> So this interesting. Usually we allow to receive fds if the task we're
>>> taking a reference to a file from:
>>>
>>> (1) is cooperating: SCM_RIGHTS
>>> (2) is the same task that's taking an addition reference: dup*(),
>>>     seccomp notify addfd
>>>
>>> Both cases ensure that the file->f_count == and fdtable->count == 1
>>> optimizations continue to work correctly in the regular system call
>>> fdget*() routines.
>>>
>>> I guess that this is a variant of (2). Effectively a dup*() from a
>>> private file table into the regular fdtable.
>>
>> Right.
>>
>>> So I just want to make sure we don't break this here because we broke
>>> that on accident before (pidfd_getfd()):
>>>
>>> A fixed file always has file->f_count == 1. The private io_uring fdtable
>>> is:
>>
>> It doesn't necessarily always have f_count of 1. One way to instantiate
>> a direct descriptor is to open the file normally, then register it. If
>> the task doesn't close the normal file descriptor, then it'd have an
>> elevated count of 2. But normally you'd expect the normal file to be
>> closed, or the file opened/instantiated as a direct descriptor upfront.
>> In that case, it should have a ref of 1.
> 
> Yes, of course. I was thinking of the two common cases:
> 
> (1) direct-to-fixed
> (2) open-regular-fd + egister-as-fixed + close-fd
> 
>>
>>> * shared between all io workers?
>>
>> Yep, they are just like threads in that sense and share the file table.
>>
>>> * accessible to all processes that have mapped the io_uring ring?
>>
>> The direct descriptors are just like a file_table, except it's private
>> to the ring itself. If you can invoke io_uring_enter(2) on that ring,
>> then you have access to that direct descriptor table.
>>
>>> And fixed files can only be used from within io_uring itself.
>>
>> Correct, the index only makes sense within a specific ring.
>>
>>> So multiple caller might issue fixed file rquests to different io
>>> workers for concurrent operations. The file->f_count stays at 1 for all
>>> of them. Someone now requests a regular fd for that fixed file.
>>>
>>> So now an fixed-to-fd requests comes in. The file->f_count is bumped > 1
>>> and that fd is installed into the fdtable of the caller through one of
>>> the io worker threads spawned for that caller?
>>
>> No worker thread is involved for this, unless you asked for that by eg
>> setting IOSQE_ASYNC in the sqe before it is submitted. The default would
> 
> But it is possible, that's what I wondered. But that's ok because the io
> worker is just like a regular thread of that process.

It's certainly possible, you are correct. It's just not the default.

>>>> +int io_install_fixed_fd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>>> +{
>>>> +	struct io_fixed_install *ifi;
>>>> +
>>>> +	if (sqe->off || sqe->addr || sqe->len || sqe->buf_index ||
>>>> +	    sqe->splice_fd_in || sqe->addr3)
>>>> +		return -EINVAL;
>>>> +
>>>> +	/* must be a fixed file */
>>>> +	if (!(req->flags & REQ_F_FIXED_FILE))
>>>> +		return -EBADF;
>>>> +
>>>> +	ifi = io_kiocb_to_cmd(req, struct io_fixed_install);
>>>> +
>>>> +	/* really just O_CLOEXEC or not */
>>>> +	ifi->flags = READ_ONCE(sqe->install_fd_flags);
>>>
>>> I'm a big big fan of having all new fd returning apis return their fds
>>> O_CLOEXEC by default and forcing userspace to explicitly turn this off
>>> via fcntl(). pidfds are cloexec by default, so are seccomp notifier fds.
>>
>> io_uring fd itself is also O_CLOEXEC by default. We can certainly make
>> this tweak here, but it is directly configurable by the task that issues
>> the sqe. If you want O_CLOEXEC, then you should set it.
>>
>> That said, not opposed to making this the default. But it does mean I'd
>> have to define a private opcode flag for this, so it can be turned off.
>> At least that seems saner than needing to do fcntl() after the fact.
>> This isn't a huge issue and we can certainly do that. Let me know what
>> you prefer!
> 
> Meh, new opcode would suck. Don't deviate from the standard apis then.

Not a new opcode, it'd just be a flag for that opcode. We default to
O_CLOEXEC is nothing is given, and you can do:

io_uring_prep_fixed_fd_install(sqe, fixed_index, IORING_FIXED_FD_NO_CLOEXEC);

to simply set that flag to turn it off. Only reason I bring it up as a
bit annoying is that it'd be cleaner to have it be part of the O_*
namespace as O_NOCLOEXEC, but it's not a huge deal.

It retains the part you cared about, which is making O_CLOEXEC the
default, but retains the option of turning it off rather than needing to
do an fcntl() to retrieve flags, mask it, then another fcntl().

-- 
Jens Axboe