On 12/8/23 2:56 PM, Christian Brauner wrote: > On Fri, Dec 08, 2023 at 02:49:37PM -0700, Jens Axboe wrote: >> On 12/8/23 2:14 PM, Christian Brauner wrote: >>> On Thu, Dec 07, 2023 at 08:11:45PM -0700, Jens Axboe wrote: >>>> io_uring can currently open/close regular files or fixed/direct >>>> descriptors. Or you can instantiate a fixed descriptor from a regular >>>> one, and then close the regular descriptor. But you currently can't turn >>>> a purely fixed/direct descriptor into a regular file descriptor. >>>> >>>> IORING_OP_FIXED_FD_INSTALL adds support for installing a direct >>>> descriptor into the normal file table, just like receiving a file >>>> descriptor or opening a new file would do. This is all nicely abstracted >>>> into receive_fd(), and hence adding support for this is truly trivial. >>>> >>>> Since direct descriptors are only usable within io_uring itself, it can >>>> be useful to turn them into real file descriptors if they ever need to >>>> be accessed via normal syscalls. This can either be a transitory thing, >>>> or just a permanent transition for a given direct descriptor. >>>> >>>> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx> >>>> --- >>> >>> Would you mind giving me a Suggested-by? >> >> Sure, I can do that. I'll send out a v2, added a few comments and also >> changed the flags location. But no real changes otherwise. >> >>> So this interesting. Usually we allow to receive fds if the task we're >>> taking a reference to a file from: >>> >>> (1) is cooperating: SCM_RIGHTS >>> (2) is the same task that's taking an addition reference: dup*(), >>> seccomp notify addfd >>> >>> Both cases ensure that the file->f_count == and fdtable->count == 1 >>> optimizations continue to work correctly in the regular system call >>> fdget*() routines. >>> >>> I guess that this is a variant of (2). Effectively a dup*() from a >>> private file table into the regular fdtable. >> >> Right. >> >>> So I just want to make sure we don't break this here because we broke >>> that on accident before (pidfd_getfd()): >>> >>> A fixed file always has file->f_count == 1. The private io_uring fdtable >>> is: >> >> It doesn't necessarily always have f_count of 1. One way to instantiate >> a direct descriptor is to open the file normally, then register it. If >> the task doesn't close the normal file descriptor, then it'd have an >> elevated count of 2. But normally you'd expect the normal file to be >> closed, or the file opened/instantiated as a direct descriptor upfront. >> In that case, it should have a ref of 1. > > Yes, of course. I was thinking of the two common cases: > > (1) direct-to-fixed > (2) open-regular-fd + egister-as-fixed + close-fd > >> >>> * shared between all io workers? >> >> Yep, they are just like threads in that sense and share the file table. >> >>> * accessible to all processes that have mapped the io_uring ring? >> >> The direct descriptors are just like a file_table, except it's private >> to the ring itself. If you can invoke io_uring_enter(2) on that ring, >> then you have access to that direct descriptor table. >> >>> And fixed files can only be used from within io_uring itself. >> >> Correct, the index only makes sense within a specific ring. >> >>> So multiple caller might issue fixed file rquests to different io >>> workers for concurrent operations. The file->f_count stays at 1 for all >>> of them. Someone now requests a regular fd for that fixed file. >>> >>> So now an fixed-to-fd requests comes in. The file->f_count is bumped > 1 >>> and that fd is installed into the fdtable of the caller through one of >>> the io worker threads spawned for that caller? >> >> No worker thread is involved for this, unless you asked for that by eg >> setting IOSQE_ASYNC in the sqe before it is submitted. The default would > > But it is possible, that's what I wondered. But that's ok because the io > worker is just like a regular thread of that process. It's certainly possible, you are correct. It's just not the default. >>>> +int io_install_fixed_fd_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) >>>> +{ >>>> + struct io_fixed_install *ifi; >>>> + >>>> + if (sqe->off || sqe->addr || sqe->len || sqe->buf_index || >>>> + sqe->splice_fd_in || sqe->addr3) >>>> + return -EINVAL; >>>> + >>>> + /* must be a fixed file */ >>>> + if (!(req->flags & REQ_F_FIXED_FILE)) >>>> + return -EBADF; >>>> + >>>> + ifi = io_kiocb_to_cmd(req, struct io_fixed_install); >>>> + >>>> + /* really just O_CLOEXEC or not */ >>>> + ifi->flags = READ_ONCE(sqe->install_fd_flags); >>> >>> I'm a big big fan of having all new fd returning apis return their fds >>> O_CLOEXEC by default and forcing userspace to explicitly turn this off >>> via fcntl(). pidfds are cloexec by default, so are seccomp notifier fds. >> >> io_uring fd itself is also O_CLOEXEC by default. We can certainly make >> this tweak here, but it is directly configurable by the task that issues >> the sqe. If you want O_CLOEXEC, then you should set it. >> >> That said, not opposed to making this the default. But it does mean I'd >> have to define a private opcode flag for this, so it can be turned off. >> At least that seems saner than needing to do fcntl() after the fact. >> This isn't a huge issue and we can certainly do that. Let me know what >> you prefer! > > Meh, new opcode would suck. Don't deviate from the standard apis then. Not a new opcode, it'd just be a flag for that opcode. We default to O_CLOEXEC is nothing is given, and you can do: io_uring_prep_fixed_fd_install(sqe, fixed_index, IORING_FIXED_FD_NO_CLOEXEC); to simply set that flag to turn it off. Only reason I bring it up as a bit annoying is that it'd be cleaner to have it be part of the O_* namespace as O_NOCLOEXEC, but it's not a huge deal. It retains the part you cared about, which is making O_CLOEXEC the default, but retains the option of turning it off rather than needing to do an fcntl() to retrieve flags, mask it, then another fcntl(). -- Jens Axboe