On 2/4/19 7:19 PM, Jens Axboe wrote: > On 2/3/19 7:56 PM, Al Viro wrote: >> On Wed, Jan 30, 2019 at 02:29:05AM +0100, Jann Horn wrote: >>> On Tue, Jan 29, 2019 at 8:27 PM Jens Axboe <axboe@xxxxxxxxx> wrote: >>>> We normally have to fget/fput for each IO we do on a file. Even with >>>> the batching we do, the cost of the atomic inc/dec of the file usage >>>> count adds up. >>>> >>>> This adds IORING_REGISTER_FILES, and IORING_UNREGISTER_FILES opcodes >>>> for the io_uring_register(2) system call. The arguments passed in must >>>> be an array of __s32 holding file descriptors, and nr_args should hold >>>> the number of file descriptors the application wishes to pin for the >>>> duration of the io_uring context (or until IORING_UNREGISTER_FILES is >>>> called). >>>> >>>> When used, the application must set IOSQE_FIXED_FILE in the sqe->flags >>>> member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd >>>> to the index in the array passed in to IORING_REGISTER_FILES. >>>> >>>> Files are automatically unregistered when the io_uring context is >>>> torn down. An application need only unregister if it wishes to >>>> register a new set of fds. >>> >>> Crazy idea: >>> >>> Taking a step back, at a high level, basically this patch creates sort >>> of the same difference that you get when you compare the following >>> scenarios for normal multithreaded I/O in userspace: >> >>> This kinda makes me wonder whether this is really something that >>> should be implemented specifically for the io_uring API, or whether it >>> would make sense to somehow handle part of this in the generic VFS >>> code and give the user the ability to prepare a new files_struct that >>> can then be transferred to the worker thread, or something like >>> that... I'm not sure whether there's a particularly clean way to do >>> that though. >> >> Using files_struct for that opens a can of worms you really don't >> want to touch. >> >> Consider the following scenario with any variant of this interface: >> * create io_uring fd. >> * send an SCM_RIGHTS with that fd to AF_UNIX socket. >> * add the descriptor of that AF_UNIX socket to your fd >> * close AF_UNIX fd, close io_uring fd. >> Voila - you've got a shiny leak. No ->release() is called for >> anyone (and you really don't want to do that on ->flush(), because >> otherwise a library helper doing e.g. system("/bin/date") will tear >> down all the io_uring in your process). The socket is held by >> the reference you've stashed into io_uring (whichever way you do >> that). io_uring is held by the reference you've stashed into >> SCM_RIGHTS datagram in queue of the socket. >> >> No matter what, you need net/unix/garbage.c to be aware of that stuff. >> And getting files_struct lifetime mixed into that would be beyond >> any reason. >> >> The only reason for doing that as a descriptor table would be >> avoiding the cost of fget() in whatever uses it, right? Since > > Right, the only purpose of this patch is to avoid doing fget/fput for > each IO. > >> those are *not* the normal syscalls (and fdget() really should not >> be used anywhere other than the very top of syscall's call chain - >> that's another reason why tossing file_struct around like that >> is insane) and since the benefit is all due to the fact that it's >> *NOT* shared, *NOT* modified in parallel, etc., allowing us to >> treat file references as stable... why the hell use the descriptor >> tables at all? > > This one is not a regular system call, since we don't do fget, then IO, > then fput. We hang on to it. But for the non-registered case, it's very > much just like a regular read/write system call, where we fget to do IO > on it, then fput when we are done. > >> All you need is an array of struct file *, explicitly populated. >> With net/unix/garbage.c aware of such beasts. Guess what? We >> do have such an object already. The one net/unix/garbage.c is >> working with. SCM_RIGHTS datagrams, that is. >> >> IOW, can't we give those io_uring descriptors associated struct >> unix_sock? No socket descriptors, no struct socket (probably), >> just the AF_UNIX-specific part thereof. Then teach >> unix_inflight()/unix_notinflight() about getting unix_sock out >> of these guys (incidentally, both would seem to benefit from >> _not_ touching unix_gc_lock in case when there's no unix_sock >> attached to file we are dealing with - I might be missing >> something very subtle about barriers there, but it doesn't >> look likely). > > That might be workable, though I'm not sure we currently have helpers to > just explicitly create a unix_sock by itself. Not familiar with the > networking bits at all, I'll take a look. > >> And make that (i.e. registering the descriptors) mandatory. > > I don't want to make it mandatory, that's very inflexible for managing > tons of files. The registration is useful for specific cases where we > have high frequency of operations on a set of files. Besides, it'd make > the use of the API cumbersome as well for the basic case of just wanting > to do async IO. > >> Hell, combine that with creating io_uring fd, if we really >> care about the syscall count. Benefits: > > We don't care about syscall count for setup as much. If you're doing > registration of a file set, you're expected to do a LOT of IO to those > files. Hence having an extra one for setup is not a concern. My concern > is just making it mandatory to do registration, I don't think that's a > workable alternative. > >> * no file_struct refcount wanking >> * no fget()/fput() (conditional, at that) from kernel >> threads >> * no CLOEXEC-dependent anything; just the teardown >> on the final fput(), whichever way it comes. >> * no fun with duelling garbage collectors. > > The fget/fput from a kernel thread can be solved by just hanging on to > the struct file * when we punt the IO. Right now we don't, which is a > little silly, that should be changed. > > Getting rid of the files_struct{} is doable. OK, I've reworked the initial parts to wire up the io_uring fd to the AF_UNIX garbage collection. As I made it to the file registration part, I wanted to wire up that too. But I don't think there's a need for that - if we have the io_uring fd appropriately protected, we'll be dropping our struct file ** array index when the io_uring fd is released. That should be adequate, we don't need the garbage collection to be aware of those individually. The only part I had to drop for now is the sq thread polling, as that depends on us carrying the files_struct. I'm going to fold that in shortly, but just make it be dependent on having registered files. That avoids needing to fget/fput for that case, and needing registered files for the sq side submission/polling is not a usability issue like it would be for the "normal" use cases. -- Jens Axboe