On Fri, Mar 8, 2019 at 8:39 PM Jens Axboe <axboe@xxxxxxxxx> wrote: > > If all else fails, it's also trivial to yank the poll command support for > now, which would kill the io_req refs. No, it really wouldn't. The poll() code thinks it needs the refs, and the poll code is right - but the refs are just badly done and there's a ref leak in a special case and just generally ugly code. The read/write code thinks it _doesn't_ need the refs, and the read-write code is *WRONG*. It actually does need them. So what the fs/aio.c and now fs/io_ring.c code *thinks* it can do is: - I'm submitting IO - I'm not touching the iocb after submission - so I'm done with the iocb, and all I need to do is to say that if the IO submission succeeded, then the completion of the IO will free the iocb. That sounds simple, but it is *WRONG*. Here's roughly what happens for the write code on the submission side: call_write_iter(file, req, &iter) file->f_op->write_iter(kio, iter); filesystem_file_write_iter() inode_lock(inode); __generic_file_write_iter(iocb, from); inode_unlock(inode); if (ret > 0) ret = generic_write_sync(iocb, ret); and guess what? The IO completion might for example happen *while* __generic_file_write_iter() is running, which might be calling generic_file_direct_write(), and the IO might just effectively finish immediately. What happens then? The iocb is released as part of IO completion, and now that whole *submission* side that is still accessing the inode, still accessing the file, and still even accessing the "iocb" (for that "write_sync" case) is all touching something that may not exist any more, because it's all been released! See what's going on? The thing is, the *submission* path absolutely has to keep a reference to the iocb until the submission is fully and entirely done. So read and write need to have that ref on the iocb too, and one ref is given to the actual IO part, while another ref is held by the submission side. So it's really not "poll does ref handling wrong, nobody else needs it". No, poll at least _tries_ to do ref handling, and admittedly gets it slightly wrong, but at least it gets a B for _effort_. The read/write paths don't even try. They get an F. They think they don't need a ref at all, and they are very much mistaken. So this is why the io_get_req() function simply should never return that "ref 0" io_kiocb. It should always be initialized with a ref-count of 2: one for the submission side (which should do a final io_put() after it has finished submission, so that the iocb and the file and the inode are guaranteed to stay around for the _whole_ IO callchain), and one for the actual IO side (which should do it as part of io_complete). This is why Al's branch has that whole series that starts off with that "pin iocb through aio" commit that does that - refcount_set(&req->ki_refcnt, 0); + refcount_set(&req->ki_refcnt, 2); in aio_get_req(), and then works at keeping the iocb around until it's been fully submitted. And then Al fixes all the small details that I've glossed over above because aio_poll() had various ugly cases, and cleans things up a bit. Linus