Re: [GIT PULL] Support for the io_uring IO interface

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 3/8/19 10:41 PM, Linus Torvalds wrote:
> On Fri, Mar 8, 2019 at 8:39 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
>>
>> If all else fails, it's also trivial to yank the poll command support for
>> now, which would kill the io_req refs.
> 
> No, it really wouldn't.
> 
> The poll() code thinks it needs the refs, and the poll code is right -
> but the refs are just badly done and there's a ref leak in a special
> case and just generally ugly code.
> 
> The read/write code thinks it _doesn't_ need the refs, and the
> read-write code is *WRONG*. It actually does need them.
> 
> So what the fs/aio.c and now fs/io_ring.c code *thinks* it can do is:
> 
>  - I'm submitting IO
> 
>  - I'm not touching the iocb after submission
> 
>  - so I'm done with the iocb, and all I need to do is to say that if
> the IO submission succeeded, then the completion of the IO will free
> the iocb.
> 
> That sounds simple, but it is *WRONG*.
> 
> Here's roughly what happens for the write code on the submission side:
> 
>   call_write_iter(file, req, &iter)
>     file->f_op->write_iter(kio, iter);
>       filesystem_file_write_iter()
>         inode_lock(inode);
>         __generic_file_write_iter(iocb, from);
>         inode_unlock(inode);
>         if (ret > 0)
>                 ret = generic_write_sync(iocb, ret);
> 
> and guess what? The IO completion might for example happen *while*
> __generic_file_write_iter() is running, which might be calling
> generic_file_direct_write(), and the IO might just effectively finish
> immediately.
> 
> What happens then? The iocb is released as part of IO completion, and
> now that whole *submission* side that is still accessing the inode,
> still accessing the file, and still even accessing the "iocb" (for
> that "write_sync" case) is all touching something that may not exist
> any more, because it's all been released!
> 
> See what's going on?
> 
> The thing is, the *submission* path absolutely has to keep a reference
> to the iocb until the submission is fully and entirely done. So read
> and write need to have that ref on the iocb too, and one ref is given
> to the actual IO part, while another ref is held by the submission
> side.
> 
> So it's really not "poll does ref handling wrong, nobody else needs it".
> 
> No, poll at least _tries_ to do ref handling, and admittedly gets it
> slightly wrong, but at least it gets a B for _effort_.
> 
> The read/write paths don't even try. They get an F. They think they
> don't need a ref at all, and they are very much mistaken.
> 
> So this is why the io_get_req() function simply should never return
> that "ref 0" io_kiocb. It should always be initialized with a
> ref-count of 2: one for the submission side (which should do a final
> io_put() after it has finished submission, so that the iocb and the
> file and the inode are guaranteed to stay around for the _whole_ IO
> callchain), and one for the actual IO side (which should do it as part
> of io_complete).
> 
> This is why Al's branch has that whole series that starts off with
> that "pin iocb through aio" commit that does that
> 
> -   refcount_set(&req->ki_refcnt, 0);
> +   refcount_set(&req->ki_refcnt, 2);
> 
> in aio_get_req(), and then works at keeping the iocb around until it's
> been fully submitted. And then Al fixes all the small details that
> I've glossed over above because aio_poll() had various ugly cases, and
> cleans things up a bit.

The bit I was missing was sync completion combined with the iocb
being used in the not after submission, but during that part.

I've reworked the references to follow the above proposed scheme, one
for submission and one for completion.

I've also ported Al's race fix for the poll implementation. Will throw
it all into some testing.

Thanks Linus!

-- 
Jens Axboe




[Index of Archives]     [Linux RAID]     [Linux SCSI]     [Linux ATA RAID]     [IDE]     [Linux Wireless]     [Linux Kernel]     [ATH6KL]     [Linux Bluetooth]     [Linux Netdev]     [Kernel Newbies]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Device Mapper]

  Powered by Linux