Re: [GIT PULL] Support for the io_uring IO interface

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Mar 8, 2019 at 8:39 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
>
> If all else fails, it's also trivial to yank the poll command support for
> now, which would kill the io_req refs.

No, it really wouldn't.

The poll() code thinks it needs the refs, and the poll code is right -
but the refs are just badly done and there's a ref leak in a special
case and just generally ugly code.

The read/write code thinks it _doesn't_ need the refs, and the
read-write code is *WRONG*. It actually does need them.

So what the fs/aio.c and now fs/io_ring.c code *thinks* it can do is:

 - I'm submitting IO

 - I'm not touching the iocb after submission

 - so I'm done with the iocb, and all I need to do is to say that if
the IO submission succeeded, then the completion of the IO will free
the iocb.

That sounds simple, but it is *WRONG*.

Here's roughly what happens for the write code on the submission side:

  call_write_iter(file, req, &iter)
    file->f_op->write_iter(kio, iter);
      filesystem_file_write_iter()
        inode_lock(inode);
        __generic_file_write_iter(iocb, from);
        inode_unlock(inode);
        if (ret > 0)
                ret = generic_write_sync(iocb, ret);

and guess what? The IO completion might for example happen *while*
__generic_file_write_iter() is running, which might be calling
generic_file_direct_write(), and the IO might just effectively finish
immediately.

What happens then? The iocb is released as part of IO completion, and
now that whole *submission* side that is still accessing the inode,
still accessing the file, and still even accessing the "iocb" (for
that "write_sync" case) is all touching something that may not exist
any more, because it's all been released!

See what's going on?

The thing is, the *submission* path absolutely has to keep a reference
to the iocb until the submission is fully and entirely done. So read
and write need to have that ref on the iocb too, and one ref is given
to the actual IO part, while another ref is held by the submission
side.

So it's really not "poll does ref handling wrong, nobody else needs it".

No, poll at least _tries_ to do ref handling, and admittedly gets it
slightly wrong, but at least it gets a B for _effort_.

The read/write paths don't even try. They get an F. They think they
don't need a ref at all, and they are very much mistaken.

So this is why the io_get_req() function simply should never return
that "ref 0" io_kiocb. It should always be initialized with a
ref-count of 2: one for the submission side (which should do a final
io_put() after it has finished submission, so that the iocb and the
file and the inode are guaranteed to stay around for the _whole_ IO
callchain), and one for the actual IO side (which should do it as part
of io_complete).

This is why Al's branch has that whole series that starts off with
that "pin iocb through aio" commit that does that

-   refcount_set(&req->ki_refcnt, 0);
+   refcount_set(&req->ki_refcnt, 2);

in aio_get_req(), and then works at keeping the iocb around until it's
been fully submitted. And then Al fixes all the small details that
I've glossed over above because aio_poll() had various ugly cases, and
cleans things up a bit.

                  Linus



[Index of Archives]     [Linux RAID]     [Linux SCSI]     [Linux ATA RAID]     [IDE]     [Linux Wireless]     [Linux Kernel]     [ATH6KL]     [Linux Bluetooth]     [Linux Netdev]     [Kernel Newbies]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Device Mapper]

  Powered by Linux