Re: [PATCH 1/3] io_uring: move to using private ring references

Pavel Begunkov <asml.silence@xxxxxxxxx> · Tue, 15 Aug 2023 23:50:40 +0100

On 8/15/23 22:45, Jens Axboe wrote:
On 8/15/23 11:45 AM, Pavel Begunkov wrote:
On 8/11/23 18:12, Jens Axboe wrote:
io_uring currently uses percpu refcounts for the ring reference. This
works fine, but exiting a ring requires an RCU grace period to lapse
and this slows down ring exit quite a lot.

Add a basic per-cpu counter for our references instead, and use that.
This is in preparation for doing a sync wait on any request (notably
file) references on ring exit. As we're going to be waiting on ctx refs
going away as well with that, the RCU grace period wait becomes a
noticeable slowdown.

How does it work?

- What prevents io_ring_ref_maybe_done() from miscalculating and either
1) firing while there are refs or
2) not triggering when we put down all refs?
E.g. percpu_ref relies on atomic counting after switching from
percpu mode.

I'm open to critique of it, do you have any specific worries? The
counters are per-cpu, and whenever the REF_DEAD_BIT is set, we sum on
that drop. We should not be grabbing references post that, and any drop

Well, my worry is concurrent modifications and CPU caches:

CPU0                  |   CPU1
queue tw // task 1    |
                      | close(ring_fd); // task 2
                      | exit_work() -> kill_refs();
execute tw            |
  handle_tw_list()    |
    get_ref()         |

Sounds like this will try to grab a ref after REF_DEAD_BIT is set.
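
To make the ordering in that diagram concrete, here is a minimal
single-threaded userspace sketch. It is a toy model, not the code from
the patch: every model_* name is hypothetical, and it assumes the
request already pins the ring once when the tw is queued. It only
replays the interleaving and counts gets that land after the dead flag
is set.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 2

/* Hypothetical model of a per-CPU ring reference; not the patch's code. */
struct model_ring_ref {
	long count[MODEL_NR_CPUS];	/* per-CPU deltas, may go negative */
	bool dead;			/* stands in for REF_DEAD_BIT */
	int gets_after_dead;		/* gets observed after kill */
};

static void model_ref_init(struct model_ring_ref *r)
{
	r->count[0] = 1;		/* initial ring reference */
	r->count[1] = 0;
	r->dead = false;
	r->gets_after_dead = 0;
}

static void model_ref_get(struct model_ring_ref *r, int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	if (r->dead)
		r->gets_after_dead++;	/* the case that "should not" happen */
	r->count[cpu]++;
}

static void model_ref_kill(struct model_ring_ref *r, int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	r->dead = true;
	r->count[cpu]--;		/* drop the initial reference */
}

/* Replays the diagram and returns the number of gets seen after kill. */
static int model_tw_after_kill(void)
{
	struct model_ring_ref r;

	model_ref_init(&r);
	model_ref_get(&r, 0);	/* CPU0: request pins the ring, tw queued */
	model_ref_kill(&r, 1);	/* CPU1: close(ring_fd) -> exit_work() */
	model_ref_get(&r, 0);	/* CPU0: handle_tw_list() -> get_ref() */
	return r.gets_after_dead;
}
```

In this model model_tw_after_kill() reports one get after the dead
flag, which is the case the "no gets post REF_DEAD_BIT" assumption has
to rule out.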

will just sum the counters.

CPU0 (io-wq)               | CPU1
                           | exit_work() -> kill
io_req_complete_post()     | cancel request
  put_ref()                |   put_ref()

This one seems possible as well. Let's say those 2 refs we're
putting are the last ones. Both CPUs decrement, but each sums the
counters to 1 because of caches => the ring is never freed.

I also think that if we combine these 2 scenarios, we get a
concurrent put and get, which might result in a UAF.
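
A rough userspace model of the missed final put, for illustration only.
It is single-threaded and deterministic: the "pending"/"visible" split
is an assumed stand-in for a store that the other CPU has not observed
yet, and all model_* names are hypothetical. The second function shows
the contrast with one shared atomic counter, which is roughly what
percpu_ref falls back to after leaving percpu mode.

```c
#include <assert.h>
#include <stdatomic.h>

#define MODEL_NR_CPUS 2

/* One per-CPU counter, with a crude model of delayed visibility. */
struct model_cpu {
	long visible;	/* value other CPUs can currently observe */
	long pending;	/* value this CPU has written locally */
};

/*
 * Drop one ref on @cpu, then sum all counters the way that CPU would
 * see them: its own latest value plus the others' visible values.
 */
static long model_put_and_sum(struct model_cpu *cpus, int cpu)
{
	long sum = 0;
	int i;

	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	cpus[cpu].pending -= 1;
	for (i = 0; i < MODEL_NR_CPUS; i++)
		sum += (i == cpu) ? cpus[i].pending : cpus[i].visible;
	return sum;
}

/* Two last puts race; returns how many of them saw a total of zero. */
static int model_racy_final_puts(void)
{
	/* One ref held on each CPU, dead bit assumed to be set already. */
	struct model_cpu cpus[MODEL_NR_CPUS] = { { 1, 1 }, { 1, 1 } };
	int zero_seen = 0;

	if (model_put_and_sum(cpus, 0) == 0)	/* sees 0 + 1 */
		zero_seen++;
	if (model_put_and_sum(cpus, 1) == 0)	/* sees 1 + 0 */
		zero_seen++;
	return zero_seen;
}

/* Same two puts against one shared atomic counter. */
static int model_atomic_final_puts(void)
{
	atomic_long refs;
	int zero_seen = 0;

	atomic_init(&refs, 2);
	if (atomic_fetch_sub(&refs, 1) == 1)	/* previous value was 2 */
		zero_seen++;
	if (atomic_fetch_sub(&refs, 1) == 1)	/* previous value was 1 */
		zero_seen++;
	return zero_seen;
}
```

In the model neither racy put sees zero, so nothing would free the
ring, while the atomic variant has exactly one put observe the final
drop.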

- What contexts can it be used from? Task context only? I'll argue we
want to use it in [soft]irq for the likes of *task_work_add().

We don't manipulate ctx refs from non-task context right now, or from
hard/soft IRQ. On the task_work side, the request already has a
reference to the ctx. Not sure why you'd want to add more. In any case,
I prefer not to deal with hypotheticals, just the code we have now.

which is not enough to protect it, see [1]. Yes, I optimised it
later with [2] (which is a bit ugly and confusing), but it's not
a hypothetical.

[1] commit 9ffa13ff78a0a55df968a72d6f0ebffccee5c9f4
    io_uring: pin context while queueing deferred tw
[2] commit d73a572df24661851465c821d33c03e70e4b68e5
    io_uring: optimize local tw add ctx pinning

--
Pavel Begunkov