On Fri, Dec 03, 2021 at 09:22:54PM IST, Pavel Begunkov wrote:
> On 11/19/21 05:24, Alexei Starovoitov wrote:
> > [...]
> >
> > Also I'd like to hear what Jens and Pavel have to say about
> > applicability of CRIU to io_uring in general.

Hi Pavel, thanks for taking a look!

> First, we have no way to know what requests are in flight, without it
> CR doesn't make much sense. The most compelling way for me is to add
> a feature to fail all in-flights as it does when is closed. But maybe,
> you already did solve it somehow?

Indeed, as you note, there is currently no way to inspect in-flight
requests; what we can do is wait for a barrier operation to synchronize
against all previous requests. So for now, my idea is to drain the ring
(i.e. wait for all in-flight requests to complete) by submitting an
IORING_OP_NOP with IOSQE_IO_DRAIN set, and then waiting with a fixed
timeout, so that if forward progress depends on a blocked task (or on
ourselves), we can fail the dump. This is of course best effort, but it
has worked well for many of the cases I have tested so far.

This might have some other issues, e.g. not being able to accommodate
all posted completions in the CQ ring due to unreaped completions from
when it was checkpointed. In that case we can simply give up, since
otherwise recreating the ring as it was becomes very hard if we let it
trample over unread items (it is unclear how I could post completions
at restore time that were sitting in the overflow list at dump time).

One idea I had in mind was to add support for posting a dummy CQE (by
submitting e.g. an IORING_OP_NOP) whose CQE fields are set at
submission time. This allows faking a completed request; at restore we
can then push all of these into the overflow list and project the state
as if the CQ ring had been full. At dump time it allows us to keep
reaping completion items. If we detect that the kernel doesn't support
overflow, we fail.
Adjusting the kernel-side tail is not as hard (we can use IORING_OP_NOP
completions to fill it up, then rewrite the entries). There were other
operations (like registering buffers) that had a similar side effect of
synchronizing the ring state (waiting for it to become idle) before
returning to userspace, but that was pre-5.13. Also, we have to do this
ring synchronization fairly early during the dump, since it can add
other resources (e.g. fds) to the task that then need to be dumped
again.

> There is probably a way to restore registered buffers and files, though
> it may be tough considering that files may not have corresponding fds in
> the userspace, buffers may be unmapped, buffers may come from
> shmem/etc. and other corner cases.

See [0] for some explanation on all that. CRIU also knows whether a
certain VMA comes from shmem or not (its restoration is already handled
separately).

> There are also not covered here pieces of state, SELECT_BUFFER
> buffers, personalities (aka creds), registered eventfd, io-wq
> configuration, etc. I'm assuming you'll be checking them and
> failing CR if any of them is there.

Personalities are not as hard (IIUC), because all the required state is
available through fdinfo. In the PR linked in this thread, there is
code to parse it and restore using the saved credentials (though we
might want to implement UID mapping options, or let the user do image
rewriting for that, which is a separate concern). Ideally I'd like to
be able to grab this state from the iterator as well, but that needs at
least a bpf_xa_for_each helper, since io_uring's show_fdinfo skips some
crucial data when it detects contention on uring_lock (and doesn't
indicate this at all) :(. See the conditional printing on 'has_lock'.

SELECT_BUFFER is indeed unhandled right now.
I'm contemplating ways to extend the iterator so that it can loop over
all items of a generic structure like an Xarray, while taking the locks
relevant to the specific hook. Both personality registration and
IORING_OP_PROVIDE_BUFFERS insertion use an Xarray, so it might make
more sense to add a bpf_xa_for_each helper than to introduce another
iterator, and only mark it as safe in this iterator context (where the
appropriate locks, e.g. ctx->uring_lock, are held).

For the registered eventfd and io-wq, you can look at [0] to see how I
am solving that; TLDR, I just map the underlying structure to the open
fd in the task. eBPF is flexible enough to also allow state inspection
when e.g. the corresponding eventfd has been closed, so that we can
recreate it, register it, and then close it again when restoring. Same
with files directly added to the fixed file set; the whole idea was to
bring in eBPF so that dumping these resources is possible even when
they are "hidden" from normal view.

[0]: https://lore.kernel.org/bpf/20211201042333.2035153-11-memxor@xxxxxxxxx

> And the last point, there will be some stuff CR of which is
> likely to be a bad idea. E.g. registered dmabuf's,
> pre-registered DMA mappings, zerocopy contexts and so on.

Yes, we can just fail the dump in these cases. There are many other
cases (in general) where we simply have to give up.

> IOW, if the first point is solved, there may be a subset of ring
> setups that can probably be CR. That should cover a good amount
> of cases. I don't have a strong opinion on the whole thing,
> I guess it depends on the amount of problems to implement
> in-flight cancellations.
>
> --
> Pavel Begunkov

--
Kartikeya