On Fri, Dec 03, 2021 at 09:22:54PM IST, Pavel Begunkov wrote:
> On 11/19/21 05:24, Alexei Starovoitov wrote:
> > [...]
> >
> > Also I'd like to hear what Jens and Pavel have to say about
> > applicability of CRIU to io_uring in general.

Hi Pavel, thanks for taking a look!

> First, we have no way to know what requests are in flight, without it
> CR doesn't make much sense. The most compelling way for me is to add
> a feature to fail all in-flights as it does when is closed. But maybe,
> you already did solve it somehow?

Indeed, as you note, there is currently no way to inspect in-flight
requests; what we can do is wait for a barrier operation to synchronize
against all previous requests. So for now, my idea is to drain the ring
(i.e. wait for all in-flight requests to complete) by submitting an
IORING_OP_NOP with IOSQE_IO_DRAIN set, and then waiting with a fixed
timeout, so that if forward progress depends on a blocked task (or on
ourselves), we can fail the dump. This is of course best effort, but it
has worked well for many of the cases I have tested so far.

This might have some other issues, e.g. not being able to accommodate
all posted completions in the CQ ring due to unreaped completions from
when it was checkpointed. In that case we can simply give up, since
otherwise recreating the ring as it was becomes very hard if we let it
trample over unread items (it is unclear how I could post completions
at restore time that were sitting in the overflow list at dump time).

One idea I had in mind was to add support for posting a dummy CQE (by
submitting e.g. an IORING_OP_NOP) whose CQE fields are set at
submission time. This allows faking a completed request; at restore we
can then push all of these into the overflow list and project the state
as if the CQ ring had been full. At dump time it allows us to keep
reaping completion items. If we detect that the kernel doesn't support
overflow, we fail.
Adjusting the kernel-side tail is not as hard (we can use IORING_OP_NOP
completions to fill it up, then rewrite the entries). There were other
operations (like registering buffers) that had a similar side effect of
synchronizing the ring state (waiting for it to become idle) before
returning to userspace, but that was pre-5.13. Also, we have to do this
ring synchronization fairly early during the dump, since it can add
other resources (e.g. fds) to the task that then need to be dumped
again.

> There is probably a way to restore registered buffers and files, though
> it may be tough considering that files may not have corresponding fds in
> the userspace, buffers may be unmapped, buffers may come from
> shmem/etc. and other corner cases.

See [0] for some explanation on all that. CRIU also knows whether a
certain VMA comes from shmem or not (its restoration is already handled
separately).

> There are also not covered here pieces of state, SELECT_BUFFER
> buffers, personalities (aka creds), registered eventfd, io-wq
> configuration, etc. I'm assuming you'll be checking them and
> failing CR if any of them is there.

Personalities are not as hard (IIUC), because all the required state is
available through fdinfo. In the PR linked in this thread, there is
code to parse it and restore using the saved credentials (though we
might want to implement UID mapping options, or let the user do image
rewriting for that, which is a separate concern). Ideally I'd like to
be able to grab this state from the iterator as well, but that needs at
least a bpf_xa_for_each helper, since io_uring's show_fdinfo skips some
crucial data when it detects contention on uring_lock (and doesn't
indicate this at all) :(. See the conditional printing on 'has_lock'.

SELECT_BUFFER is indeed unhandled right now.
I'm contemplating ways to extend the iterator so that it can loop over
all items of a generic structure like an Xarray, while taking the locks
relevant to the specific hook. Both personality registration and
IORING_OP_PROVIDE_BUFFERS insertion use an Xarray, so it might make
more sense to add a bpf_xa_for_each helper than to introduce another
iterator, and only mark it as safe in this iterator context (where the
appropriate locks, e.g. ctx->uring_lock, are held).

For the registered eventfd and io-wq, you can look at [0] to see how I
am solving that; TLDR, I just map the underlying structure to the open
fd in the task. eBPF is flexible enough to also allow state inspection
when e.g. the corresponding eventfd has been closed, so that we can
recreate it, register it, and then close it again when restoring. Same
with files directly added to the fixed file set; the whole idea was to
bring in eBPF so that dumping these resources is possible even when
they are "hidden" from normal view.

[0]: https://lore.kernel.org/bpf/20211201042333.2035153-11-memxor@xxxxxxxxx

> And the last point, there will be some stuff CR of which is
> likely to be a bad idea. E.g. registered dmabuf's,
> pre-registered DMA mappings, zerocopy contexts and so on.

Yes, we can just fail the dump in these cases. There are many other
cases (in general) where we simply have to give up.

> IOW, if the first point is solved, there may be a subset of ring
> setups that can probably be CR. That should cover a good amount
> of cases. I don't have a strong opinion on the whole thing,
> I guess it depends on the amount of problems to implement
> in-flight cancellations.
>
> --
> Pavel Begunkov

--
Kartikeya