On Fri, Nov 19, 2021 at 03:32:26AM IST, Alexei Starovoitov wrote:
> On Tue, Nov 16, 2021 at 11:12:30AM +0530, Kumar Kartikeya Dwivedi wrote:
> > This change adds an eBPF iterator for buffers registered in an io_uring
> > ctx. It gives access to the ctx, the index of the registered buffer, and
> > a pointer to the io_uring_ubuf itself. This allows the iterator to save
> > info related to buffers added to an io_uring instance that isn't easy
> > to export using the fdinfo interface (like the exact struct pages
> > composing the registered buffer).
> >
> > The primary use case this enables is checkpoint/restore support.
> >
> > Note that we need to use mutex_trylock when the file is read from, in
> > the seq_start functions, as the order of locks taken is the opposite of
> > what it would be when an io_uring operation reads the same file. We
> > take seq_file->lock, then ctx->uring_lock, while io_uring would first
> > take ctx->uring_lock and then seq_file->lock for the same ctx.
> >
> > This can lead to a deadlock scenario described below:
> >
> >       CPU 0                             CPU 1
> >
> >       vfs_read
> >       mutex_lock(&seq_file->lock)       io_read
> >                                         mutex_lock(&ctx->uring_lock)
> >       mutex_lock(&ctx->uring_lock) # switched to mutex_trylock
> >                                         mutex_lock(&seq_file->lock)
> >
> > The trylock also protects the case where io_uring tries to read from an
> > iterator attached to itself (same ctx), where the order of locks would
> > be:
> >
> >   io_uring_enter
> >     mutex_lock(&ctx->uring_lock)    <-----------.
> >     io_read                                      \
> >       seq_read                                    \
> >         mutex_lock(&seq_file->lock)               /
> >         mutex_lock(&ctx->uring_lock) # deadlock--`
> >
> > In both these cases (recursive read and contended uring_lock), -EDEADLK
> > is returned to userspace.
> >
> > In the future, this iterator will be extended to directly support
> > iteration of the bvec Flexible Array Member, so that when there is no
> > corresponding VMA that maps to the registered buffer (e.g. if the VMA
> > is destroyed after pinning pages), we are able to reconstruct the
> > registration on restore by dumping the page contents and then replaying
> > them into a temporary mapping used for registration later. All this is
> > out of scope for the current series, but builds upon this iterator.
>
> From the BPF infra perspective these new iterators fit very well and
> I don't see any issues maintaining this interface while the kernel keeps
> changing, but this commit log and the shallowness of the selftests
> make me question the feasibility of this approach, in particular with
> io_uring. Is it even possible to scan all internal bits of io_uring and
> reconstruct it later? The bpf iter is only the read part. Don't you need
> the write part for CRIU? Even for reads only... io_uring has complex
> inner state.

Yes, the inner state is complex and often entangled with other task
attributes, like credentials, mm, file descriptors, and other io_uring fds
(wq_fd, etc.), but for now these iterators are a good starting point for
implementing the majority of the missing features in CRIU userspace. These
iterators (including the task* ones) and procfs allow us to collect enough
state to correlate various resources and form relationships (e.g. which fd
or buffer of which task(s) is registered, which io_uring was used for wq_fd
registration, which eventfd was registered, etc.). Thanks to access to the
io_ring_ctx, and the iterator being a tracing prog, we can do the usual
pointer access to read some of that data.
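As an illustration, below is a minimal sketch of what such an iterator
program could look like on the dump side. The section name and the
ctx/index/ubuf fields of the context struct follow the commit log above,
but the exact names of bpf_iter__io_uring_buf and the io_mapped_ubuf
fields are assumptions and may differ from the final version:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

SEC("iter/io_uring_buf")
int dump_io_uring_buf(struct bpf_iter__io_uring_buf *it)
{
	struct io_mapped_ubuf *ubuf = it->ubuf;
	struct seq_file *seq = it->meta->seq;

	/* The final call passes a NULL ubuf to signal end of iteration. */
	if (!ubuf)
		return 0;

	/* As a tracing prog we can follow pointers into io_ring_ctx and
	 * io_mapped_ubuf directly; dump enough for userspace to replay
	 * the registration later. */
	BPF_SEQ_PRINTF(seq, "index: %lu addr: 0x%llx len: %llu\n",
		       it->index, ubuf->ubuf, ubuf->ubuf_end - ubuf->ubuf);
	return 0;
}

Pinning the link of such a program gives a file whose reads produce one
line per registered buffer, which the CRIU dump side can then parse.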
> Like bpf itself which cannot be realistically CRIU-ed.
> I don't think we can merge this in pieces. We need to wait until there
> is a full working CRIU framework that uses these new iterators.

I don't intend to add a 'write' framework. The usual process with CRIU is
to gather enough state to reconstruct the kernel resource by repeating
steps similar to those the task might have performed during its lifetime,
i.e. approximating a trace of execution leading to the same task state and
then replaying it on restore. While a 'write' framework with first-class
kernel support for checkpoint/restore would make CRIU's life a lot simpler,
there has usually been a lot of pushback against such interfaces, so the
approach has been to add features that can extract the relevant information
out of the kernel, and to do the rest of the work in userspace.

E.g. if we find that fd 1 and fd 2 are registered in the io_uring, along
with a buffer at 0xabcd of len 128, then at restore time we first restore
fd 1 and fd 2 (which can be anything), and once that is complete, continue
restoration of the io_uring by executing the file registration step for
both. Depending on whether the mapping for 0xabcd existed in the task's
mm, we restore the buffer once the mm of the task has been restored, or
otherwise map a temporary buffer, replay the data into it, register it,
and destroy the mapping. This would be similar to what happened in the
task itself. The correct order of events for the restore is managed by
CRIU itself. It doesn't have to be exactly the same steps, just enough for
the task to not notice a difference or have its assumptions broken, and to
lead to the same end result of task and resource state. (A rough sketch of
this replay is appended after my signature.)

Even the task of determining which fdtable fd 1 and fd 2 belong to is
non-trivial and inefficient right now. You can look at [0] for a similar
case that was solved for epoll, which works, but we can do better (BPF
didn't exist at that time, so it was the best we could do back then).

Also, this work is part of GSoC. There is already code waiting for this to
fill in the missing pieces [1]. If you want me to add a sample/selftest
that demonstrates/tests how this can be used to reconstruct a task's
io_uring, I can certainly do that. We've already spent a few months
contemplating various approaches, and this turned out to be the best and
most powerful one. At one point I had to scrap some of my earlier patches
completely because they couldn't work with descriptorless io_uring.
Iterators seem like the best solution so far that can adapt gracefully to
feature additions in something under development as heavy as io_uring's.

[0]: https://github.com/checkpoint-restore/criu/commit/cfa3f405d522334076fc4d687bd077bee3186ccf#diff-d2cfa5a05213c854d539de003a23a286311ae81431026d3d50b0068c0cb5a852
[1]: https://github.com/checkpoint-restore/criu/pull/1597

--
Kartikeya
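P.S. To make the replay in the example above concrete, here is a rough
sketch of what the restore side could do with the raw io_uring_register
syscall. The fd numbers, address, and length are the hypothetical values
from the example; the actual CRIU code will of course be structured around
its image format and restore ordering:

#include <linux/io_uring.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

static int ring_register(int ring_fd, unsigned int op, void *arg,
			 unsigned int nr)
{
	return syscall(__NR_io_uring_register, ring_fd, op, arg, nr);
}

/* fds 1 and 2 were found registered at dump time; replay the file
 * registration once the fds themselves have been restored. */
static int restore_files(int ring_fd)
{
	int fds[2] = { 1, 2 };

	return ring_register(ring_fd, IORING_REGISTER_FILES, fds, 2);
}

/* For the case where the buffer's VMA no longer exists: map a temporary
 * buffer, fill it with the dumped page contents, register it, and unmap.
 * The pages stay pinned by io_uring, matching the state at dump time. */
static int restore_buffer(int ring_fd, const void *dumped, size_t len)
{
	struct iovec iov = { .iov_len = len };
	int ret;

	iov.iov_base = mmap(NULL, len, PROT_READ | PROT_WRITE,
			    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (iov.iov_base == MAP_FAILED)
		return -1;
	memcpy(iov.iov_base, dumped, len);

	ret = ring_register(ring_fd, IORING_REGISTER_BUFFERS, &iov, 1);
	munmap(iov.iov_base, len);
	return ret;
}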