On Tue, Nov 16, 2021 at 11:12:30AM +0530, Kumar Kartikeya Dwivedi wrote:
> This change adds an eBPF iterator for buffers registered in the io_uring
> ctx. It gives access to the ctx, the index of the registered buffer, and
> a pointer to the io_uring_ubuf itself. This allows the iterator to save
> info related to buffers added to an io_uring instance that isn't easy to
> export using the fdinfo interface (like the exact struct pages composing
> the registered buffer).
>
> The primary use case this enables is checkpoint/restore support.
>
> Note that we need to use mutex_trylock when the file is read from, in
> the seq_start functions, as the order of locks taken is the opposite of
> what it would be when an io_uring operation reads the same file. We take
> seq_file->lock, then ctx->uring_lock, while io_uring would first take
> ctx->uring_lock and then seq_file->lock for the same ctx.
>
> This can lead to the deadlock scenario described below:
>
>       CPU 0                                   CPU 1
>
>       vfs_read
>        mutex_lock(&seq_file->lock)            io_read
>                                                 mutex_lock(&ctx->uring_lock)
>        mutex_lock(&ctx->uring_lock) # switched to mutex_trylock
>                                                 mutex_lock(&seq_file->lock)
>
> The trylock also protects the case where io_uring tries to read from an
> iterator attached to itself (same ctx), where the order of locks would
> be:
>
>  io_uring_enter
>   mutex_lock(&ctx->uring_lock) <-----------.
>   io_read                                    \
>    seq_read                                   \
>     mutex_lock(&seq_file->lock)               /
>     mutex_lock(&ctx->uring_lock) # deadlock-`
>
> In both these cases (recursive read and contended uring_lock), -EDEADLK
> is returned to userspace.
>
> In the future, this iterator will be extended to directly support
> iteration of the bvec flexible array member, so that when there is no
> corresponding VMA that maps to the registered buffer (e.g. if the VMA is
> destroyed after pinning the pages), we are able to reconstruct the
> registration on restore by dumping the page contents and then replaying
> them into a temporary mapping used for registration later. All of this
> is out of scope for the current series, but it builds upon this
> iterator.

From the BPF infra perspective these new iterators fit very well, and I
don't see any issues maintaining this interface while the kernel keeps
changing, but this commit log and the shallowness of the selftests make
me question the feasibility of this approach, in particular with
io_uring. Is it even possible to scan all the internal bits of io_uring
and reconstruct them later? The bpf iter is only the read part. Don't
you need the write part for CRIU? Even for reads only... io_uring has
complex inner state, like bpf itself, which cannot realistically be
CRIU-ed.

I don't think we can merge this in pieces. We need to wait until there
is a full working CRIU framework that uses these new iterators.
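
For reference, the trylock pattern described in the quoted commit log
amounts to roughly the following in the iterator's seq_start callback.
This is a minimal sketch, assuming the io_ring_ctx pointer is stashed in
seq->private; the callback name and the exact field accesses are
illustrative and not taken from the actual patch.

	/*
	 * Minimal sketch of the mutex_trylock pattern from the quoted
	 * commit log.  Assumes the struct io_ring_ctx definition is
	 * visible and that the iterator stashed the ctx pointer in
	 * seq->private; the function name is illustrative only.
	 */
	#include <linux/err.h>
	#include <linux/mutex.h>
	#include <linux/seq_file.h>

	static void *io_uring_buf_seq_start(struct seq_file *seq, loff_t *pos)
	{
		struct io_ring_ctx *ctx = seq->private;

		/*
		 * seq_file->lock is already held here.  Taking uring_lock
		 * unconditionally would invert the order io_uring itself
		 * uses (uring_lock first, then seq_file->lock), so only
		 * try to take it and report -EDEADLK on contention or on
		 * a recursive read from the same ctx.
		 */
		if (!mutex_trylock(&ctx->uring_lock))
			return ERR_PTR(-EDEADLK);

		/* No more registered buffers to visit. */
		if (*pos >= ctx->nr_user_bufs)
			return NULL;

		/* Hand the current registered buffer to seq_show/seq_next. */
		return ctx->user_bufs[*pos];
	}

	/*
	 * The matching seq_stop callback (not shown) must drop
	 * ctx->uring_lock, but only when seq_start actually took it.
	 */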