Re: [PATCH bpf-next v1 1/8] io_uring: Implement eBPF iterator for registered buffers

On 11/18/21 10:28 AM, Kumar Kartikeya Dwivedi wrote:
On Thu, Nov 18, 2021 at 10:51:59PM IST, Yonghong Song wrote:


On 11/15/21 9:42 PM, Kumar Kartikeya Dwivedi wrote:
This change adds eBPF iterator for buffers registered in io_uring ctx.
It gives access to the ctx, the index of the registered buffer, and a
pointer to the io_uring_ubuf itself. This allows the iterator to save
information related to buffers added to an io_uring instance that isn't
easy to export through the fdinfo interface (such as the exact struct
pages composing a registered buffer).

The primary use case this enables is checkpoint/restore support.

Note that we need to use mutex_trylock when the file is read from, in
the seq_start functions, as the order of locks taken is opposite to what
it would be when an io_uring operation reads the same file. We take
seq_file->lock, then ctx->uring_lock, while io_uring would first take
ctx->uring_lock and then seq_file->lock for the same ctx.

This can lead to a deadlock scenario described below:

        CPU 0				CPU 1

        vfs_read
        mutex_lock(&seq_file->lock)	io_read
					mutex_lock(&ctx->uring_lock)
        mutex_lock(&ctx->uring_lock) # switched to mutex_trylock
					mutex_lock(&seq_file->lock)

It is not clear which mutex_lock switched to mutex_trylock.

The one in the vfs_read path, i.e. the ctx->uring_lock acquisition in
io_uring_buf_seq_start.

In the example below, it looks like &ctx->uring_lock. But if this is
the case, we could have a deadlock, right?


Sorry, I will make the commit message clearer in the next version.

The sequence on CPU 0 is for a normal read(2) on the iterator.
For CPU 1, it is an io_uring instance trying to do the same on an
iterator attached to itself.

So CPU 0 does

sys_read
vfs_read
  bpf_seq_read
  mutex_lock(&seq_file->lock)    # A
   io_uring_buf_seq_start
   mutex_lock(&ctx->uring_lock)  # B

and CPU 1 does

io_uring_enter
mutex_lock(&ctx->uring_lock)    # B
  io_read
   bpf_seq_read
   mutex_lock(&seq_file->lock)   # A
   ...

Since the lock order is opposite, this can deadlock. So I switched the
mutex_lock in io_uring_buf_seq_start to a trylock: it returns an error
in this case, seq_file->lock is then released, and CPU 1 can make
progress.
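
To make this concrete, the fix looks roughly like the following sketch
(not the verbatim patch; the private-data struct and the locked flag are
illustrative, while ctx->uring_lock, nr_user_bufs and user_bufs are the
real io_ring_ctx fields):

struct io_uring_buf_iter_priv {
	struct io_ring_ctx *ctx;
	bool locked;	/* did seq_start() take uring_lock? */
};

static void *io_uring_buf_seq_start(struct seq_file *seq, loff_t *pos)
{
	struct io_uring_buf_iter_priv *p = seq->private;

	/* bpf_seq_read() already holds seq_file->lock here; blocking on
	 * uring_lock would invert the uring_lock -> seq_file->lock order
	 * taken on the io_uring side, hence the trylock.
	 */
	if (!mutex_trylock(&p->ctx->uring_lock))
		return ERR_PTR(-EDEADLK);
	p->locked = true;

	if (*pos >= p->ctx->nr_user_bufs)
		return NULL;
	return p->ctx->user_bufs[*pos];
}

static void io_uring_buf_seq_stop(struct seq_file *seq, void *v)
{
	struct io_uring_buf_iter_priv *p = seq->private;

	/* ->stop() runs even when ->start() failed or hit the end, so
	 * only drop the lock if we actually took it.
	 */
	if (p->locked) {
		mutex_unlock(&p->ctx->uring_lock);
		p->locked = false;
	}
}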

The second problem without the trylock is described below (for the same
case of io_uring reading from an iterator attached to itself).

Let me know if I missed something.

Thanks for the explanation. The above diagram is much better.



The trylock also protects the case where io_uring tries to read from an
iterator attached to itself (the same ctx), where the order of locks
would be:
   io_uring_enter
    mutex_lock(&ctx->uring_lock) <-----------.
    io_read				    \
     seq_read				     \
      mutex_lock(&seq_file->lock)		     /
      mutex_lock(&ctx->uring_lock) # deadlock-`

In both these cases (recursive read and contended uring_lock), -EDEADLK
is returned to userspace.
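
In userspace this surfaces as a failed read on the iterator fd, roughly
(a sketch; iter_fd is assumed to come from bpf_iter_create()):

	char buf[4096];
	ssize_t n = read(iter_fd, buf, sizeof(buf));

	if (n < 0 && errno == EDEADLK) {
		/* recursive read or contended uring_lock; retry later */
	}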

In the future, this iterator will be extended to directly support
iteration of the bvec flexible array member, so that when there is no
corresponding VMA mapping the registered buffer (e.g. if the VMA is
destroyed after pinning the pages), we can still reconstruct the
registration on restore by dumping the page contents and then replaying
them into a temporary mapping used for registration later. All of this
is out of scope for the current series, but builds on this iterator.

Cc: Jens Axboe <axboe@xxxxxxxxx>
Cc: Pavel Begunkov <asml.silence@xxxxxxxxx>
Cc: io-uring@xxxxxxxxxxxxxxx
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx>
---
   fs/io_uring.c                  | 179 +++++++++++++++++++++++++++++++++
   include/linux/bpf.h            |   2 +
   include/uapi/linux/bpf.h       |   3 +
   tools/include/uapi/linux/bpf.h |   3 +
   4 files changed, 187 insertions(+)

[...]

[...]
+static struct bpf_iter_reg io_uring_buf_reg_info = {
+	.target            = "io_uring_buf",
+	.feature	   = BPF_ITER_RESCHED,
+	.attach_target     = bpf_io_uring_iter_attach,
+	.detach_target     = bpf_io_uring_iter_detach,

Since you have this extra `io_uring_fd` for the iterator, you may want
to implement show_fdinfo and fill_link_info callback functions here.


Ack, but some questions:

What should it have? E.g. it is easy to go from map_id to a map fd if
one wants access to the map attached to the iterator, but I am not sure
how one can obtain more information about the target fd from the
io_uring or epoll iterators.

Just to be clear, I am talking about the uapi struct bpf_link_info.
I agree that the fd is not really useful. So I guess it is up to you
whether you want to show the fd to the user or not. We can always
add it later if needed.


Should I delegate to their show_fdinfo and dump using that?

The type/target is already available in link_info; I am not sure what
other useful information could be added there that would allow obtaining
the io_uring/epoll fd.
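
For reference, a sketch of what the two callbacks could look like,
modeled on the map element iterator's versions (what to print is a
guess; this series adds no io_uring-specific member to struct
bpf_link_info, so fill_link_info has nothing extra to report):

static void bpf_io_uring_iter_show_fdinfo(const struct bpf_iter_aux_info *aux,
					  struct seq_file *seq)
{
	/* Unlike maps, an io_ring_ctx has no stable user-visible id,
	 * so there is little beyond the target name to print.
	 */
	seq_printf(seq, "target:\tio_uring_buf\n");
}

static int bpf_io_uring_iter_fill_link_info(const struct bpf_iter_aux_info *aux,
					    struct bpf_link_info *info)
{
	/* The generic iter target info is filled by the core; nothing
	 * io_uring-specific to add yet.
	 */
	return 0;
}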

+	.ctx_arg_info_size = 2,
+	.ctx_arg_info = {
+		{ offsetof(struct bpf_iter__io_uring_buf, ctx),
+		  PTR_TO_BTF_ID },
+		{ offsetof(struct bpf_iter__io_uring_buf, ubuf),
+		  PTR_TO_BTF_ID_OR_NULL },
+	},
+	.seq_info	   = &bpf_io_uring_buf_seq_info,
+};
+
+static int __init io_uring_iter_init(void)
+{
+	io_uring_buf_reg_info.ctx_arg_info[0].btf_id = btf_io_uring_ids[0];
+	io_uring_buf_reg_info.ctx_arg_info[1].btf_id = btf_io_uring_ids[1];
+	return bpf_iter_reg_target(&io_uring_buf_reg_info);
+}
+late_initcall(io_uring_iter_init);
+
+#endif
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 56098c866704..ddb9d4520a3f 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1509,8 +1509,10 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
   	extern int bpf_iter_ ## target(args);			\
   	int __init bpf_iter_ ## target(args) { return 0; }
+struct io_ring_ctx;
   struct bpf_iter_aux_info {
   	struct bpf_map *map;
+	struct io_ring_ctx *ctx;
   };

Can we use a union here? Note that in bpf_iter_link_info below in
uapi/linux/bpf.h, map_fd and io_uring_fd are also in a union.


So the reason I didn't use a union was the link->aux.map check in
bpf_iter.c::__get_seq_info. Even if we switch to a union in
bpf_iter_aux_info, it needs some way to determine whether the link is
for a map type, so maybe a string comparison there? Leaving it out of
the union felt cleaner; I also move both io_ring_ctx and the eventpoll
file into a union in a later patch.

I see. The seq_ops for the map element iterator is different from the
others: it comes not from the main iter registration but from map_ops.

I think your change is okay. But maybe add a comment in struct
bpf_iter_aux_info to explain why map is different from the others.
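
Something along these lines, perhaps:

struct bpf_iter_aux_info {
	/* map is deliberately not in a union: the map element iterator
	 * takes its seq_ops from map->ops rather than from the iterator
	 * registration, and bpf_iter.c::__get_seq_info() tests
	 * link->aux.map to detect that case.
	 */
	struct bpf_map *map;
	struct io_ring_ctx *ctx;
};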


   typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 6297eafdc40f..3323defa99a1 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -91,6 +91,9 @@ union bpf_iter_link_info {
   	struct {
   		__u32	map_fd;
   	} map;
+	struct {
+		__u32   io_uring_fd;
+	} io_uring;
   };
   /* BPF syscall commands, see bpf(2) man-page for more details. */
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 6297eafdc40f..3323defa99a1 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -91,6 +91,9 @@ union bpf_iter_link_info {
   	struct {
   		__u32	map_fd;
   	} map;
+	struct {
+		__u32   io_uring_fd;
+	} io_uring;
   };
   /* BPF syscall commands, see bpf(2) man-page for more details. */
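
For completeness, attaching from userspace would look roughly like this
(a sketch; assumes prog_fd is a loaded iter/io_uring_buf program,
ring_fd a live io_uring instance, and error handling is elided):

	union bpf_iter_link_info linfo = {};
	DECLARE_LIBBPF_OPTS(bpf_link_create_opts, opts);
	int link_fd, iter_fd;

	linfo.io_uring.io_uring_fd = ring_fd;
	opts.iter_info = &linfo;
	opts.iter_info_len = sizeof(linfo);

	link_fd = bpf_link_create(prog_fd, 0, BPF_TRACE_ITER, &opts);
	iter_fd = bpf_iter_create(link_fd);
	/* read(iter_fd, ...) now runs the iterator program */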


--
Kartikeya



