Re: [PATCH bpf-next v1 4/8] epoll: Implement eBPF iterator for registered items

On 11/15/21 9:42 PM, Kumar Kartikeya Dwivedi wrote:
This patch adds an eBPF iterator for epoll items (epitems) registered
in an epoll instance. It gives access to the eventpoll ctx and the
registered epoll item (struct epitem). This allows the iterator to
inspect the registered file and to use other iterators to associate it
with a task's fdtable.
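
As an illustration (not part of the patch), a minimal sketch of a BPF
program using this iterator could look as follows; the ctx members ep
and epi follow the ctx_arg_info below, while the meta member and the
dump format are assumptions:

#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

SEC("iter/epoll")
int dump_epoll(struct bpf_iter__epoll *ctx)
{
        struct seq_file *seq = ctx->meta->seq;
        struct epitem *epi = ctx->epi;

        /* epi is NULL on the final invocation for this seq_file
         * session (hence PTR_TO_BTF_ID_OR_NULL below). */
        if (!epi)
                return 0;
        BPF_SEQ_PRINTF(seq, "fd %d ino %lu\n", epi->ffd.fd,
                       epi->ffd.file->f_inode->i_ino);
        return 0;
}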

The primary use case this enables is expediting the existing eventpoll
checkpoint/restore support in the CRIU project. This iterator allows us
to switch from a worst-case O(n^2) algorithm to a single O(n) pass over
task and epoll registered descriptors.

We also make sure we're iterating over a live file, one that is not
going away. The case we're concerned about is a file whose f_count is
zero but which is waiting for the iterator's bpf_seq_read to release
ep->mtx so that it can remove its epitem. Since such a file will
disappear once iteration is done, and it is being destructed, we use
get_file_rcu to ensure it is alive when invoking the BPF program.

Getting access to a file that is going to disappear after iteration is
not useful anyway. This does have a performance overhead, however,
since the file reference is raised and dropped for each file.

The rcu_read_lock around get_file_rcu isn't strictly required for
lifetime management, since the fput path is serialized on ep->mtx to
call ep_remove; hence the epi->ffd.file pointer remains stable during
our seq_start/seq_stop bracketing.
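
For illustration, a hedged sketch of the liveness check described above
(the actual seq_show path is elided from the quoted diff, so the helper
name here is made up):

static struct file *epi_get_live_file(struct epitem *epi)
{
        struct file *file = epi->ffd.file;

        rcu_read_lock();
        /* f_count may already be zero if the file is being destructed
         * and only waits on ep->mtx to remove its epitem; skip it. */
        if (!get_file_rcu(file))
                file = NULL;
        rcu_read_unlock();
        return file;    /* caller must fput() a non-NULL return value */
}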

To be able to resume from the position we were iterating at, we store
epi->ffd.fd and use ep_find_tfd to find the target file again. It would
be more appropriate to use both the struct file pointer and the fd
number to find the last file, but see below for why that cannot be
done.

Taking a reference to the struct file and walking the RB-tree to find
it again would lead to a reference cycle if the iterator, after a
partial read, holds a reference to a socket that is later used to
create a descriptor cycle using SCM_RIGHTS. An example encountered
while working on this is given below.

   Let there be Unix sockets SK1, SK2, epoll fd EP, and epoll iterator
   ITER.
   Let SK1 be registered in EP, then on a partial read it is possible
   that ITER returns from read and takes a reference to SK1 to be able
   to find it later in the RB-tree and continue the iteration.  If the
   fd for ITER is then sent to SK2 using SCM_RIGHTS, and the fd for SK2
   is sent to SK1 using SCM_RIGHTS, and neither fd is consumed on the
   corresponding receive end, a cycle is created.  When all of SK1,
   SK2, EP, and ITER are closed, SK1's receive queue holds a reference
   to SK2, SK2's receive queue holds a reference to ITER, and ITER
   holds a reference to SK1.  All file descriptors except EP leak.

To resolve it, we would need to hook into the Unix socket GC mechanism,
but the alternative of using ep_find_tfd is much simpler. Finding the
last position in the face of concurrent modification of the epoll set
is at best an approximation anyway. For the case of CRIU, the epoll set
remains stable.
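
For illustration, resuming from the stored fd as described above could
look roughly like this (the real seq_start logic is elided from the
quoted diff; the helper name is made up, and a toff of 0 picks the
first epitem registered with that fd):

static struct epitem *epoll_iter_resume(struct eventpoll *ep, int prev_fd)
{
        lockdep_assert_held(&ep->mtx);
        return ep_find_tfd(ep, prev_fd, 0);
}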

Cc: Alexander Viro <viro@xxxxxxxxxxxxxxxxxx>
Cc: linux-fsdevel@xxxxxxxxxxxxxxx
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx>
---
  fs/eventpoll.c                 | 196 ++++++++++++++++++++++++++++++++-
  include/linux/bpf.h            |   5 +-
  include/uapi/linux/bpf.h       |   3 +
  tools/include/uapi/linux/bpf.h |   3 +
  4 files changed, 205 insertions(+), 2 deletions(-)

[...]
+
+static const struct bpf_iter_seq_info bpf_epoll_seq_info = {
+	.seq_ops          = &bpf_epoll_seq_ops,
+	.init_seq_private = bpf_epoll_init_seq,
+	.seq_priv_size    = sizeof(struct bpf_epoll_iter_seq_info),
+};
+
+static struct bpf_iter_reg epoll_reg_info = {
+	.target            = "epoll",
+	.feature           = BPF_ITER_RESCHED,
+	.attach_target     = bpf_epoll_iter_attach,
+	.detach_target     = bpf_epoll_iter_detach,

Could you also implement show_fdinfo and fill_link_info for this iterator?

There is also some bpftool work needed to show the information in user
space. An example is commit e60495eafdba ("bpftool: Implement
link_query for bpf iterators").
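
Something along these lines, modeled on the other iterators' callbacks
(only a sketch; aux->ep comes from this patch, and the bpf_link_info
part is hypothetical since no epoll-specific member exists there yet):

static void bpf_iter_epoll_show_fdinfo(const struct bpf_iter_aux_info *aux,
				       struct seq_file *seq)
{
	seq_printf(seq, "epoll_target_ino:\t%lu\n",
		   file_inode(aux->ep)->i_ino);
}

static int bpf_iter_epoll_fill_link_info(const struct bpf_iter_aux_info *aux,
					 struct bpf_link_info *info)
{
	/* Would need a new uapi member, e.g. info->iter.epoll.ino,
	 * similar to info->iter.map.map_id for map iterators. */
	return 0;
}

These would then be wired up via .show_fdinfo and .fill_link_info in
epoll_reg_info.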


+	.ctx_arg_info_size = 2,
+	.ctx_arg_info = {
+		{ offsetof(struct bpf_iter__epoll, ep),
+		  PTR_TO_BTF_ID },
+		{ offsetof(struct bpf_iter__epoll, epi),
+		  PTR_TO_BTF_ID_OR_NULL },
+	},
+	.seq_info	   = &bpf_epoll_seq_info,
+};
+
+static int __init epoll_iter_init(void)
+{
+	epoll_reg_info.ctx_arg_info[0].btf_id = btf_epoll_ids[0];
+	epoll_reg_info.ctx_arg_info[1].btf_id = btf_epoll_ids[1];
+	return bpf_iter_reg_target(&epoll_reg_info);
+}
+late_initcall(epoll_iter_init);
+
+#endif
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index fe7b499da781..eb1c9acdc40b 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1512,7 +1512,10 @@ int bpf_obj_get_user(const char __user *pathname, int flags);
  struct io_ring_ctx;
  struct bpf_iter_aux_info {
  	struct bpf_map *map;
-	struct io_ring_ctx *ctx;
+	union {
+		struct io_ring_ctx *ctx;
+		struct file *ep;
+	};

You changed this to a union here. I think we can change
"struct bpf_iter_aux_info" to "union bpf_iter_aux_info" itself.
This should make the code simpler and easier to understand.
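
I.e., roughly (just a sketch of the suggestion):

union bpf_iter_aux_info {
	struct bpf_map *map;		/* map iterators */
	struct io_ring_ctx *ctx;	/* io_uring iterator */
	struct file *ep;		/* epoll iterator */
};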

  };
typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index b70e9da3d722..64e18c1dcfca 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -94,6 +94,9 @@ union bpf_iter_link_info {
  	struct {
  		__u32   io_uring_fd;
  	} io_uring;
+	struct {
+		__u32   epoll_fd;
+	} epoll;
  };
/* BPF syscall commands, see bpf(2) man-page for more details. */
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index b70e9da3d722..64e18c1dcfca 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -94,6 +94,9 @@ union bpf_iter_link_info {
  	struct {
  		__u32   io_uring_fd;
  	} io_uring;
+	struct {
+		__u32   epoll_fd;
+	} epoll;
  };
/* BPF syscall commands, see bpf(2) man-page for more details. */
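
For reference, user space would then attach to a specific epoll
instance roughly like this (a sketch using the standard libbpf iterator
attach API; the helper name is made up):

#include <bpf/libbpf.h>

static struct bpf_link *attach_epoll_iter(struct bpf_program *prog, int epfd)
{
	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
	union bpf_iter_link_info linfo = {};

	linfo.epoll.epoll_fd = epfd;	/* target epoll instance */
	opts.link_info = &linfo;
	opts.link_info_len = sizeof(linfo);
	return bpf_program__attach_iter(prog, &opts);
}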



