Re: [PATCH 3/3] io_uring: add support for getdents

Hao Xu <hao.xu@xxxxxxxxx> · Thu, 13 Jul 2023 12:35:07 +0800

On 7/12/23 23:27, Christian Brauner wrote:
On Tue, Jul 11, 2023 at 07:40:27PM +0800, Hao Xu wrote:
From: Hao Xu <howeyxu@xxxxxxxxxxx>

This add support for getdents64 to io_uring, acting exactly like the
syscall: the directory is iterated from it's current's position as
stored in the file struct, and the file's position is updated exactly as
if getdents64 had been called.

For filesystems that support NOWAIT in iterate_shared(), try to use it
first; if a user already knows the filesystem they use do not support
nowait they can force async through IOSQE_ASYNC in the sqe flags,
avoiding the need to bounce back through a useless EAGAIN return.

Co-developed-by: Dominique Martinet <asmadeus@xxxxxxxxxxxxx>
Signed-off-by: Dominique Martinet <asmadeus@xxxxxxxxxxxxx>
Signed-off-by: Hao Xu <howeyxu@xxxxxxxxxxx>
---
  include/uapi/linux/io_uring.h |  7 ++++
  io_uring/fs.c                 | 60 +++++++++++++++++++++++++++++++++++
  io_uring/fs.h                 |  3 ++
  io_uring/opdef.c              |  8 +++++
  4 files changed, 78 insertions(+)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 08720c7bd92f..6c0d521135a6 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -65,6 +65,7 @@ struct io_uring_sqe {
  		__u32		xattr_flags;
  		__u32		msg_ring_flags;
  		__u32		uring_cmd_flags;
+		__u32		getdents_flags;
  	};
  	__u64	user_data;	/* data to be passed back at completion time */
  	/* pack this to avoid bogus arm OABI complaints */
@@ -235,6 +236,7 @@ enum io_uring_op {
  	IORING_OP_URING_CMD,
  	IORING_OP_SEND_ZC,
  	IORING_OP_SENDMSG_ZC,
+	IORING_OP_GETDENTS,
  
  	/* this goes last, obviously */
  	IORING_OP_LAST,
@@ -273,6 +275,11 @@ enum io_uring_op {
   */
  #define SPLICE_F_FD_IN_FIXED	(1U << 31) /* the last bit of __u32 */
  
+/*
+ * sqe->getdents_flags
+ */
+#define IORING_GETDENTS_REWIND	(1U << 0)
+
  /*
   * POLL_ADD flags. Note that since sqe->poll_events is the flag space, the
   * command flags for POLL_ADD are stored in sqe->len.
diff --git a/io_uring/fs.c b/io_uring/fs.c
index f6a69a549fd4..77f00577e09c 100644
--- a/io_uring/fs.c
+++ b/io_uring/fs.c
@@ -47,6 +47,13 @@ struct io_link {
  	int				flags;
  };
  
+struct io_getdents {
+	struct file			*file;
+	struct linux_dirent64 __user	*dirent;
+	unsigned int			count;
+	int				flags;
+};
+
  int io_renameat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
  {
  	struct io_rename *ren = io_kiocb_to_cmd(req, struct io_rename);
@@ -291,3 +298,56 @@ void io_link_cleanup(struct io_kiocb *req)
  	putname(sl->oldpath);
  	putname(sl->newpath);
  }
+
+int io_getdents_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
+
+	if (READ_ONCE(sqe->off) != 0)
+		return -EINVAL;
+
+	gd->dirent = u64_to_user_ptr(READ_ONCE(sqe->addr));
+	gd->count = READ_ONCE(sqe->len);
+
+	return 0;
+}
+
+int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
+	struct file *file;
+	unsigned long getdents_flags = 0;
+	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
+	bool should_lock = false;
+	int ret;
+
+	if (force_nonblock) {
+		if (!(req->file->f_mode & FMODE_NOWAIT))
+			return -EAGAIN;
+
+		getdents_flags = DIR_CONTEXT_F_NOWAIT;

I mentioned this on the other patch but it seems really pointless to
have that extra flag. I would really like to hear a good reason for
this.

+	}
+
+	file = req->file;
+	if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
+		if (file_count(file) > 1)

Assume we have a regular non-threaded process that just opens an fd to a
file. The process registers an async readdir request via that fd for the
file with io_uring and goes to do other stuff while waiting for the
result.

Some time later, io_uring gets to io_getdents() and the task is still
single threaded and the file hasn't been shared in the meantime. So
io_getdents() doesn't take the lock and starts the readdir() call.

Concurrently, the process that registered the io_uring request was free
to do other stuff and issued a synchronous readdir() system call which
calls fdget_pos(). Since the fdtable still isn't shared it doesn't
increment f_count and doesn't acquire the mutex. Now there's another
concurrent readdir() going on.

(Similar thing can happen if the process creates a thread for example.)

Two readdir() requests now proceed concurrently which is not intended.
Now to verify that this race can't happen with io_uring:

* regular fds:
   It seems that io_uring calls fget() on each regular file descriptor
   when an async request is registered. So that means that io_uring
   always hold its own explicit reference here.
   So as long as the original task is alive or another thread is alive
   f_count is guaranteed to be > 1 and so the mutex would always be
   acquired.

   If the registering process dies right before io_uring gets to the
   io_getdents() request no other process can steal the fd anymore and in
   that case the readdir call would not lock. But that's fine.

* fixed fds:
   I don't know the reference counting rules here. io_uring would need to
   ensure that it's impossible for two async readdir requests via a fixed
   fd to race because f_count is == 1.

   Iiuc, if a process registers a file it opened as a fixed file and
   immediately closes the fd afterwards - without anyone else holding a
   reference to that file - and only uses the fixed fd going forward, the
   f_count of that file in io_uring's fixed file table is always 1.

   So one could issue any number of concurrent readdir requests with no
   mutual exclusion. So for fixed files there definitely is a race, no?

Hi Christian,
The ref logic for fixed file is that it does fdget() when registering
the file, and fdput() when unregistering it. So the ref in between is
always > 1. The fixed file feature is to reduce frequent fdget/fdput,
but it does call them at the register/unregister time.



All of that could ofc be simplified if we could just always acquire the
mutex in fdget_pos() and other places and drop that file_count(file) > 1
optimization everywhere. But I have no idea if the optimization for not
acquiring the mutex if f_count == 1 is worth it?

I hope I didn't confuse myself here.

Jens, do yo have any input here?

+			should_lock = true;
+	}
+	if (should_lock) {
+		if (!force_nonblock)
+			mutex_lock(&file->f_pos_lock);
+		else if (!mutex_trylock(&file->f_pos_lock))
+			return -EAGAIN;
+	}

Open-coding this seems extremely brittle with an invitation for subtle
bugs.

Could you elaborate on this, I'm not sure that I understand it quite
well. Sorry for my poor English.

Thanks,
Hao


+
+	ret = vfs_getdents(file, gd->dirent, gd->count, getdents_flags);
+	if (should_lock)
+		mutex_unlock(&file->f_pos_lock);
+
+	if (ret == -EAGAIN && force_nonblock)
+		return -EAGAIN;
+
+	io_req_set_res(req, ret, 0);
+	return 0;
+}
+
diff --git a/io_uring/fs.h b/io_uring/fs.h
index 0bb5efe3d6bb..f83a6f3a678d 100644
--- a/io_uring/fs.h
+++ b/io_uring/fs.h
@@ -18,3 +18,6 @@ int io_symlinkat(struct io_kiocb *req, unsigned int issue_flags);
  int io_linkat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
  int io_linkat(struct io_kiocb *req, unsigned int issue_flags);
  void io_link_cleanup(struct io_kiocb *req);
+
+int io_getdents_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe);
+int io_getdents(struct io_kiocb *req, unsigned int issue_flags);
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index 3b9c6489b8b6..1bae6b2a8d0b 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -428,6 +428,11 @@ const struct io_issue_def io_issue_defs[] = {
  		.prep			= io_eopnotsupp_prep,
  #endif
  	},
+	[IORING_OP_GETDENTS] = {
+		.needs_file		= 1,
+		.prep			= io_getdents_prep,
+		.issue			= io_getdents,
+	},
  };
  
  
@@ -648,6 +653,9 @@ const struct io_cold_def io_cold_defs[] = {
  		.fail			= io_sendrecv_fail,
  #endif
  	},
+	[IORING_OP_GETDENTS] = {
+		.name			= "GETDENTS",
+	},
  };
  
  const char *io_uring_get_opcode(u8 opcode)
--
2.25.1