Re: [PATCH 3/5] io_uring: add support for getdents

Hao Xu <hao.xu@xxxxxxxxx> · Thu, 27 Jul 2023 19:51:19 +0800

On 7/26/23 23:00, Christian Brauner wrote:
On Tue, Jul 18, 2023 at 09:21:10PM +0800, Hao Xu wrote:
From: Hao Xu <howeyxu@xxxxxxxxxxx>

This add support for getdents64 to io_uring, acting exactly like the
syscall: the directory is iterated from it's current's position as
stored in the file struct, and the file's position is updated exactly as
if getdents64 had been called.

For filesystems that support NOWAIT in iterate_shared(), try to use it
first; if a user already knows the filesystem they use do not support
nowait they can force async through IOSQE_ASYNC in the sqe flags,
avoiding the need to bounce back through a useless EAGAIN return.

Co-developed-by: Dominique Martinet <asmadeus@xxxxxxxxxxxxx>
Signed-off-by: Dominique Martinet <asmadeus@xxxxxxxxxxxxx>
Signed-off-by: Hao Xu <howeyxu@xxxxxxxxxxx>
---
  include/uapi/linux/io_uring.h |  7 +++++
  io_uring/fs.c                 | 55 +++++++++++++++++++++++++++++++++++
  io_uring/fs.h                 |  3 ++
  io_uring/opdef.c              |  8 +++++
  4 files changed, 73 insertions(+)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 36f9c73082de..b200b2600622 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -65,6 +65,7 @@ struct io_uring_sqe {
  		__u32		xattr_flags;
  		__u32		msg_ring_flags;
  		__u32		uring_cmd_flags;
+		__u32		getdents_flags;
  	};
  	__u64	user_data;	/* data to be passed back at completion time */
  	/* pack this to avoid bogus arm OABI complaints */
@@ -235,6 +236,7 @@ enum io_uring_op {
  	IORING_OP_URING_CMD,
  	IORING_OP_SEND_ZC,
  	IORING_OP_SENDMSG_ZC,
+	IORING_OP_GETDENTS,
  
  	/* this goes last, obviously */
  	IORING_OP_LAST,
@@ -273,6 +275,11 @@ enum io_uring_op {
   */
  #define SPLICE_F_FD_IN_FIXED	(1U << 31) /* the last bit of __u32 */
  
+/*
+ * sqe->getdents_flags
+ */
+#define IORING_GETDENTS_REWIND	(1U << 0)
+
  /*
   * POLL_ADD flags. Note that since sqe->poll_events is the flag space, the
   * command flags for POLL_ADD are stored in sqe->len.
diff --git a/io_uring/fs.c b/io_uring/fs.c
index f6a69a549fd4..480f25677fed 100644
--- a/io_uring/fs.c
+++ b/io_uring/fs.c
@@ -47,6 +47,13 @@ struct io_link {
  	int				flags;
  };
  
+struct io_getdents {
+	struct file			*file;
+	struct linux_dirent64 __user	*dirent;
+	unsigned int			count;
+	int				flags;
+};
+
  int io_renameat_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
  {
  	struct io_rename *ren = io_kiocb_to_cmd(req, struct io_rename);
@@ -291,3 +298,51 @@ void io_link_cleanup(struct io_kiocb *req)
  	putname(sl->oldpath);
  	putname(sl->newpath);
  }
+
+int io_getdents_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
+{
+	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
+
+	if (READ_ONCE(sqe->off) != 0)
+		return -EINVAL;
+
+	gd->dirent = u64_to_user_ptr(READ_ONCE(sqe->addr));
+	gd->count = READ_ONCE(sqe->len);
+
+	return 0;
+}
+
+int io_getdents(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_getdents *gd = io_kiocb_to_cmd(req, struct io_getdents);
+	struct file *file = req->file;
+	unsigned long getdents_flags = 0;
+	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;

Hm, I'm not sure what exactly the rules are for IO_URING_F_NONBLOCK.
But to point this out:

vfs_getdents()
-> iterate_dir()
    {
         if (shared)
                 res = down_read_killable(&inode->i_rwsem);
         else
                 res = down_write_killable(&inode->i_rwsem);
    }

which means you can still end up sleeping here before you go into a
filesystem that does actually support non-waiting getdents. So if you
have concurrent operations that grab inode lock (touch, mkdir etc) you
can end up sleeping here.

Is that intentional or an oversight? If the former can someone please
explain the rules and why it's fine in this case?

I actually saw this semaphore, and there is another xfs lock in
file_accessed
  --> touch_atime
    --> inode_update_time
      --> inode->i_op->update_time == xfs_vn_update_time

Forgot to point them out in the cover-letter..., I didn't modify them
since I'm not very sure about if we should do so, and I saw Stefan's
patchset didn't modify them too.

My personnal thinking is we should apply trylock logic for this
inode->i_rwsem. For xfs lock in touch_atime, we should do that since it
doesn't make sense to rollback all the stuff while we are almost at the
end of getdents because of a lock.



+	bool should_lock = file->f_mode & FMODE_ATOMIC_POS;
+	int ret;
+
+	if (force_nonblock) {
+		if (!(req->file->f_mode & FMODE_NOWAIT))
+			return -EAGAIN;
+
+		getdents_flags = DIR_CONTEXT_F_NOWAIT;
+	}
+
+	if (should_lock) {
+		if (!force_nonblock)
+			mutex_lock(&file->f_pos_lock);
+		else if (!mutex_trylock(&file->f_pos_lock))
+			return -EAGAIN;
+	}

That now looks like it works.

+
+	ret = vfs_getdents(file, gd->dirent, gd->count, getdents_flags);
+	if (should_lock)
+		mutex_unlock(&file->f_pos_lock);