Re: [PATCH 5/5] btrfs: add io_uring command for encoded reads

On 10/31/24 17:08, Mark Harmstone wrote:
Thanks Pavel.

...
If that's an iovec-backed iter, that might fail; you'd need to
steal this patch

https://lore.kernel.org/all/20241016-fuse-uring-for-6-10-rfc4-v4-12-9739c753666e@xxxxxxx/

and fail if "issue_flags & IO_URING_F_TASK_DEAD", see

https://lore.kernel.org/all/20241016-fuse-uring-for-6-10-rfc4-v4-13-9739c753666e@xxxxxxx/
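
For illustration, the check would be of roughly this shape (a sketch
only; returning -ECANCELED there is an assumption):

    /*
     * Sketch: bail out before importing the iovec if the submitting
     * task is already gone, since the iovec references that task's
     * memory.
     */
    if (issue_flags & IO_URING_F_TASK_DEAD)
        return -ECANCELED;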

Thanks, I've sent a patchset including your patch. Does it make a
difference, though? If the task has died, presumably copy_page_to_iter
can't copy to another process' memory...?

IIRC copy_to_user will crash without mm set; not sure about
copy_page_to_iter(). Regardless, when the original task has died
and the request gets run from io_uring's fallback path, you
shouldn't make any assumptions about the current task.

+            ret = -EFAULT;
+            goto out;
+        }
+
+        i++;
+        cur += bytes;
+        page_offset = 0;
+    }
+    ret = priv->count;
+
+out:
+    unlock_extent(io_tree, priv->start, priv->lockend,
+                  &priv->cached_state);
+    btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);

When called via io_uring_cmd_complete_in_task() this function might
not get run in any reasonable amount of time. Even worse, a
misbehaving user can block it until the task dies.

I don't remember if rwsem allows the unlock to be executed in a
different task than the one that took the lock, but blocking it for
that long could be a problem. I might not remember it right, but I
think Boris mentioned that the O_DIRECT path drops the inode lock
right after submission without waiting for bios to complete. Is that
right? Can we do it here as well?

We can't release the inode lock until we've released the extent lock. I
do intend to look into reducing the amount of time we hold the extent
lock, if we can, but it's not trivial to do this in a safe manner.

I lack the btrfs knowledge, but sounds like it can be done the
same way the async dio path works.

We could perhaps move the unlocking to btrfs_uring_read_extent_endio
instead, but it looks like unlocking an rwsem in a different context
might cause problems with PREEMPT_RT(?).
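
Something of this shape, as a sketch only (the priv field is
invented; whether the rwsem unlock is legal from that context is
exactly the question):

    /*
     * Sketch: do the unlocks from the completion side instead of the
     * submission side.
     */
    static void btrfs_uring_read_extent_endio(struct btrfs_uring_priv *priv)
    {
        struct btrfs_inode *inode = priv->inode;    /* hypothetical field */

        unlock_extent(&inode->io_tree, priv->start, priv->lockend,
                      &priv->cached_state);
        btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
    }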

At least from a quick glance it doesn't seem that the locks in
__clear_extent_bit are [soft]irq protected. It would be a good
idea to give it a run with lockdep enabled.
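
For reference, the usual knobs for such a run (standard lock
debugging options, nothing btrfs-specific):

    CONFIG_PROVE_LOCKING=y
    CONFIG_DEBUG_ATOMIC_SLEEP=y
    CONFIG_DEBUG_SPINLOCK=y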


...
+    ret = btrfs_encoded_read_regular_fill_pages(inode, disk_bytenr,
+                            disk_io_size, pages,
+                            priv);
+    if (ret && ret != -EIOCBQUEUED)
+        goto fail;

Turning both into return -EIOCBQUEUED is a bit suspicious, but
I lack the context to say. It might make sense to return ret and
let the caller handle it.

btrfs_encoded_read_regular_fill_pages returns 0 if the bio completes
before the function can finish, -EIOCBQUEUED otherwise. In either case
the behaviour of the calling function will be the same.

Ok

...
+    if (copy_to_user(sqe_addr + copy_end, (char *)&args + copy_end_kernel,
+             sizeof(args) - copy_end_kernel)) {
+        if (ret == -EIOCBQUEUED) {
+            unlock_extent(io_tree, start, lockend, &cached_state);
+            btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
+        }
+        ret = -EFAULT;
+        goto out_free;

It seems we're saving iov in the priv structure; who can access the
iovec after the request is submitted? -EIOCBQUEUED in general means
that the request has been submitted and will get completed
asynchronously, e.g. via a callback, and if the bio callback can use
the iov, maybe via the iter, this goto will be a use-after-free.

Also, you're returning -EFAULT back to io_uring, which will kill the
io_uring request / cmd while there might still be in-flight bios that
can try to access it.

Can you inject errors into the copy and test please?

The bio hasn't been submitted at this point, that happens in
btrfs_uring_read_extent. So far all we've done is read the metadata from
the page cache. The copy_to_user here is copying the metadata info to
the userspace structure.

I see, in this case it should be fine, but why is it -EIOCBQUEUED
then? It has always meant that the request was queued up and will be
completed asynchronously, and that's where the confusion sprouted
from. Not looking deeper, but it sounds more like -EAGAIN, assuming
it's returned because we can't block.

+    }
+
+    if (ret == -EIOCBQUEUED) {
+        u64 count;
+
+        /*
+         * If we've optimized things by storing the iovecs on the stack,
+         * undo this.
+         */
+        if (!iov) {
+            iov = kmalloc(sizeof(struct iovec) * args.iovcnt,
+                      GFP_NOFS);
+            if (!iov) {
+                unlock_extent(io_tree, start, lockend,
+                          &cached_state);
+                btrfs_inode_unlock(inode, BTRFS_ILOCK_SHARED);
+                ret = -ENOMEM;
+                goto out_acct;
+            }
+
+            memcpy(iov, iovstack,
+                   sizeof(struct iovec) * args.iovcnt);

As an optimisation in the future you can allocate it
together with the btrfs_priv structure.
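
Something along these lines, as a sketch (field names invented;
struct_size() is from <linux/overflow.h>):

    /* Sketch: carry the iovec array in the same allocation as priv. */
    struct btrfs_uring_priv {
        /* ... existing fields ... */
        int iovcnt;
        struct iovec iov[];
    };

    priv = kmalloc(struct_size(priv, iov, args.iovcnt), GFP_NOFS);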

+        }
+
+        count = min_t(u64, iov_iter_count(&iter), disk_io_size);
+
+        /* Match ioctl by not returning past EOF if uncompressed */
+        if (!args.compression)
+            count = min_t(u64, count, args.len);
+
+        ret = btrfs_uring_read_extent(&kiocb, &iter, start, lockend,
+                          cached_state, disk_bytenr,
+                          disk_io_size, count,
+                          args.compression, iov, cmd);

So that's the only spot where asynchronous code branches off
in this function? Do I read you right?

+
+        goto out_acct;
+    }
+
+out_free:
+    kfree(iov);
+
+out_acct:
+    if (ret > 0)
+        add_rchar(current, ret);
+    inc_syscr(current);
+
+    return ret;
+}

--
Pavel Begunkov



