We want to move kiocb_start_write() into vfs_iocb_iter_write(), after the permission hook, but leave kiocb_end_write() in the write completion handler of the callers of vfs_iocb_iter_write(). After this change, there will be no way of knowing in completion handler, if write has failed before or after calling kiocb_start_write(). Add a flag IOCB_WRITE_STARTED, which is set and cleared internally by kiocb_{start,end}_write(), so that kiocb_end_write() could be called for cleanup of async write, whether it was successful or whether it failed before or after calling kiocb_start_write(). This flag must not be copied by stacked filesystems (e.g. overlayfs) that clone the iocb to another iocb for io request on a backing file. Link: https://lore.kernel.org/r/CAOQ4uxihfJJRxxUhAmOwtD97Lg8PL8RgXw88rH1UfEeP8AtP+w@xxxxxxxxxxxxxx/ Signed-off-by: Amir Goldstein <amir73il@xxxxxxxxx> --- Jens, The kiocb_{start,end}_write() helpers were originally taken out of an early version of the permission hooks cleanup patches [1]. This early version of the helpers had the IOCB_WRITE_STARTED flag, but when I posted the helpers independently from the cleanup series, you had correctly pointed out [2] that IOCB_WRITE_STARTED is not needed for any of the existing callers of kiocb_{start,end}_write(), so I removed it for the final version of the helpers that got merged. When coming back to the permission hook cleanup, I see that the IOCB_WRITE_STARTED flag is needed to allow the new semantics of calling kiocb_start_write() inside vfs_iocb_iter_write() and kiocb_end_write() in completion callback. I realize these semantics are not great, but the alternative of moving the permission hook from vfs_iocb_iter_write() into all the callers is worse IMO. Can you accept the solution with IOCB_WRITE_STARTED state flag? Have a better idea for cleaner iocb issue semantics that will allow calling the permission hook with freeze protection held? Thanks, Amir. [1] https://lore.kernel.org/linux-fsdevel/20231114153321.1716028-1-amir73il@xxxxxxxxx/ [2] https://lore.kernel.org/linux-fsdevel/e6948836-1d9d-4219-9a21-a2ab442a9a34@xxxxxxxxx/ fs/overlayfs/file.c | 2 +- include/linux/fs.h | 18 ++++++++++++++++-- 2 files changed, 17 insertions(+), 3 deletions(-) diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c index 131621daeb13..e4baa4ea5c95 100644 --- a/fs/overlayfs/file.c +++ b/fs/overlayfs/file.c @@ -455,7 +455,7 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter) aio_req->orig_iocb = iocb; kiocb_clone(&aio_req->iocb, iocb, get_file(real.file)); - aio_req->iocb.ki_flags = ifl; + aio_req->iocb.ki_flags &= ifl; aio_req->iocb.ki_complete = ovl_aio_queue_completion; refcount_set(&aio_req->ref, 2); kiocb_start_write(&aio_req->iocb); diff --git a/include/linux/fs.h b/include/linux/fs.h index 98b7a7a8c42e..64414e146e1e 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -352,6 +352,8 @@ enum rw_hint { * unrelated IO (like cache flushing, new IO generation, etc). */ #define IOCB_DIO_CALLER_COMP (1 << 22) +/* file_start_write() was called */ +#define IOCB_WRITE_STARTED (1 << 23) /* for use in trace events */ #define TRACE_IOCB_STRINGS \ @@ -366,7 +368,8 @@ enum rw_hint { { IOCB_WAITQ, "WAITQ" }, \ { IOCB_NOIO, "NOIO" }, \ { IOCB_ALLOC_CACHE, "ALLOC_CACHE" }, \ - { IOCB_DIO_CALLER_COMP, "CALLER_COMP" } + { IOCB_DIO_CALLER_COMP, "CALLER_COMP" }, \ + { IOCB_WRITE_STARTED, "WRITE_STARTED" } struct kiocb { struct file *ki_filp; @@ -2183,12 +2186,15 @@ static inline void init_sync_kiocb(struct kiocb *kiocb, struct file *filp) }; } +/* IOCB flags associated with ki_filp state must not be cloned */ +#define IOCB_CLONE_FLAGS_MASK (~IOCB_WRITE_STARTED) + static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src, struct file *filp) { *kiocb = (struct kiocb) { .ki_filp = filp, - .ki_flags = kiocb_src->ki_flags, + .ki_flags = kiocb_src->ki_flags & IOCB_CLONE_FLAGS_MASK, .ki_ioprio = kiocb_src->ki_ioprio, .ki_pos = kiocb_src->ki_pos, }; @@ -2744,12 +2750,16 @@ static inline void kiocb_start_write(struct kiocb *iocb) { struct inode *inode = file_inode(iocb->ki_filp); + if (WARN_ON_ONCE(iocb->ki_flags & IOCB_WRITE_STARTED)) + return; + sb_start_write(inode->i_sb); /* * Fool lockdep by telling it the lock got released so that it * doesn't complain about the held lock when we return to userspace. */ __sb_writers_release(inode->i_sb, SB_FREEZE_WRITE); + iocb->ki_flags |= IOCB_WRITE_STARTED; } /** @@ -2762,11 +2772,15 @@ static inline void kiocb_end_write(struct kiocb *iocb) { struct inode *inode = file_inode(iocb->ki_filp); + if (!(iocb->ki_flags & IOCB_WRITE_STARTED)) + return; + /* * Tell lockdep we inherited freeze protection from submission thread. */ __sb_writers_acquired(inode->i_sb, SB_FREEZE_WRITE); sb_end_write(inode->i_sb); + iocb->ki_flags &= ~IOCB_WRITE_STARTED; } /* -- 2.34.1