Hi Christoph, On 2020/02/21 23:37, Christoph Hellwig wrote: > IOCB_NOWAIT can't just be ignored as it breaks applications expecting > it not to block. Just refuse the operation as applications must handle > that (e.g. by falling back to a thread pool). > > Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system") > Signed-off-by: Christoph Hellwig <hch@xxxxxx> > --- > fs/zonefs/super.c | 8 ++++---- > 1 file changed, 4 insertions(+), 4 deletions(-) > > diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c > index 8bc6ef82d693..69aee3dfb660 100644 > --- a/fs/zonefs/super.c > +++ b/fs/zonefs/super.c > @@ -601,13 +601,13 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from) > ssize_t ret; > > /* > - * For async direct IOs to sequential zone files, ignore IOCB_NOWAIT > + * For async direct IOs to sequential zone files, refuse IOCB_NOWAIT > * as this can cause write reordering (e.g. the first aio gets EAGAIN > * on the inode lock but the second goes through but is now unaligned). > */ > - if (zi->i_ztype == ZONEFS_ZTYPE_SEQ && !is_sync_kiocb(iocb) > - && (iocb->ki_flags & IOCB_NOWAIT)) > - iocb->ki_flags &= ~IOCB_NOWAIT; > + if (zi->i_ztype == ZONEFS_ZTYPE_SEQ && !is_sync_kiocb(iocb) && > + (iocb->ki_flags & IOCB_NOWAIT)) > + return -EOPNOTSUPP; > > if (iocb->ki_flags & IOCB_NOWAIT) { > if (!inode_trylock(inode)) > The main problem with allowing IOCB_NOWAIT is that for an application that sends multiple write AIOs with a single io_submit(), all write AIOs after one that is failed with -EAGAIN will be failed with -EINVAL (not -EIO !) since they will have an unaligned write offset. I am wondering if that is really a problem at all since the application can catch the -EAGAIN and -EINVAL, indicating that the write stream was stopped. So maybe it is reasonable to simply allow IOCB_NOWAIT ? Dave did not think it is. I was split on this. Would be happy to hear your opinion. Another solution I was thinking of is something like the hack below, which may be cleaner and easier for the application to handle as that reduces the number of errors for a multi-io io_submit() call with RWF_NOWAIT. To do so, this introduces the IOCB_WRITE_FAIL_FAST kiocb flag that is set in zonefs_file_dio_write() for a nowait kiocb to a sequential zone file. If this function fails (with -EAGAIN or any other error), this flag instruct aio_write() to return the error for the failed kiocb instead of 0. This error return of aio_write() then stops the io_submit() loop on the first failed iocb. diff --git a/fs/aio.c b/fs/aio.c index 5f3d3d814928..9f01541c8ecc 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1568,6 +1568,8 @@ static int aio_write(struct kiocb *req, const struct iocb *iocb, return ret; ret = rw_verify_area(WRITE, file, &req->ki_pos, iov_iter_count(&iter)); if (!ret) { + ssize_t err; + /* * Open-code file_start_write here to grab freeze protection, * which will be released by another thread in @@ -1580,7 +1582,12 @@ static int aio_write(struct kiocb *req, const struct iocb *iocb, __sb_writers_release(file_inode(file)->i_sb, SB_FREEZE_WRITE); } req->ki_flags |= IOCB_WRITE; - aio_rw_done(req, call_write_iter(file, req, &iter)); + err = call_write_iter(file, req, &iter); + if (err != -EIOCBQUEUED) { + aio_rw_done(req, err); + if ((req->ki_flags & IOCB_WRITE_FAIL_FAST) && err < 0) + ret = err; + } } kfree(iovec); return ret; diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c index 8bc6ef82d693..0fa098354e38 100644 --- a/fs/zonefs/super.c +++ b/fs/zonefs/super.c @@ -601,13 +601,14 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from) ssize_t ret; /* - * For async direct IOs to sequential zone files, ignore IOCB_NOWAIT - * as this can cause write reordering (e.g. the first aio gets EAGAIN - * on the inode lock but the second goes through but is now unaligned). + * For async direct IOs to sequential zone files, IOCB_NOWAIT can cause + * unaligned writes in case of EAGAIN error or any failure to issue the + * direct IO. So tell vfs to stop io_submit() on the first failure to + * avoid more failed aios than necessary. */ if (zi->i_ztype == ZONEFS_ZTYPE_SEQ && !is_sync_kiocb(iocb) && (iocb->ki_flags & IOCB_NOWAIT)) - iocb->ki_flags &= ~IOCB_NOWAIT; + iocb->ki_flags |= IOCB_WRITE_FAIL_FAST; if (iocb->ki_flags & IOCB_NOWAIT) { if (!inode_trylock(inode)) diff --git a/include/linux/fs.h b/include/linux/fs.h index 3cd4fe6b845e..8e7df1cc92e4 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -314,6 +314,7 @@ enum rw_hint { #define IOCB_SYNC (1 << 5) #define IOCB_WRITE (1 << 6) #define IOCB_NOWAIT (1 << 7) +#define IOCB_WRITE_FAIL_FAST (1 << 8) struct kiocb { struct file *ki_filp; Of note is that checking the code path for iomap_direct_io(), I noticed that there are at least 2 places places where the code can block: dio allocation with GFP_KERNEL and the BIOs allocations for that dio also with GFP_KERNEL. While the latter may be necessary to avoid partial failures of large direct IOs that need to be split into multiple BIOs, shouldn't the dio allocation be done with GFP_NOWAIT when IOMAP_NOWAIT is set ? Best regards. -- Damien Le Moal Western Digital Research