Re: [PATCH] zonefs: fix IOCB_NOWAIT handling

Damien Le Moal <Damien.LeMoal@xxxxxxx> · Mon, 24 Feb 2020 13:48:48 +0000

Hi Christoph,

On 2020/02/21 23:37, Christoph Hellwig wrote:
> IOCB_NOWAIT can't just be ignored as it breaks applications expecting
> it not to block.  Just refuse the operation as applications must handle
> that (e.g. by falling back to a thread pool).
> 
> Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system")
> Signed-off-by: Christoph Hellwig <hch@xxxxxx>
> ---
>  fs/zonefs/super.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
> index 8bc6ef82d693..69aee3dfb660 100644
> --- a/fs/zonefs/super.c
> +++ b/fs/zonefs/super.c
> @@ -601,13 +601,13 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
>  	ssize_t ret;
>  
>  	/*
> -	 * For async direct IOs to sequential zone files, ignore IOCB_NOWAIT
> +	 * For async direct IOs to sequential zone files, refuse IOCB_NOWAIT
>  	 * as this can cause write reordering (e.g. the first aio gets EAGAIN
>  	 * on the inode lock but the second goes through but is now unaligned).
>  	 */
> -	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ && !is_sync_kiocb(iocb)
> -	    && (iocb->ki_flags & IOCB_NOWAIT))
> -		iocb->ki_flags &= ~IOCB_NOWAIT;
> +	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ && !is_sync_kiocb(iocb) &&
> +	    (iocb->ki_flags & IOCB_NOWAIT))
> +		return -EOPNOTSUPP;
>  
>  	if (iocb->ki_flags & IOCB_NOWAIT) {
>  		if (!inode_trylock(inode))
> 

The main problem with allowing IOCB_NOWAIT is that for an application that sends
multiple write AIOs with a single io_submit(), all write AIOs after one that is
failed with -EAGAIN will be failed with -EINVAL (not -EIO !) since they will
have an unaligned write offset. I am wondering if that is really a problem at
all since the application can catch the -EAGAIN and -EINVAL, indicating that the
write stream was stopped. So maybe it is reasonable to simply allow IOCB_NOWAIT ?

Dave did not think it is. I was split on this. Would be happy to hear your
opinion. Another solution I was thinking of is something like the hack below,
which may be cleaner and easier for the application to handle as that reduces
the number of errors for a multi-io io_submit() call with RWF_NOWAIT. To do so,
this introduces the IOCB_WRITE_FAIL_FAST kiocb flag that is set in
zonefs_file_dio_write() for a nowait kiocb to a sequential zone file. If this
function fails (with -EAGAIN or any other error), this flag instruct aio_write()
to return the error for the failed kiocb instead of 0. This error return of
aio_write() then stops the io_submit() loop on the first failed iocb.

diff --git a/fs/aio.c b/fs/aio.c
index 5f3d3d814928..9f01541c8ecc 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1568,6 +1568,8 @@ static int aio_write(struct kiocb *req, const struct iocb
*iocb,
 		return ret;
 	ret = rw_verify_area(WRITE, file, &req->ki_pos, iov_iter_count(&iter));
 	if (!ret) {
+		ssize_t err;
+
 		/*
 		 * Open-code file_start_write here to grab freeze protection,
 		 * which will be released by another thread in
@@ -1580,7 +1582,12 @@ static int aio_write(struct kiocb *req, const struct iocb
*iocb,
 			__sb_writers_release(file_inode(file)->i_sb, SB_FREEZE_WRITE);
 		}
 		req->ki_flags |= IOCB_WRITE;
-		aio_rw_done(req, call_write_iter(file, req, &iter));
+		err = call_write_iter(file, req, &iter);
+		if (err != -EIOCBQUEUED) {
+		    aio_rw_done(req, err);
+		    if ((req->ki_flags & IOCB_WRITE_FAIL_FAST) && err < 0)
+			    ret = err;
+		}
 	}
 	kfree(iovec);
 	return ret;
diff --git a/fs/zonefs/super.c b/fs/zonefs/super.c
index 8bc6ef82d693..0fa098354e38 100644
--- a/fs/zonefs/super.c
+++ b/fs/zonefs/super.c
@@ -601,13 +601,14 @@ static ssize_t zonefs_file_dio_write(struct kiocb *iocb,
struct iov_iter *from)
 	ssize_t ret;

 	/*
-	 * For async direct IOs to sequential zone files, ignore IOCB_NOWAIT
-	 * as this can cause write reordering (e.g. the first aio gets EAGAIN
-	 * on the inode lock but the second goes through but is now unaligned).
+	 * For async direct IOs to sequential zone files, IOCB_NOWAIT can cause
+	 * unaligned writes in case of EAGAIN error or any failure to issue the
+	 * direct IO. So tell vfs to stop io_submit() on the first failure to
+	 * avoid more failed aios than necessary.
 	 */
 	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ && !is_sync_kiocb(iocb)
 	    && (iocb->ki_flags & IOCB_NOWAIT))
-		iocb->ki_flags &= ~IOCB_NOWAIT;
+		iocb->ki_flags |= IOCB_WRITE_FAIL_FAST;

 	if (iocb->ki_flags & IOCB_NOWAIT) {
 		if (!inode_trylock(inode))
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3cd4fe6b845e..8e7df1cc92e4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -314,6 +314,7 @@ enum rw_hint {
 #define IOCB_SYNC		(1 << 5)
 #define IOCB_WRITE		(1 << 6)
 #define IOCB_NOWAIT		(1 << 7)
+#define IOCB_WRITE_FAIL_FAST	(1 << 8)

 struct kiocb {
 	struct file		*ki_filp;


Of note is that checking the code path for iomap_direct_io(), I noticed that
there are at least 2 places places where the code can block: dio allocation with
GFP_KERNEL and the BIOs allocations for that dio also with GFP_KERNEL. While the
latter may be necessary to avoid partial failures of large direct IOs that need
to be split into multiple BIOs, shouldn't the dio allocation be done with
GFP_NOWAIT when IOMAP_NOWAIT is set ?

Best regards.


-- 
Damien Le Moal
Western Digital Research