Re: [PATCH] zonefs: Improve error handling

Shinichiro Kawasaki <shinichiro.kawasaki@xxxxxxx> · Thu, 15 Feb 2024 07:56:56 +0000

On Feb 14, 2024 / 15:45, Damien Le Moal wrote:
> Write error handling is racy and can sometime lead to the error recovery
> path wrongly changing the inode size of a sequential zone file to an
> incorrect value  which results in garbage data being readable at the end
> of a file. There are 2 problems:
> 
> 1) zonefs_file_dio_write() updates a zone file write pointer offset
>    after issuing a direct IO with iomap_dio_rw(). This update is done
>    only if the IO succeed for synchronous direct writes. However, for
>    asynchronous direct writes, the update is done without waiting for
>    the IO completion so that the next asynchronous IO can be
>    immediately issued. However, if an asynchronous IO completes with a
>    failure right before the i_truncate_mutex lock protecting the update,
>    the update may change the value of the inode write pointer offset
>    that was corrected by the error path (zonefs_io_error() function).
> 
> 2) zonefs_io_error() is called when a read or write error occurs. This
>    function executes a report zone operation using the callback function
>    zonefs_io_error_cb(), which does all the error recovery handling
>    based on the current zone condition, write pointer position and
>    according to the mount options being used. However, depending on the
>    zoned device being used, a report zone callback may be executed in a
>    context that is different from the context of __zonefs_io_error(). As
>    a result, zonefs_io_error_cb() may be executed without the inode
>    truncate mutex lock held, which can lead to invalid error processing.
> 
> Fix both problems as follows:
> - Problem 1: Perform the inode write pointer offset update before a
>   direct write is issued with iomap_dio_rw(). This is safe to do as
>   partial direct writes are not supported (IOMAP_DIO_PARTIAL is not
>   set) and any failed IO will trigger the execution of zonefs_io_error()
>   which will correct the inode write pointer offset to reflect the
>   current state of the one on the device.
> - Problem 2: Change zonefs_io_error_cb() into zonefs_handle_io_error()
>   and call this function directly from __zonefs_io_error() after
>   obtaining the zone information using blkdev_report_zones() with a
>   simple callback function that copies to a local stack variable the
>   struct blk_zone obtained from the device. This ensures that error
>   handling is performed holding the inode truncate mutex.
>   This change also simplifies error handling for conventional zone files
>   by bypassing the execution of report zones entirely. This is safe to
>   do because the condition of conventional zones cannot be read-only or
>   offline and conventional zone files are always fully mapped with a
>   constant file size.
> 
> Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@xxxxxxx>
> Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system")
> Cc: stable@xxxxxxxxxxxxxxx
> Signed-off-by: Damien Le Moal <dlemoal@xxxxxxxxxx>

I ran my test set applying this patch on top of the kernel v6.8-rc4 and
confirmed it fixes the problems. Also I observed no regression. Good.

Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@xxxxxxx>