On Feb 14, 2024 / 15:45, Damien Le Moal wrote: > Write error handling is racy and can sometime lead to the error recovery > path wrongly changing the inode size of a sequential zone file to an > incorrect value which results in garbage data being readable at the end > of a file. There are 2 problems: > > 1) zonefs_file_dio_write() updates a zone file write pointer offset > after issuing a direct IO with iomap_dio_rw(). This update is done > only if the IO succeed for synchronous direct writes. However, for > asynchronous direct writes, the update is done without waiting for > the IO completion so that the next asynchronous IO can be > immediately issued. However, if an asynchronous IO completes with a > failure right before the i_truncate_mutex lock protecting the update, > the update may change the value of the inode write pointer offset > that was corrected by the error path (zonefs_io_error() function). > > 2) zonefs_io_error() is called when a read or write error occurs. This > function executes a report zone operation using the callback function > zonefs_io_error_cb(), which does all the error recovery handling > based on the current zone condition, write pointer position and > according to the mount options being used. However, depending on the > zoned device being used, a report zone callback may be executed in a > context that is different from the context of __zonefs_io_error(). As > a result, zonefs_io_error_cb() may be executed without the inode > truncate mutex lock held, which can lead to invalid error processing. > > Fix both problems as follows: > - Problem 1: Perform the inode write pointer offset update before a > direct write is issued with iomap_dio_rw(). This is safe to do as > partial direct writes are not supported (IOMAP_DIO_PARTIAL is not > set) and any failed IO will trigger the execution of zonefs_io_error() > which will correct the inode write pointer offset to reflect the > current state of the one on the device. > - Problem 2: Change zonefs_io_error_cb() into zonefs_handle_io_error() > and call this function directly from __zonefs_io_error() after > obtaining the zone information using blkdev_report_zones() with a > simple callback function that copies to a local stack variable the > struct blk_zone obtained from the device. This ensures that error > handling is performed holding the inode truncate mutex. > This change also simplifies error handling for conventional zone files > by bypassing the execution of report zones entirely. This is safe to > do because the condition of conventional zones cannot be read-only or > offline and conventional zone files are always fully mapped with a > constant file size. > > Reported-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@xxxxxxx> > Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system") > Cc: stable@xxxxxxxxxxxxxxx > Signed-off-by: Damien Le Moal <dlemoal@xxxxxxxxxx> I ran my test set applying this patch on top of the kernel v6.8-rc4 and confirmed it fixes the problems. Also I observed no regression. Good. Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@xxxxxxx>