On 8/20/23 7:26 AM, Jens Axboe wrote:
> On 8/19/23 6:22 PM, Qu Wenruo wrote:
>>
>>
>> On 2023/8/20 07:59, Qu Wenruo wrote:
>>> Hi Jens
>>>
>>> I tried more on my side to debug the situation, and found a very weird
>>> write behavior:
>>>
>>> Some unexpected direct IO happened, without any corresponding fsstress
>>> operation.
>>>
>>> The workload is:
>>>
>>> $fsstress -p 7 -n 50 -s 1691396493 -w -d $mnt -v > /tmp/fsstress
>>>
>>> With it I can reliably reproduce the problem locally, with around a 1/50
>>> probability.
>>> In my particular case, it results in data corruption at root 5 inode 283
>>> offset 8192.
>>>
>>> Then I added some trace points for the following functions:
>>>
>>> - btrfs_do_write_iter()
>>>   Two trace points, one before btrfs_direct_write() and one before
>>>   btrfs_buffered_write(), outputting the aligned and unaligned write
>>>   range, root/inode number, and type of the write (buffered or direct).
>>>
>>> - btrfs_finish_one_ordered()
>>>   This is where btrfs inserts its ordered extent into the subvolume
>>>   tree.
>>>   This happens when a range of pages finishes its writeback.
>>>
>>> Here is the fsstress log for inode 283 (no btrfs root number):
>>>
>>> 0/22: clonerange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0] [307200,0]
>>> 0/23: copyrange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0] [1058819,0]
>>> 0/25: write d0/f2[283 2 0 0 0 0] [393644,88327] 0
>>> 0/29: fallocate(INSERT_RANGE) d0/f3 [283 2 0 0 176 481971]t 884736 585728 95
>>> 0/30: uring_write d0/f3[283 2 0 0 176 481971] [1400622, 56456(res=56456)] 0
>>> 0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0
>>> 0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[283 2 308134 1763236 320 1457078] return 25, fallback to stat()
>>> 0/34: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 320 1457078] return 25, fallback to stat()
>>> 0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0
>>> 0/38: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 496 1457078] return 25, fallback to stat()
>>> 0/38: dwrite d0/f3[283 2 308134 1763236 496 1457078] [2084864,36864] 0
>>> 0/39: write d0/d4/f6[283 2 308134 1763236 496 2121728] [2749000,60139] 0
>>> 0/40: fallocate(ZERO_RANGE) d0/f3 [283 2 308134 1763236 688 2809139]t 3512660 81075 0
>>> 0/43: splice d0/f5[293 1 0 0 1872 2678784] [552619,59420] -> d0/f3[283 2 308134 1763236 856 3593735] [5603798,59420] 0
>>> 0/48: fallocate(KEEP_SIZE|PUNCH_HOLE) d0/f3 [283 1 308134 1763236 976 5663218]t 1361821 480392 0
>>> 0/49: clonerange d0/f3[283 1 308134 1763236 856 5663218] [2461696,53248] -> d0/f5[293 1 0 0 1872 2678784] [942080,53248]
>>>
>>> Note one thing: there is no direct/buffered write into inode 283 offset
>>> 8192.
>>>
>>> But from the trace events for root 5 inode 283:
>>>
>>> btrfs_do_write_iter: r/i=5/283 buffered fileoff=393216(393644) len=90112(88327)
>>> btrfs_do_write_iter: r/i=5/283 buffered fileoff=1396736(1400622) len=61440(56456)
>>> btrfs_do_write_iter: r/i=5/283 buffered fileoff=708608(709121) len=12288(7712)
>>>
>>> btrfs_do_write_iter: r/i=5/283 direct fileoff=8192(8192) len=73728(73728) <<<<<
>>>
>>> btrfs_do_write_iter: r/i=5/283 direct fileoff=589824(589824) len=16384(16384)
>>> btrfs_finish_one_ordered: r/i=5/283 fileoff=8192 len=73728
>>> btrfs_finish_one_ordered: r/i=5/283 fileoff=589824 len=16384
>>> btrfs_do_write_iter: r/i=5/283 direct fileoff=2084864(2084864) len=36864(36864)
>>> btrfs_finish_one_ordered: r/i=5/283 fileoff=2084864 len=36864
>>> btrfs_do_write_iter: r/i=5/283 buffered fileoff=2748416(2749000) len=61440(60139)
>>> btrfs_do_write_iter: r/i=5/283 buffered fileoff=5603328(5603798) len=61440(59420)
>>> btrfs_finish_one_ordered: r/i=5/283 fileoff=393216 len=90112
>>> btrfs_finish_one_ordered: r/i=5/283 fileoff=708608 len=12288
>>> btrfs_finish_one_ordered: r/i=5/283 fileoff=1396736 len=61440
>>> btrfs_finish_one_ordered: r/i=5/283 fileoff=3592192 len=4096
>>> btrfs_finish_one_ordered: r/i=5/283 fileoff=2748416 len=61440
>>> btrfs_finish_one_ordered: r/i=5/283 fileoff=5603328 len=61440
>>>
>>> Note the phantom direct IO call, which is in the corrupted range.
>>>
>>> If paired with fsstress, that phantom write happens between these two
>>> operations:
>>>
>>> 0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0
>>> 0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0
>>
>> Just to be more accurate, there is a 0/33 operation, which is:
>>
>> 0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[285 2 308134 1763236 320 1457078] return 25, fallback to stat()
>> 0/33: awrite - io_getevents failed -4
>>
>> The failed one doesn't have an inode number, thus it didn't get caught
>> by grep.
>>
>> Return value -4 means -EINTR, and I'm not sure who sent the interruption.
>> But if this interruption happens before the IO finishes, we can call
>> free() on the buffer, and if we're unlucky enough, the freed memory can
>> be re-allocated for some other usage, thus modifying the pages before
>> the writeback has finished.
>>
>> I think this is the direct cause of the data corruption: page
>> modification before the direct IO finished.
>>
>> But unfortunately I still don't understand why the interruption can
>> happen, nor how we should handle such an interruption.
>> (I guess just retry?)
>
> It's because you are mixing aio/io_uring, and the default setting for
> io_uring is to use signal based notifications for queueing task_work.
> This then causes a spurious -EINTR, which stops your io_getevents()
> wait. Looks like this is a bug in fsstress; it should just retry the
> wait if this happens. You can also configure the ring to not use signal
> based notifications, but that bug needs fixing regardless.
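As an aside, a rough liburing sketch of that second option might look like
the below. This is untested and illustrative only (the helper name, flag
choice and -EINVAL fallback are mine, not existing fsstress code);
IORING_SETUP_DEFER_TASKRUN needs a 6.1+ kernel plus a single-issuer ring,
IORING_SETUP_COOP_TASKRUN needs 5.19+.

#include <liburing.h>

/*
 * Sketch: create a ring whose completion handling does not rely on
 * signal based task_work notifications, so it cannot interrupt a
 * concurrent libaio io_getevents() wait with a spurious -EINTR.
 */
static int setup_ring_without_signal_notify(struct io_uring *ring,
					    unsigned entries)
{
	int ret;

	/*
	 * Preferred: defer task_work until this task waits on its own CQ
	 * (needs 6.1+ and a single submitting task).
	 */
	ret = io_uring_queue_init(entries, ring,
				  IORING_SETUP_SINGLE_ISSUER |
				  IORING_SETUP_DEFER_TASKRUN);
	if (ret == -EINVAL) {
		/* Older kernels (5.19+): cooperative task_work instead. */
		ret = io_uring_queue_init(entries, ring,
					  IORING_SETUP_COOP_TASKRUN);
	}
	return ret;
}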
Something like this will probably fix it.

diff --git a/ltp/fsstress.c b/ltp/fsstress.c
index 6641a525fe5d..05fbfd3f8cf8 100644
--- a/ltp/fsstress.c
+++ b/ltp/fsstress.c
@@ -2072,6 +2072,23 @@ void inode_info(char *str, size_t sz, struct stat64 *s, int verbose)
 			 (long long) s->st_blocks, (long long) s->st_size);
 }
 
+static int io_get_single_event(struct io_event *event)
+{
+	int ret;
+
+	do {
+		/*
+		 * We can get -EINTR if competing with io_uring using signal
+		 * based notifications. For that case, just retry the wait.
+		 */
+		ret = io_getevents(io_ctx, 1, 1, event, NULL);
+		if (ret != -EINTR)
+			break;
+	} while (1);
+
+	return ret;
+}
+
 void
 afsync_f(opnum_t opno, long r)
 {
@@ -2111,7 +2128,7 @@ afsync_f(opnum_t opno, long r)
 		close(fd);
 		return;
 	}
-	if ((e = io_getevents(io_ctx, 1, 1, &event, NULL)) != 1) {
+	if ((e = io_get_single_event(&event)) != 1) {
 		if (v)
 			printf("%d/%lld: afsync - io_getevents failed %d\n",
 			       procid, opno, e);
@@ -2220,10 +2237,10 @@ do_aio_rw(opnum_t opno, long r, int flags)
 	if ((e = io_submit(io_ctx, 1, iocbs)) != 1) {
 		if (v)
 			printf("%d/%lld: %s - io_submit failed %d\n",
-				procid, opno, iswrite ? "awrite" : "aread", e);
+			       procid, opno, iswrite ? "awrite" : "aread", e);
 		goto aio_out;
 	}
-	if ((e = io_getevents(io_ctx, 1, 1, &event, NULL)) != 1) {
+	if ((e = io_get_single_event(&event)) != 1) {
 		if (v)
 			printf("%d/%lld: %s - io_getevents failed %d\n",
 				procid, opno, iswrite ? "awrite" : "aread", e);

-- 
Jens Axboe