Re: Possible io_uring related race leads to btrfs data csum mismatch

Qu Wenruo <quwenruo.btrfs@xxxxxxx> · Tue, 22 Aug 2023 05:42:40 +0800

On 2023/8/21 22:57, Jens Axboe wrote:
On 8/20/23 6:38 PM, Qu Wenruo wrote:


On 2023/8/20 22:11, Jens Axboe wrote:
On 8/20/23 7:26 AM, Jens Axboe wrote:
On 8/19/23 6:22 PM, Qu Wenruo wrote:


On 2023/8/20 07:59, Qu Wenruo wrote:
Hi Jens

I tried more on my side to debug the situation, and found a very weird
write behavior:

       Some unexpected direct IO happened, without corresponding
       fsstress workload.

The workload is:

       $fsstress -p 7 -n 50 -s 1691396493 -w -d $mnt -v > /tmp/fsstress

Which I can reliably reproduce the problem locally, around 1/50
possibility.
In my particular case, it results data corruption at root 5 inode 283
offset 8192.

Then I added some trace points for the following functions:

- btrfs_do_write_iter()
     Two trace points, one before btrfs_direct_write(), and one
     before btrfs_buffered_write(), outputting the aligned and unaligned
     write range, root/inode number, type of the write (buffered or
     direct).

- btrfs_finish_one_ordered()
     This is where btrfs inserts its ordered extent into the subvolume
     tree.
     This happens when a range of pages finishes its writeback.

Then here comes the fsstress log for inode 283 (no btrfs root number):

0/22: clonerange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0]
[307200,0]
0/23: copyrange d0/f2[283 1 0 0 0 0] [0,0] -> d0/f2[283 1 0 0 0 0]
[1058819,0]
0/25: write d0/f2[283 2 0 0 0 0] [393644,88327] 0
0/29: fallocate(INSERT_RANGE) d0/f3 [283 2 0 0 176 481971]t 884736
585728 95
0/30: uring_write d0/f3[283 2 0 0 176 481971] [1400622, 56456(res=56456)] 0
0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0
0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[283 2 308134 1763236 320
1457078] return 25, fallback to stat()
0/34: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 320
1457078] return 25, fallback to stat()
0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0
0/38: dwrite - xfsctl(XFS_IOC_DIOINFO) d0/f3[283 2 308134 1763236 496
1457078] return 25, fallback to stat()
0/38: dwrite d0/f3[283 2 308134 1763236 496 1457078] [2084864,36864] 0
0/39: write d0/d4/f6[283 2 308134 1763236 496 2121728] [2749000,60139] 0
0/40: fallocate(ZERO_RANGE) d0/f3 [283 2 308134 1763236 688 2809139]t
3512660 81075 0
0/43: splice d0/f5[293 1 0 0 1872 2678784] [552619,59420] -> d0/f3[283 2
308134 1763236 856 3593735] [5603798,59420] 0
0/48: fallocate(KEEP_SIZE|PUNCH_HOLE) d0/f3 [283 1 308134 1763236 976
5663218]t 1361821 480392 0
0/49: clonerange d0/f3[283 1 308134 1763236 856 5663218] [2461696,53248]
-> d0/f5[293 1 0 0 1872 2678784] [942080,53248]

Note one thing, there is no direct/buffered write into inode 283 offset
8192.

But from the trace events for root 5 inode 283:

    btrfs_do_write_iter: r/i=5/283 buffered fileoff=393216(393644)
len=90112(88327)
    btrfs_do_write_iter: r/i=5/283 buffered fileoff=1396736(1400622)
len=61440(56456)
    btrfs_do_write_iter: r/i=5/283 buffered fileoff=708608(709121)
len=12288(7712)

    btrfs_do_write_iter: r/i=5/283 direct fileoff=8192(8192)
len=73728(73728) <<<<<

    btrfs_do_write_iter: r/i=5/283 direct fileoff=589824(589824)
len=16384(16384)
    btrfs_finish_one_ordered: r/i=5/283 fileoff=8192 len=73728
    btrfs_finish_one_ordered: r/i=5/283 fileoff=589824 len=16384
    btrfs_do_write_iter: r/i=5/283 direct fileoff=2084864(2084864)
len=36864(36864)
    btrfs_finish_one_ordered: r/i=5/283 fileoff=2084864 len=36864
    btrfs_do_write_iter: r/i=5/283 buffered fileoff=2748416(2749000)
len=61440(60139)
    btrfs_do_write_iter: r/i=5/283 buffered fileoff=5603328(5603798)
len=61440(59420)
    btrfs_finish_one_ordered: r/i=5/283 fileoff=393216 len=90112
    btrfs_finish_one_ordered: r/i=5/283 fileoff=708608 len=12288
    btrfs_finish_one_ordered: r/i=5/283 fileoff=1396736 len=61440
    btrfs_finish_one_ordered: r/i=5/283 fileoff=3592192 len=4096
    btrfs_finish_one_ordered: r/i=5/283 fileoff=2748416 len=61440
    btrfs_finish_one_ordered: r/i=5/283 fileoff=5603328 len=61440

Note that phantom direct IO call, which is in the corrupted range.

If paired with fsstress, that phantom write happens between the two
operations:

0/31: writev d0/f3[283 2 0 0 296 1457078] [709121,8,964] 0
0/34: dwrite d0/f3[283 2 308134 1763236 320 1457078] [589824,16384] 0

Just to be more accurate, there is a 0/33 operation, which is:

0/33: do_aio_rw - xfsctl(XFS_IOC_DIOINFO) d0/f2[285 2 308134 1763236 320
1457078] return 25, fallback to stat()
0/33: awrite - io_getevents failed -4

The failed one doesn't have inode number thus it didn't get caught by grep.

Return value -4 means -INTR, not sure who sent the interruption.
But if this interruption happens before the IO finished, we can call
free() on the buffer, and if we're unlucky enough, the freed memory can
be re-allocated for some other usage, thus modifying the pages before
the writeback finished.

I think this is the direct cause of the data corruption, page
modification before direct IO finished.

But unfortunately I still didn't get why the interruption can happen,
nor how can we handle such interruption?
(I guess just retry?)

It's because you are mixing aio/io_uring, and the default settings for
io_uring is to use signal based notifications for queueing task_work.
This then causes a spurious -EINTR, which stops your io_getevents()
wait. Looks like this is a bug in fsstress, it should just retry the
wait if this happens. You can also configure the ring to not use signal
based notifications, but that bug needs fixing regardless.

Something like this will probably fix it.

diff --git a/ltp/fsstress.c b/ltp/fsstress.c
index 6641a525fe5d..05fbfd3f8cf8 100644
--- a/ltp/fsstress.c
+++ b/ltp/fsstress.c
@@ -2072,6 +2072,23 @@ void inode_info(char *str, size_t sz, struct stat64 *s, int verbose)
                (long long) s->st_blocks, (long long) s->st_size);
   }

+static int io_get_single_event(struct io_event *event)
+{
+    int ret;
+
+    do {
+        /*
+         * We can get -EINTR if competing with io_uring using signal
+         * based notifications. For that case, just retry the wait.
+         */
+        ret = io_getevents(io_ctx, 1, 1, event, NULL);
+        if (ret != -EINTR)
+            break;
+    } while (1);
+
+    return ret;
+}
+
   void
   afsync_f(opnum_t opno, long r)
   {
@@ -2111,7 +2128,7 @@ afsync_f(opnum_t opno, long r)
           close(fd);
           return;
       }
-    if ((e = io_getevents(io_ctx, 1, 1, &event, NULL)) != 1) {
+    if ((e = io_get_single_event(&event)) != 1) {
           if (v)
               printf("%d/%lld: afsync - io_getevents failed %d\n",
                      procid, opno, e);
@@ -2220,10 +2237,10 @@ do_aio_rw(opnum_t opno, long r, int flags)
       if ((e = io_submit(io_ctx, 1, iocbs)) != 1) {
           if (v)
               printf("%d/%lld: %s - io_submit failed %d\n",
-                   procid, opno, iswrite ? "awrite" : "aread", e);
+                    procid, opno, iswrite ? "awrite" : "aread", e);
           goto aio_out;
       }
-    if ((e = io_getevents(io_ctx, 1, 1, &event, NULL)) != 1) {
+    if ((e = io_get_single_event(&event)) != 1) {
           if (v)
               printf("%d/%lld: %s - io_getevents failed %d\n",
                      procid, opno, iswrite ? "awrite" : "aread", e);

Exactly what I sent for fsstress:
https://lore.kernel.org/linux-btrfs/20230820010219.12907-1-wqu@xxxxxxxx/T/#u

It's not really, as you only did the one case of io_getevents(). What
happens if the other one gets EINTR and aborts, now we do a rw operation
and the first event returned is the one from the fsync?

Oh my bad, forgot there is another one.

Would fix it in the next update.

Thanks,
Qu

You should not just fix up the one that you happened to hit, fix up both
of them.