Re: Problem with direct IO

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On 2021/10/21 16:03, Jan Kara wrote:
On Thu 21-10-21 10:21:55, Zhengyuan Liu wrote:
On Thu, Oct 21, 2021 at 1:37 AM Jan Kara <jack@xxxxxxx> wrote:
On Wed 13-10-21 09:46:46, Zhengyuan Liu wrote:
we are encounting following Mysql crash problem while importing tables :

    2021-09-26T11:22:17.825250Z 0 [ERROR] [MY-013622] [InnoDB] [FATAL]
    fsync() returned EIO, aborting.
    2021-09-26T11:22:17.825315Z 0 [ERROR] [MY-013183] [InnoDB]
    Assertion failure: ut0ut.cc:555 thread 281472996733168

At the same time , we found dmesg had following message:

    [ 4328.838972] Page cache invalidation failure on direct I/O.
    Possible data corruption due to collision with buffered I/O!
    [ 4328.850234] File: /data/mysql/data/sysbench/sbtest53.ibd PID:
    625 Comm: kworker/42:1

Firstly, we doubled Mysql has operating the file with direct IO and
buffered IO interlaced, but after some checking we found it did only
do direct IO using aio. The problem is exactly from direct-io
interface (__generic_file_write_iter) itself.

ssize_t __generic_file_write_iter()
{
...
        if (iocb->ki_flags & IOCB_DIRECT) {
                loff_t pos, endbyte;

                written = generic_file_direct_write(iocb, from);
                /*
                 * If the write stopped short of completing, fall back to
                 * buffered writes.  Some filesystems do this for writes to
                 * holes, for example.  For DAX files, a buffered write will
                 * not succeed (even if it did, DAX does not handle dirty
                 * page-cache pages correctly).
                 */
                if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
                        goto out;

                status = generic_perform_write(file, from, pos = iocb->ki_pos);
...
}

>From above code snippet we can see that direct io could fall back to
buffered IO under certain conditions, so even Mysql only did direct IO
it could interleave with buffered IO when fall back occurred. I have
no idea why FS(ext3) failed the direct IO currently, but it is strange
__generic_file_write_iter make direct IO fall back to buffered IO, it
seems  breaking the semantics of direct IO.

The reproduced  environment is:
Platform:  Kunpeng 920 (arm64)
Kernel: V5.15-rc
PAGESIZE: 64K
Mysql:  V8.0
Innodb_page_size: default(16K)
Thanks for report. I agree this should not happen. How hard is this to
reproduce? Any idea whether the fallback to buffered IO happens because
iomap_dio_rw() returns -ENOTBLK or because it returns short write?
It is easy to reproduce in my test environment, as I said in the previous
email replied to Andrew this problem is related to kernel page size.
Ok, can you share a reproducer?

Can you post output of "dumpe2fs -h <device>" for the filesystem where the
problem happens? Thanks!
Sure, the output is:

# dumpe2fs -h /dev/sda3
dumpe2fs 1.45.3 (14-Jul-2019)
Filesystem volume name:   <none>
Last mounted on:          /data
Filesystem UUID:          09a51146-b325-48bb-be63-c9df539a90a1
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index
filetype needs_recovery sparse_super large_file
Thanks for the data. OK, a filesystem without extents. Does your test by
any chance try to do direct IO to a hole in a file? Because that is not
(and never was) supported without extents. Also the fact that you don't see
the problem with ext4 (which means extents support) would be pointing in
that direction.

								Honza

I try to reprodeuce this issue as follows, maybe this is just one scenario :
step1:  Modify  kernel code, add delay in dio_complete and dio_bio_end_aio
step2:  mkfs.ext4 /dev/sda
step3:  mount /dev/sda  /home/test
step4:  fio -filename=/home/test/test  -direct=0 -buffered=1 -iodepth 1 -thread -rw=write -rwmixread=70 -ioengine=libaio -bs=4k -size=16k -numjobs=1 -name=test_r_w
step5:  fio -filename=/home/test/test  -direct=1 -buffered=0 -iodepth 1 -thread -rw=write -rwmixread=70 -ioengine=libaio -bs=4k -size=4k -numjobs=1 -name=test_r_w &
step6:  fio -filename=/home/test/test  -direct=0 -buffered=1 -iodepth 1 -thread -rw=write -rwmixread=70 -ioengine=libaio -bs=4k -size=16k -numjobs=1 -name=test_r_w

Kernel modification base on v4.19:
diff --git a/fs/direct-io.c b/fs/direct-io.c
+#include <linux/delay.h>

/**
@@ -305,6 +307,8 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, unsigned int flags)
         * end_io() when necessary, otherwise a racing buffer read would cache
         * zeros from unwritten extents.
         */
+       printk("%s: before invalidate page \n", __func__);
+       mdelay(10 * 1000);
        if (flags & DIO_COMPLETE_INVALIDATE &&
            ret > 0 && dio->op == REQ_OP_WRITE &&
            dio->inode->i_mapping->nrpages) {
@@ -314,6 +318,7 @@ static ssize_t dio_complete(struct dio *dio, ssize_t ret, unsigned int flags)
                if (err)
                        dio_warn_stale_pagecache(dio->iocb->ki_filp);
        }
+       printk("%s: end invalidate page \n", __func__);

        inode_dio_end(dio->inode);

@@ -371,10 +378,17 @@ static void dio_bio_end_aio(struct bio *bio)
                 * went in between AIO submission and completion into the
                 * same region.
                 */
+               printk("%s:start to  sleep 5s.... \n", __func__);
+               mdelay(10 * 1000);
+               printk("%s: endto  sleep, defer_completion=%d op=%d nrpages=%d\n",
+                       __func__,dio->defer_completion, dio->op, dio->inode->i_mapping->nrpages);
+
                if (dio->result)
                        defer_completion = dio->defer_completion ||
                                           (dio->op == REQ_OP_WRITE &&
                                            dio->inode->i_mapping->nrpages);
+               printk("%s: dio=%px defer_completion=%d\n",
+                       __func__, dio, defer_completion);
                if (defer_completion) {
                        INIT_WORK(&dio->complete_work, dio_aio_complete_work);
                        queue_work(dio->inode->i_sb->s_dio_done_wq,


Reproduce result :
s=1 -name=test_r_w=write -rwmixread=70 -ioengine=libaio -bs=4k -size=16k -numjob
test_r_w: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.7
Starting 1 thread
test_r_w: (groupid=0, jobs=1): err= 0: pid=2083: Sun Sep 26 03:18:06 2021
  write: IOPS=444, BW=1778KiB/s (1820kB/s)(16.0KiB/9msec)
    slat (usec): min=1294, max=3789, avg=2007.88, stdev=1191.07
    clat (nsec): min=2304, max=28645, avg=9918.75, stdev=12574.80
     lat (usec): min=1298, max=3824, avg=2020.36, stdev=1206.00
    clat percentiles (nsec):
     |  1.00th=[ 2320],  5.00th=[ 2320], 10.00th=[ 2320], 20.00th=[ 2320],
     | 30.00th=[ 2960], 40.00th=[ 2960], 50.00th=[ 2960], 60.00th=[ 5792],
     | 70.00th=[ 5792], 80.00th=[28544], 90.00th=[28544], 95.00th=[28544],
     | 99.00th=[28544], 99.50th=[28544], 99.90th=[28544], 99.95th=[28544],
     | 99.99th=[28544]
  lat (usec)   : 4=50.00%, 10=25.00%, 50=25.00%
  cpu          : usr=0.00%, sys=87.50%, ctx=0, majf=0, minf=1
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,4,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1778KiB/s (1820kB/s), 1778KiB/s-1778KiB/s (1820kB/s-1820kB/s), io=16.0KiB (16.4kB), run=9-9msec

Disk stats (read/write):
  sda: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
=1 -name=test_r_w &write -rwmixread=70 -ioengine=libaio -bs=4k -size=4k -numjobs
[1] 2087
[root@localhost home]# test_r_w: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.7
Starting 1 thread

[root@localhost home]# [ 1475.286928] ext4_file_write_iter: start lock inode
[ 1475.288807] ext4_file_write_iter: unlock inode
[ 1475.290402] dio_bio_end_aio:start to  sleep 5s....

s=1 -name=test_r_w=write -rwmixread=70 -ioengine=libaio -bs=4k -size=16k -numjob
test_r_w: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=1
fio-3.7
Starting 1 thread
Jobs: 1 (f=0): [f(1)][-.-%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 00m:00s]
Jobs: 1 (f=0): [f(1)][-.-%][r=0KiB/s,w=0KiB/s][r=0,w=0 IOPS][eta 00m:00s]
[ 1485.292939] dio_bio_end_aio: endto  sleep, defer_completion=0 op=1 nrpages=4
[ 1485.293888] dio_bio_end_aio: dio=ffff88839aff5200 defer_completion=1
[ 1485.294821] dio_aio_complete_work: dio=ffff88839aff5200...
Jobs: 1 (f=1): [W(1)][-.-%][r=0KiB/s,w=0KiB/s][r[ 1485.296539] dio_complete: before invalidate page
[ 1495.298472] Page cache invalidation failure on direct I/O.  Possible data corruption due to collision with buffered I/O!     -->detect  buffer io and direct io conflict.
[ 1495.302746] File: /home/test/test PID: 493 Comm: kworker/3:2
[ 1495.303508] CPU: 3 PID: 493 Comm: kworker/3:2 Not tainted 4.19.90-dirty #96
[ 1495.304386] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
Jobs: 1 (f=1): [[ 1495.306050] Workqueue: dio/sda dio_aio_complete_work
W(1)][-.-%][r=0K
[ 1495.316879] dio_complete: end invalidate page
test_r_w: (groupid=0, jobs=1): err= 0: pid=2092: Sun Sep 26 03:18:37 2021
  write: IOPS=266, BW=1067KiB/s (1092kB/s)(16.0KiB/15msec)
    slat (usec): min=2001, max=3101, avg=2691.97, stdev=480.07
    clat (nsec): min=2058, max=12337, avg=5999.50, stdev=4456.54
     lat (usec): min=2004, max=3116, avg=2700.11, stdev=484.22
    clat percentiles (nsec):
     |  1.00th=[ 2064],  5.00th=[ 2064], 10.00th=[ 2064], 20.00th=[ 2064],
     | 30.00th=[ 4080], 40.00th=[ 4080], 50.00th=[ 4080], 60.00th=[ 5536],
     | 70.00th=[ 5536], 80.00th=[12352], 90.00th=[12352], 95.00th=[12352],
     | 99.00th=[12352], 99.50th=[12352], 99.90th=[12352], 99.95th=[12352],
     | 99.99th=[12352]
  lat (usec)   : 4=25.00%, 10=50.00%, 20=25.00%
  cpu          : usr=0.00%, sys=100.00%, ctx=1, majf=0, minf=1
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,4,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=1067KiB/s (1092kB/s), 1067KiB/s-1067KiB/s (1092kB/s-1092kB/s), io=16.0KiB (16.4kB), run=15-15msec

Disk stats (read/write):
  sda: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

test_r_w: (groupid=0, jobs=1): err= 0: pid=2089: Sun Sep 26 03:18:37 2021
  write: IOPS=0, BW=204B/s (204B/s)(4096B/20033msec)
    slat (nsec): min=3358.7k, max=3358.7k, avg=3358748.00, stdev= 0.00
    clat (nsec): min=20027M, max=20027M, avg=20027265355.00, stdev= 0.00
     lat (nsec): min=20031M, max=20031M, avg=20030627283.00, stdev= 0.00
    clat percentiles (msec):
     |  1.00th=[17113],  5.00th=[17113], 10.00th=[17113], 20.00th=[17113],
     | 30.00th=[17113], 40.00th=[17113], 50.00th=[17113], 60.00th=[17113],
     | 70.00th=[17113], 80.00th=[17113], 90.00th=[17113], 95.00th=[17113],
     | 99.00th=[17113], 99.50th=[17113], 99.90th=[17113], 99.95th=[17113],
     | 99.99th=[17113]
  cpu          : usr=0.00%, sys=0.02%, ctx=2, majf=0, minf=1
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,1,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=204B/s (204B/s), 204B/s-204B/s (204B/s-204B/s), io=4096B (4096B), run=20033-20033msec

Disk stats (read/write):
  sda: ios=0/5, merge=0/1, ticks=0/10018, in_queue=10015, util=36.98%




[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux