Re: Problem with direct IO

Zhengyuan Liu <liuzhengyuang521@xxxxxxxxx> · Thu, 21 Oct 2021 10:21:55 +0800

On Thu, Oct 21, 2021 at 1:37 AM Jan Kara <jack@xxxxxxx> wrote:
>
> On Wed 13-10-21 09:46:46, Zhengyuan Liu wrote:
> > Hi, all
> >
> > We are encountering the following MySQL crash while importing tables:
> >
> >     2021-09-26T11:22:17.825250Z 0 [ERROR] [MY-013622] [InnoDB] [FATAL]
> >     fsync() returned EIO, aborting.
> >     2021-09-26T11:22:17.825315Z 0 [ERROR] [MY-013183] [InnoDB]
> >     Assertion failure: ut0ut.cc:555 thread 281472996733168
> >
> > At the same time, dmesg showed the following message:
> >
> >     [ 4328.838972] Page cache invalidation failure on direct I/O.
> >     Possible data corruption due to collision with buffered I/O!
> >     [ 4328.850234] File: /data/mysql/data/sysbench/sbtest53.ibd PID:
> >     625 Comm: kworker/42:1
> >
> > At first we suspected that MySQL was mixing direct IO and buffered IO
> > on the same file, but after some checking we found that it only does
> > direct IO, using aio. The problem actually comes from the direct IO
> > write path (__generic_file_write_iter) itself.
> >
> > ssize_t __generic_file_write_iter()
> > {
> > ...
> >         if (iocb->ki_flags & IOCB_DIRECT) {
> >                 loff_t pos, endbyte;
> >
> >                 written = generic_file_direct_write(iocb, from);
> >                 /*
> >                  * If the write stopped short of completing, fall back to
> >                  * buffered writes.  Some filesystems do this for writes to
> >                  * holes, for example.  For DAX files, a buffered write will
> >                  * not succeed (even if it did, DAX does not handle dirty
> >                  * page-cache pages correctly).
> >                  */
> >                 if (written < 0 || !iov_iter_count(from) || IS_DAX(inode))
> >                         goto out;
> >
> >                 status = generic_perform_write(file, from, pos = iocb->ki_pos);
> > ...
> > }
> >
> > From the code snippet above we can see that direct IO can fall back to
> > buffered IO under certain conditions, so even though MySQL only does
> > direct IO, it can end up interleaved with buffered IO when the fallback
> > occurs. I currently have no idea why the FS (ext3) failed the direct IO,
> > but it is strange that __generic_file_write_iter makes direct IO fall
> > back to buffered IO; this seems to break the semantics of direct IO.
> >
> > The reproduction environment is:
> > Platform:  Kunpeng 920 (arm64)
> > Kernel: V5.15-rc
> > PAGESIZE: 64K
> > Mysql:  V8.0
> > Innodb_page_size: default(16K)
>
> Thanks for the report. I agree this should not happen. How hard is this to
> reproduce? Any idea whether the fallback to buffered IO happens because
> iomap_dio_rw() returns -ENOTBLK or because it returns a short write?

It is easy to reproduce in my test environment. As I said in my earlier
reply to Andrew, this problem is related to the kernel page size.
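
To illustrate why the page size could matter, here is a minimal sketch. It
shows one plausible mechanism, not a confirmed root cause. With 64K kernel
pages, four 16K InnoDB pages share a single page-cache page, so if one 16K
direct write falls back to buffered IO, the resulting dirty page also covers
the neighbouring 16K ranges that later direct writes target. With 4K pages,
adjacent 16K writes never touch the same page-cache page. The helper names
below are made up for illustration and are not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sizes taken from the reported environment. */
#define KERNEL_PAGE_SIZE (64u * 1024u) /* PAGESIZE: 64K */
#define INNODB_PAGE_SIZE (16u * 1024u) /* Innodb_page_size: 16K */

/* Hypothetical helper: index of the page-cache page covering a file offset. */
static inline uint64_t cache_page_index(uint64_t offset, uint64_t page_size)
{
    return offset / page_size;
}

/*
 * Hypothetical helper: true if two non-empty byte ranges touch at least one
 * common page-cache page, even when the byte ranges themselves are disjoint.
 */
static inline bool ranges_share_cache_page(uint64_t off_a, uint64_t len_a,
                                           uint64_t off_b, uint64_t len_b,
                                           uint64_t page_size)
{
    assert(len_a > 0 && len_b > 0);

    uint64_t first_a = cache_page_index(off_a, page_size);
    uint64_t last_a  = cache_page_index(off_a + len_a - 1, page_size);
    uint64_t first_b = cache_page_index(off_b, page_size);
    uint64_t last_b  = cache_page_index(off_b + len_b - 1, page_size);

    /* Two closed index intervals intersect iff each starts before the other ends. */
    return first_a <= last_b && first_b <= last_a;
}
```

For two adjacent InnoDB pages at offsets 0 and 16K, the second helper returns
true with a 64K page size and false with a 4K page size, which matches the
observation that the failure depends on the kernel page size.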

> Can you post output of "dumpe2fs -h <device>" for the filesystem where the
> problem happens? Thanks!

Sure, the output is:

# dumpe2fs -h /dev/sda3
dumpe2fs 1.45.3 (14-Jul-2019)
Filesystem volume name:   <none>
Last mounted on:          /data
Filesystem UUID:          09a51146-b325-48bb-be63-c9df539a90a1
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery sparse_super large_file
Filesystem flags:         unsigned_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              11034624
Block count:              44138240
Reserved block count:     2206912
Free blocks:              43168100
Free inodes:              11034613
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      1013
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Filesystem created:       Thu Oct 21 09:42:03 2021
Last mount time:          Thu Oct 21 09:43:36 2021
Last write time:          Thu Oct 21 09:43:36 2021
Mount count:              1
Maximum mount count:      -1
Last checked:             Thu Oct 21 09:42:03 2021
Check interval:           0 (<none>)
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     32
Desired extra isize:      32
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      a7b04e61-1209-496d-ab9d-a51009b51ddb
Journal backup:           inode blocks
Journal features:         journal_incompat_revoke
Journal size:             1024M
Journal length:           262144
Journal sequence:         0x00000002
Journal start:            1

BTW, we have also tested ext4 and XFS and did not see the direct write
fall back to buffered IO there.

Thanks,