Re: [PATCH -RFC 0/2] mm/ext4: avoid data corruption when extending DIO write race with buffered read

Baokun Li <libaokun1@xxxxxxxxxx> · Thu, 7 Dec 2023 22:15:55 +0800

On 2023/12/7 3:37, Jan Kara wrote:
On Tue 05-12-23 20:50:30, Baokun Li wrote:
On 2023/12/4 22:41, Jan Kara wrote:
On Mon 04-12-23 21:50:18, Baokun Li wrote:
On 2023/12/4 20:11, Jan Kara wrote:
The problem is with a one-master-twoslave MYSQL database with three
physical machines, and using sysbench pressure testing on each of the
three machines, the problem occurs about once every two to three hours.

The problem is with the relay log file, and when the problem occurs, the
middle dozens of bytes of the file are read as all zeros, while the data on
disk is not. This is a journal-like file where a write process gets the data
from
the master node and writes it locally, and another replay process reads the
file and performs the replay operation accordingly (some SQL statements).
The problem is that when replaying, it finds that the data read is
corrupted,
not valid SQL data, while the data on disk is normal.

It's not confirmed that buffered reads vs direct IO writes is actually
causing this issue, but this is the only scenario that we can reproduce
with our local simplified scripts. Also, after merging in patch 1, the
MYSQL pressure test scenario has now been tested for 5 days and has not
been reproduced.

I'll double-check the problem scenario, although buffered reads with
buffered writes doesn't seem to have this problem.
Yeah, from what you write it seems that the replay code is using buffered
reads on the journal file. I guess you could confirm that with a bit of
kernel tracing but the symptoms look pretty convincing. Did you try talking
to MYSQL guys about why they are doing this?
The operations performed on the relay log file are buffered reads and
writes, which I confirmed with the following bpftrace script:
```
#include <linux/fs.h>
#include <linux/path.h>
#include <linux/dcache.h>

kprobe:generic_file_buffered_read /!strncmp(str(((struct kiocb
*)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/ {
     printf("read path: %s\n", str(((struct kiocb
*)arg0)->ki_filp->f_path.dentry->d_name.name));
}

kprobe:ext4_buffered_write_iter /!strncmp(str(((struct kiocb
*)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/ {
     printf("write path: %s\n", str(((struct kiocb
*)arg0)->ki_filp->f_path.dentry->d_name.name));
}
```
I suspect there are DIO writes causing the problem, but I haven't caught
any DIO writes to such files via bpftrace.
Interesting. Not sure how your partially zeroed-out buffers could happen
with fully buffered IO.

After looking at the code again and again, the following concurrency
seems to bypass the memory barrier:

ext4_buffered_write_iter
 generic_perform_write
  copy_page_from_iter_atomic
  ext4_da_write_end
   ext4_da_do_write_end
    block_write_end
     __block_commit_write
      folio_mark_uptodate
       smp_wmb()
       set_bit(PG_uptodate, folio_flags(folio, 0))
    i_size_write(inode, pos + copied)
    // write isize 2048
    unlock_page(page)

ext4_file_read_iter
 generic_file_read_iter
  filemap_read
   filemap_get_pages
    filemap_get_read_batch
    folio_test_uptodate(folio)
     ret = test_bit(PG_uptodate, folio_flags(folio, 0));
     if (ret)
      smp_rmb();
      // The read barrier here ensures
      // that data 0-2048 in the page is synchronized.
                           ext4_buffered_write_iter
                            generic_perform_write
                             copy_page_from_iter_atomic
                             ext4_da_write_end
                              ext4_da_do_write_end
                               block_write_end
                                __block_commit_write
                                 folio_mark_uptodate
                                  smp_wmb()
                                  set_bit(PG_uptodate, 
folio_flags(folio, 0))
                               i_size_write(inode, pos + copied)
                               // write isize 4096
                               unlock_page(page)
   // read isize 4096
   isize = i_size_read(inode)
   // But there is no read barrier here,
   // so the data in the 2048-4096 range
   // may not be synchronized yet !!!
   copy_page_to_iter()
   // copyout 4096

In the concurrency above, we read the updated i_size, but there is
no read barrier to ensure that the data in the page is the same as
the i_size at this point. Therefore, we may copy the unsynchronized
page out. Is it normal for us to read zero-filled data in this case?

Thanks!
--
With Best Regards,
Baokun Li
.