On 2023/12/7 3:37, Jan Kara wrote:
On Tue 05-12-23 20:50:30, Baokun Li wrote:
On 2023/12/4 22:41, Jan Kara wrote:
On Mon 04-12-23 21:50:18, Baokun Li wrote:
On 2023/12/4 20:11, Jan Kara wrote:
The problem is with a one-master-twoslave MYSQL database with three
physical machines, and using sysbench pressure testing on each of the
three machines, the problem occurs about once every two to three hours.
The problem is with the relay log file, and when the problem occurs, the
middle dozens of bytes of the file are read as all zeros, while the data on
disk is not. This is a journal-like file where a write process gets the data
from
the master node and writes it locally, and another replay process reads the
file and performs the replay operation accordingly (some SQL statements).
The problem is that when replaying, it finds that the data read is
corrupted,
not valid SQL data, while the data on disk is normal.
It's not confirmed that buffered reads vs direct IO writes is actually
causing this issue, but this is the only scenario that we can reproduce
with our local simplified scripts. Also, after merging in patch 1, the
MYSQL pressure test scenario has now been tested for 5 days and has not
been reproduced.
I'll double-check the problem scenario, although buffered reads with
buffered writes doesn't seem to have this problem.
Yeah, from what you write it seems that the replay code is using buffered
reads on the journal file. I guess you could confirm that with a bit of
kernel tracing but the symptoms look pretty convincing. Did you try talking
to MYSQL guys about why they are doing this?
The operations performed on the relay log file are buffered reads and
writes, which I confirmed with the following bpftrace script:
```
#include <linux/fs.h>
#include <linux/path.h>
#include <linux/dcache.h>
kprobe:generic_file_buffered_read /!strncmp(str(((struct kiocb
*)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/ {
printf("read path: %s\n", str(((struct kiocb
*)arg0)->ki_filp->f_path.dentry->d_name.name));
}
kprobe:ext4_buffered_write_iter /!strncmp(str(((struct kiocb
*)arg0)->ki_filp->f_path.dentry->d_name.name), "relay", 5)/ {
printf("write path: %s\n", str(((struct kiocb
*)arg0)->ki_filp->f_path.dentry->d_name.name));
}
```
I suspect there are DIO writes causing the problem, but I haven't caught
any DIO writes to such files via bpftrace.
Interesting. Not sure how your partially zeroed-out buffers could happen
with fully buffered IO.
After looking at the code again and again, the following concurrency
seems to bypass the memory barrier:
ext4_buffered_write_iter
generic_perform_write
copy_page_from_iter_atomic
ext4_da_write_end
ext4_da_do_write_end
block_write_end
__block_commit_write
folio_mark_uptodate
smp_wmb()
set_bit(PG_uptodate, folio_flags(folio, 0))
i_size_write(inode, pos + copied)
// write isize 2048
unlock_page(page)
ext4_file_read_iter
generic_file_read_iter
filemap_read
filemap_get_pages
filemap_get_read_batch
folio_test_uptodate(folio)
ret = test_bit(PG_uptodate, folio_flags(folio, 0));
if (ret)
smp_rmb();
// The read barrier here ensures
// that data 0-2048 in the page is synchronized.
ext4_buffered_write_iter
generic_perform_write
copy_page_from_iter_atomic
ext4_da_write_end
ext4_da_do_write_end
block_write_end
__block_commit_write
folio_mark_uptodate
smp_wmb()
set_bit(PG_uptodate,
folio_flags(folio, 0))
i_size_write(inode, pos + copied)
// write isize 4096
unlock_page(page)
// read isize 4096
isize = i_size_read(inode)
// But there is no read barrier here,
// so the data in the 2048-4096 range
// may not be synchronized yet !!!
copy_page_to_iter()
// copyout 4096
In the concurrency above, we read the updated i_size, but there is
no read barrier to ensure that the data in the page is the same as
the i_size at this point. Therefore, we may copy the unsynchronized
page out. Is it normal for us to read zero-filled data in this case?
Thanks!
--
With Best Regards,
Baokun Li
.