On Sun, Nov 24, 2024 at 09:44:35AM -0800, Darrick J. Wong wrote: > On Sun, Nov 24, 2024 at 05:42:53PM +0800, Hao-ran Zheng wrote: > > A data race may occur when the function `inode_set_ctime_to_ts()` and > > the function `inode_get_ctime_sec()` are executed concurrently. When > > two threads call `aio_read` and `aio_write` respectively, they will > > be distributed to the read and write functions of the corresponding > > file system respectively. Taking the btrfs file system as an example, > > the `btrfs_file_read_iter` and `btrfs_file_write_iter` functions are > > finally called. These two functions created a data race when they > > finally called `inode_get_ctime_sec()` and `inode_set_ctime_to_ns()`. > > The specific call stack that appears during testing is as follows: > > > > ============DATA_RACE============ > > btrfs_delayed_update_inode+0x1f61/0x7ce0 [btrfs] > > btrfs_update_inode+0x45e/0xbb0 [btrfs] > > btrfs_dirty_inode+0x2b8/0x530 [btrfs] > > btrfs_update_time+0x1ad/0x230 [btrfs] > > touch_atime+0x211/0x440 > > filemap_read+0x90f/0xa20 > > btrfs_file_read_iter+0xeb/0x580 [btrfs] > > aio_read+0x275/0x3a0 > > io_submit_one+0xd22/0x1ce0 > > __se_sys_io_submit+0xb3/0x250 > > do_syscall_64+0xc1/0x190 > > entry_SYSCALL_64_after_hwframe+0x77/0x7f > > ============OTHER_INFO============ > > btrfs_write_check+0xa15/0x1390 [btrfs] > > btrfs_buffered_write+0x52f/0x29d0 [btrfs] > > btrfs_do_write_iter+0x53d/0x1590 [btrfs] > > btrfs_file_write_iter+0x41/0x60 [btrfs] > > aio_write+0x41e/0x5f0 > > io_submit_one+0xd42/0x1ce0 > > __se_sys_io_submit+0xb3/0x250 > > do_syscall_64+0xc1/0x190 > > entry_SYSCALL_64_after_hwframe+0x77/0x7f > > > > To address this issue, it is recommended to add WRITE_ONCE > > and READ_ONCE when reading or writing the `inode->i_ctime_sec` > > and `inode->i_ctime_nsec` variable. > > Excuse my ignorance, but how exactly does this annotation fix the race? > Does it prevent reordering of the memory accesses or something? "it is > recommended" is not enough for dunces such as myself to figure out why > this fixes the problem. :( > It prevents the compiler from getting ideas. One hypothetical is splitting the load from one asm op to several, possibly resulting a corrupted value when racing against an update A not hypothethical concerns some like this: time64_t sec = inode_get_ctime_sec(inode); /* do stuff with sec */ /* do other stuff */ /* do more stuff sec */ The compiler might have decided to throw away the read sec value and do the load again later. Thus if an update happened in the meantime then the code ends up operating on 2 different values, even though the programmer clearly intended it to be stable. However, since both sec and nsec are updated separately and there is no synchro, reading *both* can still result in values from 2 different updates which is a bug not addressed by any of the above. To my underestanding of the vfs folk take on it this is considered tolerable.