On Sun, Nov 24, 2024 at 06:56:57PM +0100, Mateusz Guzik wrote: > On Sun, Nov 24, 2024 at 09:44:35AM -0800, Darrick J. Wong wrote: > > On Sun, Nov 24, 2024 at 05:42:53PM +0800, Hao-ran Zheng wrote: > > > A data race may occur when the function `inode_set_ctime_to_ts()` and > > > the function `inode_get_ctime_sec()` are executed concurrently. When > > > two threads call `aio_read` and `aio_write` respectively, they will > > > be distributed to the read and write functions of the corresponding > > > file system respectively. Taking the btrfs file system as an example, > > > the `btrfs_file_read_iter` and `btrfs_file_write_iter` functions are > > > finally called. These two functions created a data race when they > > > finally called `inode_get_ctime_sec()` and `inode_set_ctime_to_ns()`. > > > The specific call stack that appears during testing is as follows: > > > > > > ============DATA_RACE============ > > > btrfs_delayed_update_inode+0x1f61/0x7ce0 [btrfs] > > > btrfs_update_inode+0x45e/0xbb0 [btrfs] > > > btrfs_dirty_inode+0x2b8/0x530 [btrfs] > > > btrfs_update_time+0x1ad/0x230 [btrfs] > > > touch_atime+0x211/0x440 > > > filemap_read+0x90f/0xa20 > > > btrfs_file_read_iter+0xeb/0x580 [btrfs] > > > aio_read+0x275/0x3a0 > > > io_submit_one+0xd22/0x1ce0 > > > __se_sys_io_submit+0xb3/0x250 > > > do_syscall_64+0xc1/0x190 > > > entry_SYSCALL_64_after_hwframe+0x77/0x7f > > > ============OTHER_INFO============ > > > btrfs_write_check+0xa15/0x1390 [btrfs] > > > btrfs_buffered_write+0x52f/0x29d0 [btrfs] > > > btrfs_do_write_iter+0x53d/0x1590 [btrfs] > > > btrfs_file_write_iter+0x41/0x60 [btrfs] > > > aio_write+0x41e/0x5f0 > > > io_submit_one+0xd42/0x1ce0 > > > __se_sys_io_submit+0xb3/0x250 > > > do_syscall_64+0xc1/0x190 > > > entry_SYSCALL_64_after_hwframe+0x77/0x7f > > > > > > To address this issue, it is recommended to add WRITE_ONCE > > > and READ_ONCE when reading or writing the `inode->i_ctime_sec` > > > and `inode->i_ctime_nsec` variable. > > > > Excuse my ignorance, but how exactly does this annotation fix the race? > > Does it prevent reordering of the memory accesses or something? "it is > > recommended" is not enough for dunces such as myself to figure out why > > this fixes the problem. :( > > > > It prevents the compiler from getting ideas. One hypothetical is > splitting the load from one asm op to several, possibly resulting a > corrupted value when racing against an update > > A not hypothethical concerns some like this: > time64_t sec = inode_get_ctime_sec(inode); > /* do stuff with sec */ > /* do other stuff */ > /* do more stuff sec */ > > The compiler might have decided to throw away the read sec value and do > the load again later. Thus if an update happened in the meantime then > the code ends up operating on 2 different values, even though the > programmer clearly intended it to be stable. <nod> I figured as much, but I wanted the commit message to spell that out for everybody, e.g. "Use READ_ONCE to force compilers to sample the ctime values one time only, and WRITE_ONCE to prevent the compiler from turning one line of code into multiple stores to the same address." > However, since both sec and nsec are updated separately and there is no > synchro, reading *both* can still result in values from 2 different > updates which is a bug not addressed by any of the above. To my > underestanding of the vfs folk take on it this is considered tolerable. <nod> This is the other thing -- the commit message refers to preventing a data race, but there is indeed nothing about access ordering that prevents /that/ race. Someone not an fs developer could interpret that as "the patch does not fully fix all possible problems" whereas the community has decided there's an acceptable tradeoff with not taking i_rwsem. --D