Re: [PATCH v4] fs: Fix data race in inode_set_ctime_to_ts

Mateusz Guzik <mjguzik@xxxxxxxxx> · Sun, 24 Nov 2024 18:56:57 +0100

On Sun, Nov 24, 2024 at 09:44:35AM -0800, Darrick J. Wong wrote:
> On Sun, Nov 24, 2024 at 05:42:53PM +0800, Hao-ran Zheng wrote:
> > A data race may occur when the function `inode_set_ctime_to_ts()` and
> > the function `inode_get_ctime_sec()` are executed concurrently. When
> > two threads call `aio_read` and `aio_write` respectively, they will
> > be distributed to the read and write functions of the corresponding
> > file system respectively. Taking the btrfs file system as an example,
> > the `btrfs_file_read_iter` and `btrfs_file_write_iter` functions are
> > finally called. These two functions created a data race when they
> > finally called `inode_get_ctime_sec()` and `inode_set_ctime_to_ns()`.
> > The specific call stack that appears during testing is as follows:
> > 
> > ============DATA_RACE============
> > btrfs_delayed_update_inode+0x1f61/0x7ce0 [btrfs]
> > btrfs_update_inode+0x45e/0xbb0 [btrfs]
> > btrfs_dirty_inode+0x2b8/0x530 [btrfs]
> > btrfs_update_time+0x1ad/0x230 [btrfs]
> > touch_atime+0x211/0x440
> > filemap_read+0x90f/0xa20
> > btrfs_file_read_iter+0xeb/0x580 [btrfs]
> > aio_read+0x275/0x3a0
> > io_submit_one+0xd22/0x1ce0
> > __se_sys_io_submit+0xb3/0x250
> > do_syscall_64+0xc1/0x190
> > entry_SYSCALL_64_after_hwframe+0x77/0x7f
> > ============OTHER_INFO============
> > btrfs_write_check+0xa15/0x1390 [btrfs]
> > btrfs_buffered_write+0x52f/0x29d0 [btrfs]
> > btrfs_do_write_iter+0x53d/0x1590 [btrfs]
> > btrfs_file_write_iter+0x41/0x60 [btrfs]
> > aio_write+0x41e/0x5f0
> > io_submit_one+0xd42/0x1ce0
> > __se_sys_io_submit+0xb3/0x250
> > do_syscall_64+0xc1/0x190
> > entry_SYSCALL_64_after_hwframe+0x77/0x7f
> > 
> > To address this issue, it is recommended to add WRITE_ONCE
> > and READ_ONCE when reading or writing the `inode->i_ctime_sec`
> > and `inode->i_ctime_nsec` variable.
> 
> Excuse my ignorance, but how exactly does this annotation fix the race?
> Does it prevent reordering of the memory accesses or something?  "it is
> recommended" is not enough for dunces such as myself to figure out why
> this fixes the problem. :(
> 

It prevents the compiler from getting ideas. One hypothetical is
splitting the load from one asm op to several, possibly resulting a
corrupted value when racing against an update

A not hypothethical concerns some like this:
	time64_t sec = inode_get_ctime_sec(inode);
	/* do stuff with sec */
	/* do other stuff */
	/* do more stuff sec */

The compiler might have decided to throw away the read sec value and do
the load again later. Thus if an update happened in the meantime then
the code ends up operating on 2 different values, even though the
programmer clearly intended it to be stable.

However, since both sec and nsec are updated separately and there is no
synchro, reading *both* can still result in values from 2 different
updates which is a bug not addressed by any of the above. To my
underestanding of the vfs folk take on it this is considered tolerable.