Re: [LSF/MM/BPF TOPIC] Measuring limits and enhancing buffered IO

Kent Overstreet <kent.overstreet@xxxxxxxxx> · Sun, 25 Feb 2024 20:58:28 -0500

On Sun, Feb 25, 2024 at 05:32:14PM -0800, Linus Torvalds wrote:
> On Sun, 25 Feb 2024 at 17:03, Kent Overstreet <kent.overstreet@xxxxxxxxx> wrote:
> >
> > We could satisfy the posix atomic writes rule by just having a properly
> > vectorized buffered write path, no need for the inode lock - it really
> > should just be extending writes that have to hit the inode lock, same as
> > O_DIRECT.
> >
> > (whenever people bring up range locks, I keep trying to tell them - we
> > already have that in the form of the folio lock, if you'd just use it
> > properly...)
> 
> Sadly, that is *technically* not proper.
> 
> IOW, I actually agree with you that the folio lock is sufficient, and
> several filesystems do too.
> 
> BUT.
> 
> Technically, the POSIX requirements are that the atomicity of writes
> are "all or nothing" being visible, and so ext4, for example, will
> have the whole write operation inside the inode_lock.

...

> (It's not just ext2. It's all the old filesystems: anything that uses
> generic_file_write_iter() without doing inode_lock/unlock around it,
> which is actually most of them).

According to my reading just now, ext4 and btrfs (as well as bcachefs)
also don't take the inode lock in the read path - xfs is the only one
that does.

Perhaps we should just lift it to the VFS and make it controllable as a
mount/open option. As nice a property as it is in theory, I can't see
myself wanting to make everyone pay for it if ext4 and btrfs aren't
doing it and no one's screaming.

I think write vs. write consistency is the more interesting case; the
question there is whether falling back to the inode lock works when we
can't lock all the folios simultaneously.

Consider thread A doing a 1 MB write, and it ends up in the path where
it locks the inode and it's allowed to write one folio at a time.

Then you have thread B doing some form of overlapping write, but without
the inode lock, and with all the folios locked simultaneously.
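
To make the two paths concrete, here is a minimal user-space sketch in
Python, not kernel code. File, try_lock_range, buffered_write and
MAX_LOCKED_FOLIOS are hypothetical names, and the fixed cap is only an
assumed stand-in for whatever makes locking every folio at once
impossible. It shows which locks each path holds; it does not try to
prove the ordering argument.

```python
import threading

# Hypothetical cap on how many folios one write may hold locked at once.
MAX_LOCKED_FOLIOS = 4


class File:
    """Toy model: one inode lock, one lock and one data slot per folio."""

    def __init__(self, nr_folios):
        self.inode_lock = threading.Lock()
        self.folio_locks = [threading.Lock() for _ in range(nr_folios)]
        self.data = [None] * nr_folios


def try_lock_range(f, lo, hi):
    """Trylock folios [lo, hi) in ascending order.

    Returns True with all of them held, or False with none held.
    """
    if hi - lo > MAX_LOCKED_FOLIOS:
        return False
    for i in range(lo, hi):
        if not f.folio_locks[i].acquire(blocking=False):
            # Back out: drop everything taken so far.
            for j in range(lo, i):
                f.folio_locks[j].release()
            return False
    return True


def buffered_write(f, lo, hi, tag):
    """Store tag in folios [lo, hi) and report which path was taken."""
    if try_lock_range(f, lo, hi):
        # Thread B's path: all folios locked at once, no inode lock.
        for i in range(lo, hi):
            f.data[i] = tag
        for i in range(lo, hi):
            f.folio_locks[i].release()
        return "folio-locks"
    # Thread A's path: inode lock held, one folio locked at a time.
    with f.inode_lock:
        for i in range(lo, hi):
            with f.folio_locks[i]:
                f.data[i] = tag
    return "inode-lock"


if __name__ == "__main__":
    f = File(8)
    print(buffered_write(f, 0, 8, "A"))  # range exceeds the cap
    print(buffered_write(f, 2, 5, "B"))  # small enough to lock at once
    print(f.data)
```

Run sequentially like this, the first call takes the inode-lock path and
the second takes the folio-lock path; the interesting behaviour is what
happens when the two run concurrently.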

I think everything works; we need the end result to be consistent with
some total ordering of all the writes. IOW, thread B's write (if fully
within thread A's write) should be either fully overwritten or not
overwritten at all, and that clearly is the case. But there may be
situations involving more than two threads where things get weirder.