[Sorry, missed this when you sent it and I think it deserves a
response...]

On Fri, Jan 26, 2024 at 10:41:14PM +0000, Matthew Wilcox wrote:
> On Sat, Jan 27, 2024 at 09:32:42AM +1100, Dave Chinner wrote:
> > > but the problem is that Microsoft half-arsed their support for holes.
> > > See my other mail in this thread.
> > 
> > Why does that matter? It's exactly the same problem with any other
> > filesystem that doesn't support sparse files.
> > 
> > All I said is that IO operations beyond the "valid size" should
> > be treated like operating in a hole - I pass no judgement on the
> > filesystem design, implementation or level of sparse file support
> > it has. All it needs to do is treat the "not valid" size range as if
> > it was a hole or unwritten, regardless of whether the file is sparse
> > or not....
> > 
> > > truncate the file up to 4TB
> > > write a byte at offset 3TB
> > > 
> > > ... now we have to stream 3TB of zeroes through the page cache so that
> > > we can write the byte at 3TB.
> > 
> > This behaviour cannot be avoided on filesystems without sparse file
> > support - the hit of writing zeroes has to be taken somewhere. We
> > can handle this in truncate(), the write() path or in ->page_mkwrite
> > *if* the zeroing condition is hit. There's no need to do it at
> > mmap() time if that range of the file is not actually written to by
> > the application...
> 
> It's really hard to return -ENOSPC from ->page_mkwrite. At best you'll
> get a SIGBUS or SEGV. So truncate() or mmap() are the only two places to
> do it. And if we do it in truncate() then we can't take any advantage
> of the limited "hole" support the filesystem has.

Yes, but the entire point of ->page_mkwrite was to allow the
filesystems to abort the user data modification if they were at
ENOSPC when the page fault happened. This is way better than getting
into trouble due to space overcommit caused by ignoring ENOSPC
situations during page faults.

This was a significant problem for XFS users back in the mid-2000s
before ->page_mkwrite existed, because both delayed allocation and
writes over unwritten extents required ENOSPC to be determined at
page fault time. It was too late if ENOSPC happened at writeback
time or IO completion - this was a significant data loss vector, and
if we tried to prevent data loss by redirtying the pages then we'd
lock the machine up because dirty pages could not be cleaned and
memory reclaim couldn't make progress...

IOWs, it is far better to immediately terminate the application than
it is to have silent data loss or completely lock the machine up.
Hence the defined semantics of ->page_mkwrite is to send SIGBUS to
the application on ENOSPC.

It's not an amazing solution but, in reality, it is little different
to the OOM killer triggering when memory reclaim declares ENOMEM. We
can't allow the operation to continue, and we can't return an error
for the application to handle. So we kill it, just like the OOM
killer does in the same situation for ENOMEM.

We even have an fstest that explicitly exercises this case. i.e.
generic/619 creates this mmap() into sparse files situation and then
checks that the filesystem really is at ENOSPC when the page fault
throws a SIGBUS out because ->page_mkwrite failed due to ENOSPC....

> Most files are never mmaped, much less mapped writable. I think doing
> it in mmap() is the best of several bad options.

Perhaps so, but that makes it unnecessarily different to the major
Linux filesystems....

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
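
P.S. for anyone following along who hasn't been down in this code: the
sketch below shows the ENOSPC-to-SIGBUS path being discussed. It is
illustrative only, not any real filesystem's handler -
myfs_reserve_blocks() is a hypothetical stand-in for a filesystem's
actual delalloc/unwritten extent reservation, and real handlers
(xfs_filemap_page_mkwrite() and the ext4/iomap equivalents) do
considerably more (stable page waits, transactions, etc.):

	#include <linux/fs.h>
	#include <linux/mm.h>
	#include <linux/pagemap.h>

	/* hypothetical: reserve space to write back [pos, pos + len) */
	int myfs_reserve_blocks(struct inode *inode, loff_t pos, size_t len);

	static vm_fault_t myfs_page_mkwrite(struct vm_fault *vmf)
	{
		struct folio *folio = page_folio(vmf->page);
		struct inode *inode = file_inode(vmf->vma->vm_file);
		vm_fault_t ret = VM_FAULT_LOCKED;
		int error;

		sb_start_pagefault(inode->i_sb);
		file_update_time(vmf->vma->vm_file);

		folio_lock(folio);
		if (folio->mapping != inode->i_mapping) {
			/* the folio was truncated away while we waited */
			folio_unlock(folio);
			ret = VM_FAULT_NOPAGE;
			goto out;
		}

		/*
		 * Reserve the space needed to write this folio back *now*,
		 * while we can still fail the fault. At ENOSPC,
		 * vmf_fs_error() turns the error into VM_FAULT_SIGBUS
		 * rather than letting the task dirty a page that can
		 * never be cleaned.
		 */
		error = myfs_reserve_blocks(inode, folio_pos(folio),
					    folio_size(folio));
		if (error) {
			folio_unlock(folio);
			ret = vmf_fs_error(error);
			goto out;
		}

		folio_mark_dirty(folio);
		/* return with the folio locked; the fault path unlocks it */
	out:
		sb_end_pagefault(inode->i_sb);
		return ret;
	}

The key point is that the reservation happens in the fault path: fail
there and the task gets SIGBUS; succeed and writeback can never hit
ENOSPC for this folio.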