[Sorry, missed this when you sent it and I think it deserves a
response...]

On Fri, Jan 26, 2024 at 10:41:14PM +0000, Matthew Wilcox wrote:
> On Sat, Jan 27, 2024 at 09:32:42AM +1100, Dave Chinner wrote:
> > > but the problem is that Microsoft half-arsed their support for holes.
> > > See my other mail in this thread.
> > 
> > Why does that matter? It's exactly the same problem with any other
> > filesystem that doesn't support sparse files.
> > 
> > All I said is that IO operations beyond the "valid size" should
> > be treated like operating in a hole - I pass no judgement on the
> > filesystem design, implementation or level of sparse file support
> > it has. All it needs to do is treat the "not valid" size range as if
> > it was a hole or unwritten, regardless of whether the file is sparse
> > or not....
> > 
> > > truncate the file up to 4TB
> > > write a byte at offset 3TB
> > > 
> > > ... now we have to stream 3TB of zeroes through the page cache so that
> > > we can write the byte at 3TB.
> > 
> > This behaviour cannot be avoided on filesystems without sparse file
> > support - the hit of writing zeroes has to be taken somewhere. We
> > can handle this in truncate(), the write() path or in ->page_mkwrite
> > *if* the zeroing condition is hit. There's no need to do it at
> > mmap() time if that range of the file is not actually written to by
> > the application...
> 
> It's really hard to return -ENOSPC from ->page_mkwrite. At best you'll
> get a SIGBUS or SEGV. So truncate() or mmap() are the only two places to
> do it. And if we do it in truncate() then we can't take any advantage
> of the limited "hole" support the filesystem has.

Yes, but the entire point of ->page_mkwrite was to allow the
filesystems to abort the user data modification if they were at
ENOSPC when the page fault happened. This is way better than getting
into trouble due to space overcommit caused by ignoring ENOSPC
situations during page faults.

This was a significant problem for XFS users back in the mid-2000s
before ->page_mkwrite existed, because both delayed allocation and
writes over unwritten extents required ENOSPC to be determined at
page fault time. It was too late if ENOSPC happened at writeback
time or IO completion - this was a significant data loss vector, and
if we tried to prevent data loss by redirtying the pages then we'd
lock the machine up because dirty pages could not be cleaned and
memory reclaim couldn't make progress...

IOWs, it is far better to immediately terminate the application than
it is to have silent data loss or completely lock the machine up.
Hence the defined semantics of ->page_mkwrite is to send SIGBUS to
the application on ENOSPC.

It's not an amazing solution but, in reality, it is little different
to the OOM killer triggering when memory reclaim declares ENOMEM. We
can't allow the operation to continue, and we can't return an error
for the application to handle. So we kill it, just like the OOM
killer does in the same situation for ENOMEM.

We even have an fstest that explicitly exercises this case. i.e.
generic/619 creates this mmap() into sparse files situation and then
checks that the filesystem really is at ENOSPC when the page fault
throws a SIGBUS out because ->page_mkwrite failed due to ENOSPC....

> Most files are never mmaped, much less mapped writable. I think doing
> it in mmap() is the best of several bad options.

Perhaps so, but that makes it unnecessarily different to the major
Linux filesystems....

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
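
P.S. for anyone following along who hasn't been down in this code: the
sketch below shows the ENOSPC-to-SIGBUS path being discussed. It is
illustrative only, not any real filesystem's handler -
myfs_reserve_blocks() is a hypothetical stand-in for a filesystem's
actual delalloc/unwritten extent reservation, and real handlers
(xfs_filemap_page_mkwrite() and the ext4/iomap equivalents) do
considerably more (stable page waits, transactions, etc.):

	#include <linux/fs.h>
	#include <linux/mm.h>
	#include <linux/pagemap.h>

	/* hypothetical: reserve space to write back [pos, pos + len) */
	int myfs_reserve_blocks(struct inode *inode, loff_t pos, size_t len);

	static vm_fault_t myfs_page_mkwrite(struct vm_fault *vmf)
	{
		struct folio *folio = page_folio(vmf->page);
		struct inode *inode = file_inode(vmf->vma->vm_file);
		vm_fault_t ret = VM_FAULT_LOCKED;
		int error;

		sb_start_pagefault(inode->i_sb);
		file_update_time(vmf->vma->vm_file);

		folio_lock(folio);
		if (folio->mapping != inode->i_mapping) {
			/* the folio was truncated away while we waited */
			folio_unlock(folio);
			ret = VM_FAULT_NOPAGE;
			goto out;
		}

		/*
		 * Reserve the space needed to write this folio back *now*,
		 * while we can still fail the fault. At ENOSPC,
		 * vmf_fs_error() turns the error into VM_FAULT_SIGBUS
		 * rather than letting the task dirty a page that can
		 * never be cleaned.
		 */
		error = myfs_reserve_blocks(inode, folio_pos(folio),
					    folio_size(folio));
		if (error) {
			folio_unlock(folio);
			ret = vmf_fs_error(error);
			goto out;
		}

		folio_mark_dirty(folio);
		/* return with the folio locked; the fault path unlocks it */
	out:
		sb_end_pagefault(inode->i_sb);
		return ret;
	}

The key point is that the reservation happens in the fault path: fail
there and the task gets SIGBUS; succeed and writeback can never hit
ENOSPC for this folio.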