On Tue, 2025-03-18 at 23:37 +0100, Lionel Cons wrote:
> On Tue, 18 Mar 2025 at 22:17, Trond Myklebust
> <trondmy@xxxxxxxxxxxxxxx> wrote:
> >
> > On Tue, 2025-03-18 at 14:03 -0700, Rick Macklem wrote:
> > >
> > > The problem I see is that WRITE_SAME isn't defined in a way where
> > > the NFSv4 server can only implement zeroing and fail the rest.
> > > As such, I am thinking that a new operation for NFSv4.2 that does
> > > writing of zeros might be preferable to trying to (mis)use
> > > WRITE_SAME.
> >
> > Why wouldn't you just implement DEALLOCATE?
> >
>
> Oh my god.
>
> NFSv4.2 DEALLOCATE creates a hole in a sparse file, and does NOT
> write zeros.
>
> "Holes" in sparse files (as created by NFSv4.2 DEALLOCATE) represent
> areas of "no data here". For backwards compatibility these holes do
> not produce read errors, they just read as 0x00 bytes. But they
> represent ranges where no data are stored at all.
> Valid data (from allocated data ranges) can be 0x00, but those are
> NOT holes; they represent VALID DATA.
>
> This is an important difference!
> For example, if we have files, one per week, 700TB file size (100TB
> per day). Each of those files starts as completely unallocated space
> (one big hole). The data ranges are gradually allocated by writes,
> and the position of the writes in the files represents the time when
> they were collected. If no data were collected during that time, that
> space remains unallocated (a hole), so we can see whether someone
> collected data in that timeframe.
>
> Do you understand the difference?
>

Yes. I do understand the difference, but in this case you're literally
just talking about accounting. The sparse file created by DEALLOCATE
does not need to allocate the blocks (except possibly at the edges). If
you need to ensure that those empty blocks are allocated and accounted
for, then a follow-up call to ALLOCATE will do that for you.

$ touch foo
$ stat foo
  File: foo
  Size: 0               Blocks: 0          IO Block: 4096   regular empty file
Device: 8,17    Inode: 410924125   Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Context: unconfined_u:object_r:user_home_t:s0
Access: 2025-03-18 19:26:24.113181341 -0400
Modify: 2025-03-18 19:26:24.113181341 -0400
Change: 2025-03-18 19:26:24.113181341 -0400
 Birth: 2025-03-18 19:25:12.988344235 -0400
$ truncate -s 1GiB foo
$ stat foo
  File: foo
  Size: 1073741824      Blocks: 0          IO Block: 4096   regular file
Device: 8,17    Inode: 410924125   Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Context: unconfined_u:object_r:user_home_t:s0
Access: 2025-03-18 19:26:24.113181341 -0400
Modify: 2025-03-18 19:27:35.161694301 -0400
Change: 2025-03-18 19:27:35.161694301 -0400
 Birth: 2025-03-18 19:25:12.988344235 -0400
$ fallocate -z -l 1GiB foo
$ stat foo
  File: foo
  Size: 1073741824      Blocks: 2097152    IO Block: 4096   regular file
Device: 8,17    Inode: 410924125   Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Context: unconfined_u:object_r:user_home_t:s0
Access: 2025-03-18 19:26:24.113181341 -0400
Modify: 2025-03-18 19:27:54.462817356 -0400
Change: 2025-03-18 19:27:54.462817356 -0400
 Birth: 2025-03-18 19:25:12.988344235 -0400
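To make the mapping concrete, here is a rough local analogue of that
sequence using fallocate(1): punch a hole (what DEALLOCATE does), then
preallocate the same range again (what a follow-up ALLOCATE does). This
is only a sketch; the file name, offsets and sizes are arbitrary, and
the exact block accounting depends on the filesystem:

$ fallocate -l 1GiB bar              # preallocate 1GiB; no data physically written
$ fallocate -p -o 0 -l 512MiB bar    # punch a hole over the first half (DEALLOCATE analogue)
$ stat -c 'size=%s blocks=%b' bar    # block count drops: the punched range is now a hole
$ fallocate -o 0 -l 512MiB bar       # preallocate the range again (ALLOCATE analogue)
$ stat -c 'size=%s blocks=%b' bar    # blocks accounted for again, still without writing data
$ filefrag -v bar                    # extent map shows allocated (unwritten) extents, not holes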
Yes, I also realise that none of the above operations actually resulted
in blocks being physically filled with data, but all modern flash-based
drives tend to have a log-structured FTL. So while overwriting data in
the HDD era usually meant (unless you had a log-based filesystem)
overwriting the same physical space, today's drives are free to shift a
rewritten block to any new physical location in order to ensure even
wear levelling of the SSD. IOW: there is no real advantage to
physically writing out the data unless you have a peculiar interest in
wasting time.

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx
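If anyone wants to see what that "wasting time" looks like in practice,
a rough comparison on a scratch file is easy to run yourself; the
numbers will vary wildly with the device and filesystem, which is
rather the point (file name and size below are arbitrary):

$ time fallocate -z -l 1GiB scratch        # zero range: typically metadata-only, or offloaded to the device
$ time dd if=/dev/zero of=scratch bs=1M count=1024 oflag=direct conv=notrunc,fsync
                                           # actually pushes 1GiB of zeros down to the media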