Re: Supporting FALLOC_FL_WRITE_ZEROES in NFS4.2 with WRITE_SAME?

On Tue, 2025-03-18 at 16:52 -0700, Rick Macklem wrote:
> On Tue, Mar 18, 2025 at 4:40 PM Trond Myklebust
> <trondmy@xxxxxxxxxxxxxxx> wrote:
> > 
> > On Tue, 2025-03-18 at 23:37 +0100, Lionel Cons wrote:
> > > On Tue, 18 Mar 2025 at 22:17, Trond Myklebust
> > > <trondmy@xxxxxxxxxxxxxxx> wrote:
> > > > 
> > > > On Tue, 2025-03-18 at 14:03 -0700, Rick Macklem wrote:
> > > > > 
> > > > > The problem I see is that WRITE_SAME isn't defined in a way
> > > > > where the NFSv4 server can implement only the zeroing and
> > > > > fail the rest. As such, I am thinking that a new operation
> > > > > for NFSv4.2 that does writing of zeros might be preferable
> > > > > to trying to (mis)use WRITE_SAME.
> > > > 
> > > > Why wouldn't you just implement DEALLOCATE?
> > > > 
> > > 
> > > Oh my god.
> > > 
> > > NFSv4.2 DEALLOCATE creates a hole in a sparse file; it does NOT
> > > write zeros.
> > > 
> > > "Holes" in sparse files (as created by NFSv4.2 DEALLOCATE)
> > > represent areas of "no data here". For backwards compatibility
> > > these holes do not produce read errors, they just read as 0x00
> > > bytes. But they represent ranges where no data is stored at all.
> > > Valid data (from allocated data ranges) can be 0x00 too, but
> > > those ranges are NOT holes; they represent VALID DATA.
> > > 
> > > This is an important difference!
> > > For example, suppose we have files, one per week, 700TB in size
> > > (100TB per day). Each of those files starts as completely
> > > unallocated space (one big hole). The data ranges are gradually
> > > allocated by writes, and the position of a write within the file
> > > represents the time when the data was collected. If no data was
> > > collected during a given period, that space remains unallocated
> > > (a hole), so we can see whether someone collected data in that
> > > timeframe.
> > > 
> > > Do you understand the difference?
> > > 
> > 
> > Yes, I do understand the difference, but in this case you're
> > literally just talking about accounting. The sparse file created
> > by DEALLOCATE does not need to allocate the blocks (except
> > possibly at the edges). If you need to ensure that those empty
> > blocks are allocated and accounted for, then a follow-up call to
> > ALLOCATE will do that for you.
> Unfortunately ZFS knows how to deallocate, but not how to allocate.

So there is no support for the POSIX fallocate() function? Well, in
the worst case, your NFS server will just have to emulate it.
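Roughly speaking, such an emulation is just a loop that writes
zero-filled buffers over the requested range. A minimal user-space
sketch, assuming nothing more than pwrite(); the function name and
chunk size are illustrative, not taken from nfsd:

#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Illustrative fallback: zero [offset, offset+len) by writing real
 * zero bytes, for a backing filesystem with no allocate/zero-range
 * support of its own. */
static int emulate_zero_range(int fd, off_t offset, off_t len)
{
	static const char zeros[64 * 1024];	/* zero-initialized chunk */

	while (len > 0) {
		size_t chunk = len < (off_t)sizeof(zeros) ?
				(size_t)len : sizeof(zeros);
		ssize_t n = pwrite(fd, zeros, chunk, offset);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -errno;
		}
		offset += n;
		len -= n;
	}
	return 0;
}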

> > 
> > $ touch foo
> > $ stat foo
> >   File: foo
> >   Size: 0               Blocks: 0          IO Block: 4096   regular empty file
> > Device: 8,17    Inode: 410924125   Links: 1
> > Access: (0644/-rw-r--r--)  Uid: (0/ root)   Gid: (0/ root)
> > Context: unconfined_u:object_r:user_home_t:s0
> > Access: 2025-03-18 19:26:24.113181341 -0400
> > Modify: 2025-03-18 19:26:24.113181341 -0400
> > Change: 2025-03-18 19:26:24.113181341 -0400
> >  Birth: 2025-03-18 19:25:12.988344235 -0400
> > $ truncate -s 1GiB foo
> > $ stat foo
> >   File: foo
> >   Size: 1073741824      Blocks: 0          IO Block: 4096   regular file
> > Device: 8,17    Inode: 410924125   Links: 1
> > Access: (0644/-rw-r--r--)  Uid: (0/ root)   Gid: (0/ root)
> > Context: unconfined_u:object_r:user_home_t:s0
> > Access: 2025-03-18 19:26:24.113181341 -0400
> > Modify: 2025-03-18 19:27:35.161694301 -0400
> > Change: 2025-03-18 19:27:35.161694301 -0400
> >  Birth: 2025-03-18 19:25:12.988344235 -0400
> > $ fallocate -z -l 1GiB foo
> > $ stat foo
> >   File: foo
> >   Size: 1073741824      Blocks: 2097152    IO Block: 4096   regular file
> > Device: 8,17    Inode: 410924125   Links: 1
> > Access: (0644/-rw-r--r--)  Uid: (0/ root)   Gid: (0/ root)
> > Context: unconfined_u:object_r:user_home_t:s0
> > Access: 2025-03-18 19:26:24.113181341 -0400
> > Modify: 2025-03-18 19:27:54.462817356 -0400
> > Change: 2025-03-18 19:27:54.462817356 -0400
> >  Birth: 2025-03-18 19:25:12.988344235 -0400
> > 
> > 
> > Yes, I also realise that none of the above operations actually
> > resulted in blocks being physically filled with data, but all
> > modern flash-based drives tend to have a log-structured FTL. So
> > while overwriting data in the HDD era meant that you would usually
> > (unless you had a log-based filesystem) overwrite the same
> > physical space with data, today's drives are free to shift the
> > rewritten block to any new physical location in order to ensure
> > even wear levelling of the SSD.
> Yea. The Wr_zero operation writes 0s to the logical block. Does that
> guarantee there is no "old block for the logical block" that still
> holds the data? (It does say Wr_zero can be used for secure erasure,
> but??)
> 
> Good question for which I don't have any idea what the answer is,
> rick

In both the above arguments, you are talking about specific filesystem
implementation details that you'll also have to address with your new
operation.
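For reference, the userspace-visible difference shown in the quoted
transcript above is roughly the following on Linux (a sketch only;
fallocate -z maps to FALLOC_FL_ZERO_RANGE, and error handling is
trimmed for brevity):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/falloc.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	off_t len = 1073741824;	/* 1 GiB, as in the transcript */
	int fd = open("foo", O_RDWR | O_CREAT, 0644);

	if (fd < 0)
		return 1;

	/* "truncate -s 1GiB foo": Size grows, Blocks stays 0 (one big hole) */
	if (ftruncate(fd, len) < 0)
		perror("ftruncate");

	/* "fallocate -z -l 1GiB foo": blocks become allocated and read back
	 * as zeros, whether or not the device physically wrote them */
	if (fallocate(fd, FALLOC_FL_ZERO_RANGE, 0, len) < 0)
		perror("fallocate");

	close(fd);
	return 0;
}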

> 
> > 
> > IOW: there is no real advantage to physically writing out the data
> > unless you have a peculiar interest in wasting time.
> > 
> > --
> > Trond Myklebust
> > Linux NFS client maintainer, Hammerspace
> > trond.myklebust@xxxxxxxxxxxxxxx
> > 
> > 
> 

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx





