Re: Supporting FALLOC_FL_WRITE_ZEROES in NFS4.2 with WRITE_SAME?

On Tue, Mar 18, 2025 at 5:00 PM Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
>
> On Tue, 2025-03-18 at 16:52 -0700, Rick Macklem wrote:
> > On Tue, Mar 18, 2025 at 4:40 PM Trond Myklebust
> > <trondmy@xxxxxxxxxxxxxxx> wrote:
> > >
> > > On Tue, 2025-03-18 at 23:37 +0100, Lionel Cons wrote:
> > > > On Tue, 18 Mar 2025 at 22:17, Trond Myklebust
> > > > <trondmy@xxxxxxxxxxxxxxx> wrote:
> > > > >
> > > > > On Tue, 2025-03-18 at 14:03 -0700, Rick Macklem wrote:
> > > > > >
> > > > > > The problem I see is that WRITE_SAME isn't defined in a way
> > > > > > where the NFSv4 server can only implement zeroing and fail
> > > > > > the rest.
> > > > > > As such, I am thinking that a new operation for NFSv4.2 that
> > > > > > does writing of zeros might be preferable to trying to
> > > > > > (mis)use WRITE_SAME.
> > > > >
> > > > > Why wouldn't you just implement DEALLOCATE?
> > > > >
> > > >
> > > > Oh my god.
> > > >
> > > > NFSv4.2 DEALLOCATE creates a hole in a sparse file, and does NOT
> > > > write zeros.
> > > >
> > > > "Holes" in sparse files (as created by NFSv4.2 DEALLOCATE)
> > > > represent areas of "no data here". For backwards compatibility
> > > > these holes do not produce read errors; they just read as 0x00
> > > > bytes. But they represent ranges where simply no data are stored.
> > > > Valid data (from allocated data ranges) can be 0x00, but these
> > > > are NOT holes; they represent VALID DATA.
> > > >
> > > > This is an important difference!
> > > > For example, suppose we have one file per week, 700TB per file
> > > > (100TB per day). Each of those files starts as completely
> > > > unallocated space (one big hole). The data ranges are gradually
> > > > allocated by writes, and the position of the writes in the file
> > > > represents the time when the data were collected. If no data were
> > > > collected during a given period, that space remains unallocated
> > > > (a hole), so we can see whether anyone collected data in that
> > > > timeframe.
> > > >
> > > > Do you understand the difference?
> > > >
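As an aside, the hole-vs-data distinction described above is easy to
observe locally with lseek(SEEK_HOLE)/lseek(SEEK_DATA) and st_blocks.
A rough sketch, assuming Linux/FreeBSD lseek semantics and an arbitrary
scratch file name (the exact offsets reported can vary by file system):

/* Sketch only: contrast a hole with explicitly written zeros. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/stat.h>

int main(void)
{
    char zeros[65536];
    memset(zeros, 0, sizeof(zeros));

    int fd = open("scratch.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* 128K file: bytes 0-64K left as a hole, bytes 64K-128K written as zeros. */
    if (ftruncate(fd, 131072) < 0) { perror("ftruncate"); return 1; }
    if (pwrite(fd, zeros, sizeof(zeros), 65536) < 0) { perror("pwrite"); return 1; }

    /* SEEK_DATA skips the hole; SEEK_HOLE past the written zeros lands at EOF. */
    off_t data = lseek(fd, 0, SEEK_DATA);
    off_t hole = lseek(fd, 65536, SEEK_HOLE);

    struct stat sb;
    fstat(fd, &sb);
    printf("first data at %lld, next hole at %lld, st_blocks %lld\n",
           (long long)data, (long long)hole, (long long)sb.st_blocks);
    close(fd);
    return 0;
}

Both halves read back as 0x00 bytes, but only the written half shows up
as data and consumes blocks.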
> > >
> > > Yes. I do understand the difference, but in this case you're
> > > literally just talking about accounting. The sparse file created
> > > by DEALLOCATE does not need to allocate the blocks (except
> > > possibly at the edges). If you need to ensure that those empty
> > > blocks are allocated and accounted for, then a follow-up call to
> > > ALLOCATE will do that for you.
> > Unfortunately ZFS knows how to deallocate, but not how to allocate.
>
> So there is no support for the posix fallocate function? Well, in the
> worst case, your NFS server will just have to emulate it.
The NFS server cannot emulate it (writing a block of zeros with
write does not guarantee that a future write to the same byte range
won't fail with ENOSPC).
--> The FreeBSD NFSv4.2 server replies NFS4ERR_NOTSUPP.

The bothersome part is that there is a hint in the NFSv4.2 RFC that
support is "per server" and not "per file system". As such, the server
replies NFS4ERR_NOTSUPP even if there is a mix of UFS and ZFS
file systems exported. (UFS can do allocate.)

Now, what happens when fallocate() is attempted on ZFS?
It should fail, but I am not sure. If it succeeds, I would consider
that a bug.
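
From the client side, the simplest way to see what a given export ends
up supporting is to call fallocate(2) and look at the errno. A minimal
sketch, assuming a Linux client; the mount path is just a placeholder,
and I would expect NFS4ERR_NOTSUPP from the server to surface as
EOPNOTSUPP here:

/* Sketch: probe what an NFSv4.2 export supports via fallocate(2).
 * The path below is a placeholder. */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/nfs/testfile", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Plain allocation (mode 0) corresponds to the NFSv4.2 ALLOCATE op. */
    if (fallocate(fd, 0, 0, 1 << 20) < 0) {
        if (errno == EOPNOTSUPP)
            printf("allocation not supported on this export\n");
        else
            perror("fallocate");
    } else {
        printf("allocation supported\n");
    }

    /* Punching a hole corresponds to DEALLOCATE and may well succeed
     * even where plain allocation does not (the ZFS case above). */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, 1 << 20) < 0)
        perror("fallocate(PUNCH_HOLE)");
    else
        printf("hole punch (deallocation) supported\n");

    close(fd);
    return 0;
}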

>
> > >
> > > $ touch foo
> > > $ stat foo
> > >   File: foo
> > >   Size: 0               Blocks: 0          IO Block: 4096   regular empty file
> > > Device: 8,17    Inode: 410924125   Links: 1
> > > Access: (0644/-rw-r--r--)  Uid: (0/ root)   Gid: (0/ root)
> > > Context: unconfined_u:object_r:user_home_t:s0
> > > Access: 2025-03-18 19:26:24.113181341 -0400
> > > Modify: 2025-03-18 19:26:24.113181341 -0400
> > > Change: 2025-03-18 19:26:24.113181341 -0400
> > >  Birth: 2025-03-18 19:25:12.988344235 -0400
> > > $ truncate -s 1GiB foo
> > > $ stat foo
> > >   File: foo
> > >   Size: 1073741824      Blocks: 0          IO Block: 4096   regular file
> > > Device: 8,17    Inode: 410924125   Links: 1
> > > Access: (0644/-rw-r--r--)  Uid: (0/ root)   Gid: (0/ root)
> > > Context: unconfined_u:object_r:user_home_t:s0
> > > Access: 2025-03-18 19:26:24.113181341 -0400
> > > Modify: 2025-03-18 19:27:35.161694301 -0400
> > > Change: 2025-03-18 19:27:35.161694301 -0400
> > >  Birth: 2025-03-18 19:25:12.988344235 -0400
> > > $ fallocate -z -l 1GiB foo
> > > $ stat foo
> > >   File: foo
> > >   Size: 1073741824      Blocks: 2097152    IO Block: 4096   regular file
> > > Device: 8,17    Inode: 410924125   Links: 1
> > > Access: (0644/-rw-r--r--)  Uid: (0/ root)   Gid: (0/ root)
> > > Context: unconfined_u:object_r:user_home_t:s0
> > > Access: 2025-03-18 19:26:24.113181341 -0400
> > > Modify: 2025-03-18 19:27:54.462817356 -0400
> > > Change: 2025-03-18 19:27:54.462817356 -0400
> > >  Birth: 2025-03-18 19:25:12.988344235 -0400
> > >
> > >
> > > Yes, I also realise that none of the above operations actually
> > > resulted in blocks being physically filled with data, but all
> > > modern flash-based drives tend to have a log-structured FTL. So
> > > while overwriting data in the HDD era meant that you would usually
> > > (unless you had a log-based filesystem) overwrite the same physical
> > > space with data, today's drives are free to shift the rewritten
> > > block to any new physical location in order to ensure even wear
> > > levelling of the SSD.
> > Yea. The Wr_zero operation writes 0s to the logical block. Does that
> > guarantee there is no "old block for the logical block" that still
> > holds the data? (It does say Wr_zero can be used for secure erasure,
> > but??)
> >
> > Good question for which I don't have any idea what the answer is,
> > rick
>
> In both the above arguments, you are talking about specific filesystem
> implementation details that you'll also have to address with your new
> operation.
All the operation spec needs to say is "writes zeros to the file byte
range" and make it clear that the zeros written are data and not a hole.
(Whether or not there is hardware offload is a server implementation
detail.)
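
For comparison, the closest existing client-side interface on Linux is
fallocate(2) with FALLOC_FL_ZERO_RANGE. A rough sketch of the semantics
I mean (placeholder path, and no claim about how it would map onto the
wire): the range reads back as zeros and still counts as allocated data.

/* Sketch: zero a byte range so it reads back as 0x00 but remains
 * allocated data rather than becoming a hole.  Placeholder path. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>

int main(void)
{
    int fd = open("/mnt/export/datafile", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    /* Zero bytes [1MiB, 2MiB); contrast with FALLOC_FL_PUNCH_HOLE,
     * which would release the blocks and leave a hole instead. */
    if (fallocate(fd, FALLOC_FL_ZERO_RANGE, 1 << 20, 1 << 20) < 0) {
        perror("fallocate(ZERO_RANGE)");
        return 1;
    }

    struct stat sb;
    fstat(fd, &sb);
    printf("st_blocks after zeroing: %lld\n", (long long)sb.st_blocks);
    close(fd);
    return 0;
}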

As for guarantees w.r.t. data being overwritten, I think that would be
beyond what would be required.
(Data erasure is an interesting but different topic for which I do not
have any expertise.)

rick


>
> >
> > >
> > > IOW: there is no real advantage to physically writing out the data
> > > unless you have a peculiar interest in wasting time.
> > >
> > > --
> > > Trond Myklebust
> > > Linux NFS client maintainer, Hammerspace
> > > trond.myklebust@xxxxxxxxxxxxxxx
> > >
> > >
> >
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@xxxxxxxxxxxxxxx
>
>
