Re: Supporting FALLOC_FL_WRITE_ZEROES in NFS4.2 with WRITE_SAME?

On Tue, Mar 18, 2025 at 6:32 PM Rick Macklem <rick.macklem@xxxxxxxxx> wrote:
>
> On Tue, Mar 18, 2025 at 5:00 PM Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> >
> > On Tue, 2025-03-18 at 16:52 -0700, Rick Macklem wrote:
> > > On Tue, Mar 18, 2025 at 4:40 PM Trond Myklebust
> > > <trondmy@xxxxxxxxxxxxxxx> wrote:
> > > >
> > > > On Tue, 2025-03-18 at 23:37 +0100, Lionel Cons wrote:
> > > > > On Tue, 18 Mar 2025 at 22:17, Trond Myklebust
> > > > > <trondmy@xxxxxxxxxxxxxxx> wrote:
> > > > > >
> > > > > > On Tue, 2025-03-18 at 14:03 -0700, Rick Macklem wrote:
> > > > > > >
> > > > > > > The problem I see is that WRITE_SAME isn't defined in a way
> > > > > > > where the NFSv4 server can implement only the zeroing and fail
> > > > > > > the rest. As such, I am thinking that a new operation for
> > > > > > > NFSv4.2 that does writing of zeros might be preferable to
> > > > > > > trying to (mis)use WRITE_SAME.
> > > > > >
> > > > > > Why wouldn't you just implement DEALLOCATE?
> > > > > >
> > > > >
> > > > > Oh my god.
> > > > >
> > > > > NFSv4.2 DEALLOCATE creates a hole in a sparse file, and does NOT
> > > > > write zeros.
> > > > >
> > > > > "holes" in sparse files (as created by NFSV4.2 DEALLOCATE)
> > > > > represent
> > > > > areas of "no data here". For backwards compatibility these holes
> > > > > do
> > > > > not produce read errors, they just read as 0x00 bytes. But they
> > > > > represent ranges where just no data are stored.
> > > > > Valid data (from allocated data ranges) can be 0x00, but there
> > > > > are
> > > > > NOT
> > > > > holes, they represent VALID DATA.
> > > > >
> > > > > This is an important difference!
> > > > > For example, suppose we have files, one per week, each 700TB in
> > > > > size (100TB per day). Each of those files starts as completely
> > > > > unallocated space (one big hole). The data ranges are gradually
> > > > > allocated by writes, and the position of the writes in the file
> > > > > represents the time when the data were collected. If no data were
> > > > > collected during that time, that space remains unallocated (a
> > > > > hole), so we can see whether someone collected data in that
> > > > > timeframe.
> > > > >
> > > > > Do you understand the difference?
> > > > >
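[Editorial note: a minimal sketch of the distinction Lionel is making,
expressed with local Linux syscalls rather than NFS operations (my
example, not from the thread). A hole and a range of written 0x00 bytes
read back identically, but only the written range is reported as data
and charged to st_blocks. Assumes a filesystem with real
SEEK_HOLE/SEEK_DATA support, e.g. ext4 or XFS.]

/* hole-vs-zero-data sketch; compile with: cc -o demo demo.c */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
        off_t len = 1 << 20;                    /* 1 MiB per range */
        char *zeros = calloc(1, len);
        struct stat st;
        int fd = open("demo.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);

        if (fd < 0 || !zeros || ftruncate(fd, 2 * len) < 0)
                exit(1);

        /* Range [0, len): a hole -- "no data here" (what DEALLOCATE
         * expresses).  Range [len, 2*len): valid data that happens to
         * be all 0x00 bytes. */
        if (pwrite(fd, zeros, len, len) != len)
                exit(1);
        fsync(fd);

        fstat(fd, &st);
        printf("first data at %lld, first hole at %lld, blocks %lld\n",
               (long long)lseek(fd, 0, SEEK_DATA),  /* -> len (the written range) */
               (long long)lseek(fd, 0, SEEK_HOLE),  /* -> 0 (offset 0 is a hole)  */
               (long long)st.st_blocks);            /* only the written MiB counts */
        close(fd);
        return 0;
}
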
> > > >
> > > > Yes. I do understand the difference, but in this case you're
> > > > literally just talking about accounting. The sparse file created by
> > > > DEALLOCATE does not need to allocate the blocks (except possibly at
> > > > the edges). If you need to ensure that those empty blocks are
> > > > allocated and accounted for, then a follow up call to ALLOCATE will
> > > > do that for you.
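
[Editorial note: for concreteness, a sketch of the sequence Trond
describes as it would be issued from a Linux client (my code, not from
the thread). The mapping assumed here -- FALLOC_FL_PUNCH_HOLE to
NFSv4.2 DEALLOCATE and a plain fallocate() to ALLOCATE -- reflects my
understanding of the Linux client and should be treated as an
assumption.]

#define _GNU_SOURCE
#include <fcntl.h>

/* Zero a byte range and leave its blocks allocated and accounted for. */
static int zero_and_account(int fd, off_t off, off_t len)
{
        /* Step 1 (DEALLOCATE): the range now reads back as zeros, cheaply. */
        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      off, len) < 0)
                return -1;

        /* Step 2 (ALLOCATE): reserve the blocks again, so the range is
         * accounted for and later writes into it should not fail for
         * lack of space.  This is the call a ZFS-backed server cannot
         * honour, as discussed below. */
        return fallocate(fd, 0, off, len);
}
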
> > > Unfortunately ZFS knows how to deallocate, but not how to allocate.
> >
> > So there is no support for the posix_fallocate() function? Well, in the
> > worst case, your NFS server will just have to emulate it.
> The NFS server cannot emulate it (writing a block of zeros with
> write does not guarantee that a future write of the same byte range
> won't fail with ENOSPC).
> --> The FreeBSD NFSv4.2 server replies NFS4ERR_NOTSUPP.
>
> The bothersome part is that there is a hint in the NFSv4.2 RFC that
> support is "per server" and not "per file system". As such, the server
> replies NFS4ERR_NOTSUPP even if there is a mix of UFS and ZFS
> file systems exported. (UFS can do allocate.)
>
> Now, what happens when fallocate() is attempted on ZFS?
> It should fail, but I am not sure. If it succeeds, I would consider
> that a bug.
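
[Editorial note: a sketch of the probe Rick is asking about (the helper
name is invented here). It uses Linux fallocate(2) directly rather than
posix_fallocate(3), because glibc's posix_fallocate() is documented to
fall back to writing data when the kernel reports EOPNOTSUPP -- exactly
the emulation that cannot guarantee later writes won't hit ENOSPC on a
copy-on-write filesystem. On FreeBSD the analogous probe would be
posix_fallocate(2) itself, which I believe fails rather than emulating
on ZFS.]

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 only if the filesystem really reserved the space. */
static int fs_supports_allocate(int fd, off_t off, off_t len)
{
        if (fallocate(fd, 0, off, len) == 0)
                return 1;                       /* genuine preallocation */

        if (errno == EOPNOTSUPP)
                fprintf(stderr, "no preallocation support on this fs\n");
        else
                fprintf(stderr, "fallocate: %s\n", strerror(errno));
        return 0;                               /* e.g. EOPNOTSUPP, ENOSPC */
}
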
>
> >
> > > >
> > > > $ touch foo
> > > > $ stat foo
> > > >   File: foo
> > > >   Size: 0               Blocks: 0          IO Block: 4096   regular
> > > > empty file
> > > > Device: 8,17    Inode: 410924125   Links: 1
> > > > Access: (0644/-rw-r--r--)  Uid: (0/ root)   Gid: (0/ root)
> > > > Context: unconfined_u:object_r:user_home_t:s0
> > > > Access: 2025-03-18 19:26:24.113181341 -0400
> > > > Modify: 2025-03-18 19:26:24.113181341 -0400
> > > > Change: 2025-03-18 19:26:24.113181341 -0400
> > > >  Birth: 2025-03-18 19:25:12.988344235 -0400
> > > > $ truncate -s 1GiB foo
> > > > $ stat foo
> > > >   File: foo
> > > >   Size: 1073741824      Blocks: 0          IO Block: 4096   regular
> > > > file
> > > > Device: 8,17    Inode: 410924125   Links: 1
> > > > Access: (0644/-rw-r--r--)  Uid: (0/ root)   Gid: (0/ root)
> > > > Context: unconfined_u:object_r:user_home_t:s0
> > > > Access: 2025-03-18 19:26:24.113181341 -0400
> > > > Modify: 2025-03-18 19:27:35.161694301 -0400
> > > > Change: 2025-03-18 19:27:35.161694301 -0400
> > > >  Birth: 2025-03-18 19:25:12.988344235 -0400
> > > > $ fallocate -z -l 1GiB foo
> > > > $ stat foo
> > > >   File: foo
> > > >   Size: 1073741824      Blocks: 2097152    IO Block: 4096   regular
> > > > file
> > > > Device: 8,17    Inode: 410924125   Links: 1
> > > > Access: (0644/-rw-r--r--)  Uid: (0/ root)   Gid: (0/ root)
> > > > Context: unconfined_u:object_r:user_home_t:s0
> > > > Access: 2025-03-18 19:26:24.113181341 -0400
> > > > Modify: 2025-03-18 19:27:54.462817356 -0400
> > > > Change: 2025-03-18 19:27:54.462817356 -0400
> > > >  Birth: 2025-03-18 19:25:12.988344235 -0400
> > > >
> > > >
> > > > Yes, I also realise that none of the above operations actually
> > > > resulted in blocks being physically filled with data, but all modern
> > > > flash-based drives tend to have a log-structured FTL. So while
> > > > overwriting data in the HDD era meant that you would usually (unless
> > > > you had a log-based filesystem) overwrite the same physical space
> > > > with data, today's drives are free to shift the rewritten block to
> > > > any new physical location in order to ensure even wear levelling of
> > > > the SSD.
> > > Yea. The Wr_zero operation writes 0s to the logical block. Does that
> > > guarantee there is no "old block for the logical block" that still
> > > holds the data? (It does say Wr_zero can be used for secure erasure,
> > > but??)
> > >
> > > Good question for which I don't have any idea what the answer is,
> > > rick
> >
> > In both the above arguments, you are talking about specific filesystem
> > implementation details that you'll also have to address with your new
> > operation.
> All the operation spec needs to say is "writes zeros to the file byte range"
> and make it clear that the 0s written are data and not a hole.
> (Whether or not there is hardware offload is a server implementation
> detail.)
Oh, and it would also need to deal with the case of "partial success".
(In the NFSv4.2 tradition, it could return the number of bytes of zeros written.)
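
[Editorial note: purely for concreteness, a hypothetical sketch of what
such an operation's arguments and results might look like, loosely
following the existing ALLOCATE/WRITE pattern. The names and field
shapes are invented here and do not come from any RFC or draft.]

#include <stdint.h>

/* Simplified stand-ins for the usual NFSv4 XDR types. */
typedef uint64_t offset4;
typedef uint64_t length4;
typedef struct { uint32_t seqid; uint8_t other[12]; } stateid4;
typedef enum { UNSTABLE4 = 0, DATA_SYNC4 = 1, FILE_SYNC4 = 2 } stable_how4;

/* Hypothetical WRITE_ZEROES: operates on the current filehandle. */
struct WRITE_ZEROES4args {
        stateid4    wz_stateid;   /* open/lock state, as for WRITE */
        offset4     wz_offset;    /* start of the byte range to zero */
        length4     wz_length;    /* bytes of zeros requested */
};

struct WRITE_ZEROES4resok {
        length4     wz_written;   /* bytes actually zeroed; may be short,
                                     giving the "partial success" case */
        stable_how4 wz_committed; /* stability of the zeroed data */
};
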

rick

>
> As for guarantees w.r.t. data being overwritten, I think that would be
> beyond what would be required.
> (Data erasure is an interesting but different topic for which I do not
> have any expertise.)
>
> rick
>
>
> >
> > >
> > > >
> > > > IOW: there is no real advantage to physically writing out the data
> > > > unless you have a peculiar interest in wasting time.
> > > >
> > > > --
> > > > Trond Myklebust
> > > > Linux NFS client maintainer, Hammerspace
> > > > trond.myklebust@xxxxxxxxxxxxxxx
> > > >
> > > >
> > >
> >
> > --
> > Trond Myklebust
> > Linux NFS client maintainer, Hammerspace
> > trond.myklebust@xxxxxxxxxxxxxxx
> >
> >




