Re: Supporting FALLOC_FL_WRITE_ZEROES in NFS4.2 with WRITE_SAME?

On Tue, Mar 18, 2025 at 4:40 PM Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
>
> On Tue, 2025-03-18 at 23:37 +0100, Lionel Cons wrote:
> > On Tue, 18 Mar 2025 at 22:17, Trond Myklebust
> > <trondmy@xxxxxxxxxxxxxxx> wrote:
> > >
> > > On Tue, 2025-03-18 at 14:03 -0700, Rick Macklem wrote:
> > > >
> > > > The problem I see is that WRITE_SAME isn't defined in a way where
> > > > the NFSv4 server can implement only zeroing and fail the rest.
> > > > As such, I am thinking that a new operation for NFSv4.2 that does
> > > > writing of zeros might be preferable to trying to (mis)use
> > > > WRITE_SAME.
> > >
> > > Why wouldn't you just implement DEALLOCATE?
> > >
> >
> > Oh my god.
> >
> > NFSv4.2 DEALLOCATE creates a hole in a sparse file, and does NOT
> > write zeros.
> >
> > "holes" in sparse files (as created by NFSV4.2 DEALLOCATE) represent
> > areas of "no data here". For backwards compatibility these holes do
> > not produce read errors, they just read as 0x00 bytes. But they
> > represent ranges where just no data are stored.
> > Valid data (from allocated data ranges) can be 0x00, but there are
> > NOT
> > holes, they represent VALID DATA.
> >
> > This is an important difference!
> > For example, suppose we have files, one per week, each 700TB in size
> > (100TB per day). Each of those files starts as completely unallocated
> > space (one big hole). The data ranges are gradually allocated by
> > writes, and the position of the writes in the files represents the
> > time when the data were collected. If no data were collected during
> > that time, the space remains unallocated (a hole), so we can see
> > whether someone collected data in that timeframe.
> >
> > Do you understand the difference?
> >
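To make Lionel's distinction concrete: both a hole and a zero-filled
allocated range read back as 0x00 bytes, so read() cannot tell them
apart, but lseek(2) with SEEK_HOLE/SEEK_DATA exposes the allocation
map. A minimal sketch (file name and offset are illustrative):

    #define _GNU_SOURCE             /* for SEEK_HOLE */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("foo", O_RDONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Find the first hole at or after offset 0. A fully written
         * file reports its size here (the implicit hole at EOF); a
         * file that is one big hole reports 0. */
        off_t hole = lseek(fd, 0, SEEK_HOLE);
        if (hole == (off_t)-1)
            perror("lseek(SEEK_HOLE)");
        else
            printf("first hole at offset %lld\n", (long long)hole);

        close(fd);
        return 0;
    }

In the weekly 700TB example above, a scan like this is what lets you
see which timeframes were never written.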
>
> Yes. I do understand the difference, but in this case you're literally
> just talking about accounting. The sparse file created by DEALLOCATE
> does not need to allocate the blocks (except possibly at the edges). If
> you need to ensure that those empty blocks are allocated and accounted
> for, then a follow-up call to ALLOCATE will do that for you.
Unfortunately, ZFS knows how to deallocate, but not how to allocate.
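For what it's worth, the two-step sequence Trond describes is, as I
understand it, what the two fallocate(2) modes map to on an NFSv4.2
mount (PUNCH_HOLE -> DEALLOCATE, plain fallocate() -> ALLOCATE). A
minimal sketch, assuming a writable fd and a helper name of my own
invention:

    #define _GNU_SOURCE             /* for fallocate(), FALLOC_FL_* */
    #include <fcntl.h>
    #include <stdio.h>

    /* Hypothetical helper: punch a hole over the range, then
     * reallocate it, so the blocks are zeroed *and* accounted for. */
    int zero_and_allocate(int fd, off_t off, off_t len)
    {
        /* Step 1: maps to NFSv4.2 DEALLOCATE. */
        if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                      off, len) < 0) {
            perror("fallocate(PUNCH_HOLE)");
            return -1;
        }
        /* Step 2: maps to NFSv4.2 ALLOCATE. */
        if (fallocate(fd, 0, off, len) < 0) {
            perror("fallocate(ALLOCATE)");
            return -1;
        }
        return 0;
    }

Which, as noted above, falls over when the exported filesystem (ZFS
here) cannot implement the ALLOCATE half.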

>
> $ touch foo
> $ stat foo
>   File: foo
>   Size: 0               Blocks: 0          IO Block: 4096   regular empty file
> Device: 8,17    Inode: 410924125   Links: 1
> Access: (0644/-rw-r--r--)  Uid: (0/ root)   Gid: (0/ root)
> Context: unconfined_u:object_r:user_home_t:s0
> Access: 2025-03-18 19:26:24.113181341 -0400
> Modify: 2025-03-18 19:26:24.113181341 -0400
> Change: 2025-03-18 19:26:24.113181341 -0400
>  Birth: 2025-03-18 19:25:12.988344235 -0400
> $ truncate -s 1GiB foo
> $ stat foo
>   File: foo
>   Size: 1073741824      Blocks: 0          IO Block: 4096   regular file
> Device: 8,17    Inode: 410924125   Links: 1
> Access: (0644/-rw-r--r--)  Uid: (0/ root)   Gid: (0/ root)
> Context: unconfined_u:object_r:user_home_t:s0
> Access: 2025-03-18 19:26:24.113181341 -0400
> Modify: 2025-03-18 19:27:35.161694301 -0400
> Change: 2025-03-18 19:27:35.161694301 -0400
>  Birth: 2025-03-18 19:25:12.988344235 -0400
> $ fallocate -z -l 1GiB foo
> $ stat foo
>   File: foo
>   Size: 1073741824      Blocks: 2097152    IO Block: 4096   regular file
> Device: 8,17    Inode: 410924125   Links: 1
> Access: (0644/-rw-r--r--)  Uid: (0/ root)   Gid: (0/ root)
> Context: unconfined_u:object_r:user_home_t:s0
> Access: 2025-03-18 19:26:24.113181341 -0400
> Modify: 2025-03-18 19:27:54.462817356 -0400
> Change: 2025-03-18 19:27:54.462817356 -0400
>  Birth: 2025-03-18 19:25:12.988344235 -0400
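Side note: the "fallocate -z" step in the transcript is
FALLOC_FL_ZERO_RANGE at the syscall level. A minimal sketch of the
equivalent call, reusing the same illustrative file name:

    #define _GNU_SOURCE             /* for fallocate(), FALLOC_FL_* */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("foo", O_WRONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Equivalent of "fallocate -z -l 1GiB foo": allocate the
         * range and make it read back as zeros, without issuing
         * 1GiB worth of write()s. */
        if (fallocate(fd, FALLOC_FL_ZERO_RANGE, 0, 1073741824LL) < 0)
            perror("fallocate(ZERO_RANGE)");

        close(fd);
        return 0;
    }

Note the Blocks count jumping from 0 to 2097152 in the transcript: the
range is now allocated and accounted for, yet nothing was physically
written.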
>
>
> Yes, I also realise that none of the above operations actually resulted
> in blocks being physically filled with data, but all modern flash-based
> drives tend to have a log-structured FTL. So while overwriting data in
> the HDD era meant that you would usually (unless you had a log-based
> filesystem) overwrite the same physical space with data, today's drives
> are free to shift the rewritten block to any new physical location in
> order to ensure even wear levelling of the SSD.
Yea. The Wr_zero operation writes 0s to the logical block. Does that
guarantee there is no "old block for the logical block" that still holds
the data? (It does say Wr_zero can be used for secure erasure, but??)

Good question; I have no idea what the answer is, rick

>
> IOW: there is no real advantage to physically writing out the data
> unless you have a peculiar interest in wasting time.
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@xxxxxxxxxxxxxxx
>
>




