On Tue, Mar 18, 2025 at 5:00 PM Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
>
> On Tue, 2025-03-18 at 16:52 -0700, Rick Macklem wrote:
> > On Tue, Mar 18, 2025 at 4:40 PM Trond Myklebust
> > <trondmy@xxxxxxxxxxxxxxx> wrote:
> > >
> > > On Tue, 2025-03-18 at 23:37 +0100, Lionel Cons wrote:
> > > > On Tue, 18 Mar 2025 at 22:17, Trond Myklebust
> > > > <trondmy@xxxxxxxxxxxxxxx> wrote:
> > > > >
> > > > > On Tue, 2025-03-18 at 14:03 -0700, Rick Macklem wrote:
> > > > > >
> > > > > > The problem I see is that WRITE_SAME isn't defined in a way
> > > > > > where the NFSv4 server can implement only zeroing and fail
> > > > > > the rest. As such, I am thinking that a new operation for
> > > > > > NFSv4.2 that does writing of zeros might be preferable to
> > > > > > trying to (mis)use WRITE_SAME.
> > > > >
> > > > > Why wouldn't you just implement DEALLOCATE?
> > > > >
> > > >
> > > > Oh my god.
> > > >
> > > > NFSv4.2 DEALLOCATE creates a hole in a sparse file, and does
> > > > NOT write zeros.
> > > >
> > > > "Holes" in sparse files (as created by NFSv4.2 DEALLOCATE)
> > > > represent areas of "no data here". For backwards compatibility
> > > > these holes do not produce read errors; they just read as 0x00
> > > > bytes. But they represent ranges where no data are stored at
> > > > all. Valid data (from allocated data ranges) can be 0x00, but
> > > > those ranges are NOT holes; they represent VALID DATA.
> > > >
> > > > This is an important difference!
> > > > For example, suppose we have one file per week, 700TB file
> > > > size (100TB per day). Each of those files starts as completely
> > > > unallocated space (one big hole). The data ranges are
> > > > gradually allocated by writes, and the position of the writes
> > > > in the file represents the time when the data were collected.
> > > > If no data were collected during some interval, that space
> > > > remains unallocated (a hole), so we can see whether anyone
> > > > collected data in that timeframe.
> > > >
> > > > Do you understand the difference?
> > > >
> > >
> > > Yes, I do understand the difference, but in this case you're
> > > literally just talking about accounting. The sparse file created
> > > by DEALLOCATE does not need to allocate the blocks (except
> > > possibly at the edges). If you need to ensure that those empty
> > > blocks are allocated and accounted for, then a follow-up call to
> > > ALLOCATE will do that for you.
> > Unfortunately ZFS knows how to deallocate, but not how to
> > allocate.
>
> So there is no support for the posix fallocate function? Well, in
> the worst case, your NFS server will just have to emulate it.
The NFS server cannot emulate it. (Writing a block of zeros with
write does not guarantee that a future write of the same byte range
won't fail with ENOSPC.)
--> The FreeBSD NFSv4.2 server replies NFS4ERR_NOTSUPP.
The bothersome part is that there is a hint in the NFSv4.2 RFC that
support is "per server" and not "per file system". As such, the
server replies NFS4ERR_NOTSUPP even if there is a mix of UFS and ZFS
file systems exported. (UFS can do allocate.)

Now, what happens when fallocate() is attempted on ZFS? It should
fail, but I am not sure. If it succeeds, I would consider that a bug.
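To make the hole-vs-data distinction above concrete at the syscall
level, here is a minimal sketch using Linux fallocate(2). (The
FALLOC_FL_* modes are Linux-specific, the file name "foo" is
illustrative, and filesystems without support fail the calls with
EOPNOTSUPP; untested.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("foo", O_RDWR | O_CREAT, 0644);
	if (fd < 0) { perror("open"); return 1; }

	/* Like NFSv4.2 DEALLOCATE: punch a hole.  The range now reads
	 * as 0x00, but the blocks are deallocated (st_blocks shrinks). */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      0, 1024 * 1024) < 0)
		perror("punch hole");

	/* Like NFSv4.2 ALLOCATE (or the proposed "write zeros"
	 * operation): the range becomes allocated zeros, i.e. valid
	 * data that counts against st_blocks. */
	if (fallocate(fd, FALLOC_FL_ZERO_RANGE, 0, 1024 * 1024) < 0)
		perror("zero range");

	close(fd);
	return 0;
}

Punching the hole drops st_blocks back toward zero, while zeroing the
range keeps the blocks allocated, which is exactly the accounting
difference described above.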
> > >
> > > $ touch foo
> > > $ stat foo
> > >   File: foo
> > >   Size: 0               Blocks: 0          IO Block: 4096   regular empty file
> > > Device: 8,17    Inode: 410924125   Links: 1
> > > Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
> > > Context: unconfined_u:object_r:user_home_t:s0
> > > Access: 2025-03-18 19:26:24.113181341 -0400
> > > Modify: 2025-03-18 19:26:24.113181341 -0400
> > > Change: 2025-03-18 19:26:24.113181341 -0400
> > >  Birth: 2025-03-18 19:25:12.988344235 -0400
> > > $ truncate -s 1GiB foo
> > > $ stat foo
> > >   File: foo
> > >   Size: 1073741824      Blocks: 0          IO Block: 4096   regular file
> > > Device: 8,17    Inode: 410924125   Links: 1
> > > Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
> > > Context: unconfined_u:object_r:user_home_t:s0
> > > Access: 2025-03-18 19:26:24.113181341 -0400
> > > Modify: 2025-03-18 19:27:35.161694301 -0400
> > > Change: 2025-03-18 19:27:35.161694301 -0400
> > >  Birth: 2025-03-18 19:25:12.988344235 -0400
> > > $ fallocate -z -l 1GiB foo
> > > $ stat foo
> > >   File: foo
> > >   Size: 1073741824      Blocks: 2097152    IO Block: 4096   regular file
> > > Device: 8,17    Inode: 410924125   Links: 1
> > > Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
> > > Context: unconfined_u:object_r:user_home_t:s0
> > > Access: 2025-03-18 19:26:24.113181341 -0400
> > > Modify: 2025-03-18 19:27:54.462817356 -0400
> > > Change: 2025-03-18 19:27:54.462817356 -0400
> > >  Birth: 2025-03-18 19:25:12.988344235 -0400
> > >
> > >
> > > Yes, I also realise that none of the above operations actually
> > > resulted in blocks being physically filled with data, but all
> > > modern flash-based drives tend to have a log-structured FTL. So
> > > while overwriting data in the HDD era meant that you would
> > > usually (unless you had a log-based filesystem) overwrite the
> > > same physical space with data, today's drives are free to shift
> > > the rewritten block to any new physical location in order to
> > > ensure even wear levelling of the SSD.
> > Yes. The Wr_zero operation writes 0s to the logical block. Does
> > that guarantee there is no "old block for the logical block" that
> > still holds the data? (It does say Wr_zero can be used for secure
> > erasure, but??)
> >
> > Good question for which I don't have any idea what the answer is,
> > rick
>
> In both the above arguments, you are talking about specific
> filesystem implementation details that you'll also have to address
> with your new operation.
All the operation spec needs to say is "writes zeros to the file byte
range" and make it clear that the zeros written are data and not a
hole. (Whether or not there is hardware offload is a server
implementation detail.)
As for guarantees w.r.t. data being overwritten, I think that would
be beyond what would be required. (Data erasure is an interesting
but different topic for which I do not have any expertise.)

rick

> >
> > >
> > > IOW: there is no real advantage to physically writing out the
> > > data unless you have a peculiar interest in wasting time.
> > >
> > > --
> > > Trond Myklebust
> > > Linux NFS client maintainer, Hammerspace
> > > trond.myklebust@xxxxxxxxxxxxxxx
> > >
> > >
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@xxxxxxxxxxxxxxx
>
>
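For completeness, the data-vs-hole distinction being debated here can
be checked from userland with SEEK_HOLE/SEEK_DATA, which NFSv4.2
mirrors in its SEEK operation. A minimal sketch, assuming a platform
where lseek(2) supports SEEK_HOLE (Linux and FreeBSD both do;
untested):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <file> <offset>\n", argv[0]);
		return 1;
	}
	int fd = open(argv[1], O_RDONLY);
	if (fd < 0) { perror("open"); return 1; }

	off_t off = (off_t)atoll(argv[2]);

	/* lseek(SEEK_HOLE) returns the start of the nearest hole at
	 * or after off (EOF counts as a virtual hole).  If it returns
	 * off itself, the offset lies inside a hole; otherwise it
	 * lies in an allocated data extent, even if that extent reads
	 * as all zeros. */
	off_t hole = lseek(fd, off, SEEK_HOLE);
	if (hole < 0) { perror("lseek"); close(fd); return 1; }

	printf("offset %lld is in a %s\n", (long long)off,
	       hole == off ? "hole" : "data extent");
	close(fd);
	return 0;
}

Filesystems without sparse-file support simply report the whole file
as one data extent.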