On Tue, Jul 30, 2024 at 07:03:50PM +0200, Florian Weimer wrote:
>
> At the very least, we should have a variant of ftruncate that never
> truncates, likely under the fallocate umbrella.  It seems that that's
> how posix_fallocate is used sometimes, for avoiding SIGBUS with mmap.
> To these use cases, whether extents are allocated or not does not
> matter.

Personally, what I advise any application authors I come across is
simply to avoid using posix_fallocate(3) altogether; the semantics are
totally broken, as is common with anything mandated by a committee
that was trying to satisfy multiple legacy Unix implementations.
Relying on it is just going to be fraught.

What I tell them to do instead is to use the Linux fallocate(2) system
call directly, which is well-defined.  If the file system doesn't
support fallocate and fallocate(2) returns EOPNOTSUPP, the userspace
application should either accept the fact that it won't be able to
allocate the space or, if it really needs to avoid things like the
SIGBUS with mmap(2), do the zero-fill writes itself.

So honestly, is it worth it to try "fixing" posix_fallocate(3)?  Just
tell people to avoid it like the plague....  That way, we don't have
to worry about breaking existing legacy applications.

If we are going to stick with the existing Linux fallocate(2) system
call, then the problem is trying to have the system mind-read what
the application writer was really trying to get when they called
fallocate(2): are they trying to avoid SIGBUS with mmap?  Or are they
trying to guarantee that any writes to that file range will never fail
with ENOSPC (even in the face of something like dm-thin being in the
storage stack)?

And so the solution is simple; we can define new flag bits for the
fallocate(2) system call to make explicit exactly what the application
is requesting of the system.

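As a rough sketch of the "call fallocate(2) directly, and zero-fill
yourself if you really must" advice earlier in this message, something
like the following could work.  The helper names grow_file() and
zero_fill() are made up for illustration; they are not an existing API.
The sketch assumes the goal is only to make sure the file is at least
new_size bytes long (the SIGBUS-with-mmap case), so it never shrinks
the file and only touches the range past the current end of file.

```c
#define _GNU_SOURCE             /* exposes fallocate() in <fcntl.h> on glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write zeros over [start, end) with pwrite(2). */
static int zero_fill(int fd, off_t start, off_t end)
{
    static const char zeros[65536];     /* static storage, so all zero */

    assert(start <= end);
    while (start < end) {
        size_t want = sizeof(zeros);
        if ((off_t)want > end - start)
            want = (size_t)(end - start);

        ssize_t n = pwrite(fd, zeros, want, start);
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted, retry this chunk */
            return -1;                  /* errno set by pwrite(2) */
        }
        if (n == 0) {                   /* should not happen; avoid spinning */
            errno = EIO;
            return -1;
        }
        start += n;
    }
    return 0;
}

/*
 * Hypothetical helper: make sure the file is at least new_size bytes long.
 * Never truncates.  Returns 0 on success, -1 with errno set on failure.
 */
int grow_file(int fd, off_t new_size)
{
    struct stat st;

    if (fstat(fd, &st) < 0)
        return -1;
    if (st.st_size >= new_size)
        return 0;                       /* already big enough, do nothing */

    /* Mode 0 allocates the range and extends the file size. */
    if (fallocate(fd, 0, st.st_size, new_size - st.st_size) == 0)
        return 0;

    /* ENOSPC and friends are real answers; hand them to the caller. */
    if (errno != EOPNOTSUPP)
        return -1;

    /* File system has no fallocate support: do the zero-fill by hand. */
    return zero_fill(fd, st.st_size, new_size);
}
```

A caller about to mmap(2) the first N bytes of a file would call
grow_file(fd, N) first, and treat a -1 return with errno set to ENOSPC
as "the space is not available" rather than retrying.  Note that this
says nothing about holes inside the existing part of the file, and it
does not help with thin-provisioned storage underneath.
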
Adding new fallocate(2) flag bits seems to be a more general solution
than adding a new ftruncate(2) variant.  In addition, we can also add
a new flag which requests that the file system pass the allocation
request down to the thin-provisioned storage (assuming that this is
something that is supported).  Although I'm not sure how much this
matters; after all, for decades there have been thin-provisioned
NetApp storage appliances where fallocate(2) or posix_fallocate(3)
wouldn't necessarily guarantee that a thin-provisioned device won't
run out of space on a write(2), and application authors seem to have
been willing to live with it.

Still, if people really want this to work, even in the face of a file
system which supports copy-on-write cloned ranges, then presumably
fallocate(2) with the new "never shall a write fail with ENOSPC" bit
set can also snap the COW region.  It's important, though, that this
be done using a new fallocate(2) flag, as opposed to having it
magically added to the existing fallocate(2) behavior, since that will
likely cause surprises for some applications.

                                        - Ted