On 21/09/21 10:44AM, Dave Chinner wrote:
> On Fri, Sep 17, 2021 at 06:31:01PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@xxxxxxxxxx>
> >
> > Add a new mode to fallocate to zero-initialize all the storage backing a
> > file.
> >
> > Signed-off-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> > ---
> >  fs/open.c                   |    5 +++++
> >  include/linux/falloc.h      |    1 +
> >  include/uapi/linux/falloc.h |    9 +++++++++
> >  3 files changed, 15 insertions(+)
> >
> >
> > diff --git a/fs/open.c b/fs/open.c
> > index daa324606a41..230220b8f67a 100644
> > --- a/fs/open.c
> > +++ b/fs/open.c
> > @@ -256,6 +256,11 @@ int vfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> >  	    (mode & ~FALLOC_FL_INSERT_RANGE))
> >  		return -EINVAL;
> >
> > +	/* Zeroinit should only be used by itself and keep size must be set. */
> > +	if ((mode & FALLOC_FL_ZEROINIT_RANGE) &&
> > +	    (mode != (FALLOC_FL_ZEROINIT_RANGE | FALLOC_FL_KEEP_SIZE)))
> > +		return -EINVAL;
> > +
> >  	/* Unshare range should only be used with allocate mode. */
> >  	if ((mode & FALLOC_FL_UNSHARE_RANGE) &&
> >  	    (mode & ~(FALLOC_FL_UNSHARE_RANGE | FALLOC_FL_KEEP_SIZE)))
> > diff --git a/include/linux/falloc.h b/include/linux/falloc.h
> > index f3f0b97b1675..4597b416667b 100644
> > --- a/include/linux/falloc.h
> > +++ b/include/linux/falloc.h
> > @@ -29,6 +29,7 @@ struct space_resv {
> >  	 FALLOC_FL_PUNCH_HOLE |		\
> >  	 FALLOC_FL_COLLAPSE_RANGE |	\
> >  	 FALLOC_FL_ZERO_RANGE |		\
> > +	 FALLOC_FL_ZEROINIT_RANGE |	\
> >  	 FALLOC_FL_INSERT_RANGE |	\
> >  	 FALLOC_FL_UNSHARE_RANGE)
> >
> > diff --git a/include/uapi/linux/falloc.h b/include/uapi/linux/falloc.h
> > index 51398fa57f6c..8144403b6102 100644
> > --- a/include/uapi/linux/falloc.h
> > +++ b/include/uapi/linux/falloc.h
> > @@ -77,4 +77,13 @@
> >   */
> >  #define FALLOC_FL_UNSHARE_RANGE		0x40
> >
> > +/*
> > + * FALLOC_FL_ZEROINIT_RANGE is used to reinitialize storage backing a file by
> > + * writing zeros to it.  Subsequent read and writes should not fail due to any
> > + * previous media errors.  Blocks must be not be shared or require copy on
> > + * write.  Holes and unwritten extents are left untouched.  This mode must be
> > + * used with FALLOC_FL_KEEP_SIZE.
> > + */
> > +#define FALLOC_FL_ZEROINIT_RANGE	0x80
>
> Hmmmm.
>
> I think this wants to be a behavioural modifier for existing
> operations rather than an operation unto itself. i.e. similar to how
> KEEP_SIZE modifies ALLOC behaviour but doesn't fundamentally alter
> the guarantees ALLOC provides userspace.
>
> In this case, the change of behaviour over ZERO_RANGE is that we
> want physical zeros to be written instead of the filesystem
> optimising away the physical zeros by manipulating the layout
> of the file.
>
> There's been requests in the past for a way to make ALLOC also
> behave like this - in the case that users want fully allocated space
> to be preallocated so their applications don't take unwritten extent
> conversion penalties on first writes. Databases are an example here,
> where setup of a new WAL file isn't performance critical, but writes
> to the WAL are and the WAL files are write-once. Hence they always
> take unwritten conversion penalties and the only way around that is
> to physically zero the files before use...
>
> So it seems to me what we actually need here is a "write zeroes"
> modifier to fallocate() operations to tell the filesystem that the
> application really wants it to write zeroes over that range, not
> just guarantee space has been physically allocated....
>
> Then we have and API that looks like:
>
> 	ALLOC		- allocate space efficiently
> 	ALLOC | INIT	- allocate space by writing zeros to it
> 	ZERO		- zero data and preallocate space efficiently
> 	ZERO | INIT	- zero range by writing zeros to it
>
> Which seems to cater for all the cases I know of where physically
> writing zeros instead of allocating unwritten extents is the
> preferred behaviour of fallocate()....
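The check that the patch adds to vfs_fallocate() can be modelled on its own in userspace, which makes the allowed combinations easy to see. What follows is a rough sketch under these assumptions:

- The existing mode values are copied from include/uapi/linux/falloc.h.
- FALLOC_FL_ZEROINIT_RANGE (0x80) is only the value proposed by the patch, not a mainline flag.
- The helper zeroinit_mode_ok() is hypothetical and mirrors just this one check, not the rest of vfs_fallocate().

```c
#include <assert.h>
#include <stdbool.h>

/* Existing fallocate(2) mode bits, values as in include/uapi/linux/falloc.h. */
#define FALLOC_FL_KEEP_SIZE       0x01
#define FALLOC_FL_PUNCH_HOLE      0x02
#define FALLOC_FL_COLLAPSE_RANGE  0x08
#define FALLOC_FL_ZERO_RANGE      0x10
#define FALLOC_FL_INSERT_RANGE    0x20
#define FALLOC_FL_UNSHARE_RANGE   0x40
/* Value proposed by the patch under review; not a mainline flag. */
#define FALLOC_FL_ZEROINIT_RANGE  0x80

/* The proposed bit must not collide with any of the existing bits above. */
static_assert((FALLOC_FL_ZEROINIT_RANGE &
               (FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
                FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE |
                FALLOC_FL_INSERT_RANGE | FALLOC_FL_UNSHARE_RANGE)) == 0,
              "ZEROINIT bit overlaps an existing mode bit");

/*
 * Hypothetical helper mirroring only the new vfs_fallocate() check:
 * if the ZEROINIT bit is set, the mode must be exactly
 * ZEROINIT | KEEP_SIZE. Returns false where that check returns -EINVAL.
 */
bool zeroinit_mode_ok(int mode)
{
    if ((mode & FALLOC_FL_ZEROINIT_RANGE) &&
        mode != (FALLOC_FL_ZEROINIT_RANGE | FALLOC_FL_KEEP_SIZE))
        return false;
    return true;
}
```

With these definitions, ZEROINIT | KEEP_SIZE passes, while ZEROINIT alone or ZEROINIT | KEEP_SIZE | PUNCH_HOLE is rejected. Modes without the ZEROINIT bit are left to the other checks. This exclusivity is exactly what the "modifier" suggestion would relax.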
>

If that's the case, can't we just have a FALLOC_FL_ZEROWRITE_RANGE flag,
where FALLOC_FL_ZERO_RANGE and FALLOC_FL_ZEROWRITE_RANGE are mutually
exclusive? AFAIU:

/*
 * FALLOC_FL_ZERO_RANGE may optimize the zeroing by using unwritten extents
 * for the underlying blocks if the filesystem supports them, but with
 * FALLOC_FL_ZEROWRITE_RANGE the underlying blocks are guaranteed to be
 * written with zeros. In case of a hole, it will be preallocated with
 * written extents and initialized with zeroes. If FALLOC_FL_KEEP_SIZE is
 * specified, then the inode size will remain the same.
 *
 * Essentially similar to FALLOC_FL_ZERO_RANGE, but with a guarantee that
 * the underlying storage has written extents initialized with zeroes.
 */
#define FALLOC_FL_ZEROWRITE_RANGE	0x80

Does that make sense?

-ritesh
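The FALLOC_FL_ZEROWRITE_RANGE idea could be validated roughly as follows. This is a minimal sketch under these assumptions:

- FALLOC_FL_ZEROWRITE_RANGE (0x80) is only the value proposed above, not a mainline flag.
- ZEROWRITE may be combined only with FALLOC_FL_KEEP_SIZE.
- Without KEEP_SIZE, the file grows to offset + len, as with other allocating fallocate() modes.
- Both helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Existing mode bits, values as in include/uapi/linux/falloc.h. */
#define FALLOC_FL_KEEP_SIZE        0x01
#define FALLOC_FL_ZERO_RANGE       0x10
/* Value proposed in the reply above; not a mainline flag. */
#define FALLOC_FL_ZEROWRITE_RANGE  0x80

static_assert((FALLOC_FL_ZEROWRITE_RANGE & FALLOC_FL_ZERO_RANGE) == 0,
              "ZEROWRITE and ZERO_RANGE must be distinct bits");

/*
 * Hypothetical validation: ZEROWRITE and ZERO_RANGE are mutually exclusive,
 * and (assumption) ZEROWRITE may only be combined with KEEP_SIZE.
 */
bool zerowrite_mode_ok(int mode)
{
    if (!(mode & FALLOC_FL_ZEROWRITE_RANGE))
        return true;            /* rule only constrains ZEROWRITE modes */
    if (mode & FALLOC_FL_ZERO_RANGE)
        return false;           /* the two zeroing modes are exclusive */
    return (mode & ~(FALLOC_FL_ZEROWRITE_RANGE | FALLOC_FL_KEEP_SIZE)) == 0;
}

/*
 * Hypothetical size rule: with KEEP_SIZE the inode size is unchanged;
 * otherwise (assumed, like other allocating modes) the file grows to
 * offset + len when that lies past the current size.
 */
int64_t zerowrite_new_size(int mode, int64_t old_size,
                           int64_t offset, int64_t len)
{
    int64_t end = offset + len;

    if (mode & FALLOC_FL_KEEP_SIZE)
        return old_size;
    return end > old_size ? end : old_size;
}
```

Unlike the patch's ZEROINIT rule, this sketch does not require KEEP_SIZE. A ZEROWRITE over a range past EOF would therefore extend the file, just as ZERO_RANGE does today.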