On Wed, Jan 06, 2021 at 03:40:09PM -0800, Andres Freund wrote:
> Hi,
>
> On 2021-01-07 09:52:01 +1100, Dave Chinner wrote:
> > On Tue, Dec 29, 2020 at 10:28:19PM -0800, Andres Freund wrote:
> > > Which brings me to $subject:
> > >
> > > Would it make sense to add a variant of FALLOC_FL_ZERO_RANGE that
> > > doesn't convert extents into unwritten extents, but instead uses
> > > blkdev_issue_zeroout() if supported? Mostly interested in xfs/ext4
> > > myself, but ...
> >
> > We have explicit requests from users (think initialising large VM
> > images) that FALLOC_FL_ZERO_RANGE must never fall back to writing
> > zeroes manually.
>
> That behaviour makes a lot of sense for quite a few use cases - I wasn't
> trying to make it sound like it should not be available. Nor that
> FALLOC_FL_ZERO_RANGE should behave differently.
>
>
> > IOWs, while you might want FALLOC_FL_ZERO_RANGE to explicitly write
> > zeros, we have users who explicitly don't want it to do this.
>
> Right - which is why I was asking for a variant of FALLOC_FL_ZERO_RANGE
> (jokingly named FALLOC_FL_ZERO_RANGE_BUT_REALLY in the subject), rather
> than changing the behaviour.
>
>
> > Perhaps we should add want FALLOC_FL_CONVERT_RANGE, which tells the
> > filesystem to convert an unwritten range of zeros to a written range
> > by manually writing zeros. i.e. you do FALLOC_FL_ZERO_RANGE to zero
> > the range and fill holes using metadata manipulation, followed by
> > FALLOC_FL_WRITE_RANGE to then convert the "metadata zeros" to real
> > written zeros.
>
> Yep, something like that would do the trick. Perhaps
> FALLOC_FL_MATERIALIZE_RANGE?

[ FWIW, I really dislike the "RANGE" part of fallocate flag names.
It's redundant (fallocate always operates on a range!) and just
makes names unnecessarily longer. ]

I used "convert range" as the name explicitly because it has
specific meaning for extent space manipulation. i.e. we "convert"
extents from one state to another.
"write range" also has explicit meaning, in that it will convert
extents from unwritten to written data.

In comparison, "materialise" is something undefined, and could be
easily thought to take something ephemeral (such as a hole) and turn
it into something real (an allocated extent). We wouldn't want this
operation to allocate space, so I think "materialise" is just too
much magic to encode into an API for an explicit, well defined
state change.

We also have people asking for ZERO_RANGE to just flip existing
extents from written to unwritten (rather than the punch/preallocate
we do now). This is also a "convert" operation, just in the other
direction (from data to zeros rather than from zeros to data).

The observation I'm making here is that these "convert" operations
will both make SEEK_HOLE/SEEK_DATA behave differently for the
underlying data. Preallocated space is considered a HOLE, written
zeros are considered DATA. So we do expose the ability to check that
a "convert" operation has actually changed the state of the
underlying extents in either direction...

CONVERT_TO_DATA/CONVERT_TO_ZERO as an operational pair whose
behaviour is visible and easily testable via SEEK_HOLE/SEEK_DATA
makes a lot more sense to me. Also defining them to fail fast if
unwritten extents are not supported by the filesystem (i.e. they
should -never- physically write anything) would also allow
applications to fall back to ZERO_RANGE on filesystems that don't
support unwritten extents to explicitly write zeros if
CONVERT_TO_ZERO fails....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
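
As a rough illustration of the SEEK_HOLE/SEEK_DATA check described
above, the sketch below walks a byte range and reports which parts
the filesystem currently presents as hole or data. The helper name
classify_range() is hypothetical; the only system interface it
relies on is lseek() with SEEK_DATA and SEEK_HOLE. It assumes Linux,
and filesystems that don't track holes will simply report everything
as data.

```python
import errno
import os


def classify_range(fd, offset, length):
    """Return [(kind, start, end), ...] covering [offset, offset + length).

    kind is "data" or "hole", as reported by lseek(SEEK_DATA/SEEK_HOLE).
    The range is clamped to the current file size. Note that lseek()
    moves the file offset of fd as a side effect.
    """
    end = min(offset + length, os.fstat(fd).st_size)
    segments = []
    pos = offset
    while pos < end:
        try:
            data_start = os.lseek(fd, pos, os.SEEK_DATA)
        except OSError as exc:
            if exc.errno != errno.ENXIO:
                raise
            # ENXIO: no data at or after pos, so the rest is a hole.
            data_start = end
        data_start = min(data_start, end)
        if data_start > pos:
            segments.append(("hole", pos, data_start))
        if data_start >= end:
            break
        # There is an implicit hole at EOF, so SEEK_HOLE from a data
        # offset inside the file always finds something.
        hole_start = min(os.lseek(fd, data_start, os.SEEK_HOLE), end)
        segments.append(("data", data_start, hole_start))
        pos = hole_start
    return segments
```

Running this over a range before and after a conversion would show
whether the extent state actually changed: going by the description
above, a range converted to written zeros should come back entirely
as "data", and a range converted to unwritten should come back as
"hole".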