Re: fallocate(FALLOC_FL_ZERO_RANGE_BUT_REALLY) to avoid unwritten extents?

On Wed, Jan 06, 2021 at 03:40:09PM -0800, Andres Freund wrote:
> Hi,
> 
> On 2021-01-07 09:52:01 +1100, Dave Chinner wrote:
> > On Tue, Dec 29, 2020 at 10:28:19PM -0800, Andres Freund wrote:
> > > Which brings me to $subject:
> > > 
> > > Would it make sense to add a variant of FALLOC_FL_ZERO_RANGE that
> > > doesn't convert extents into unwritten extents, but instead uses
> > > blkdev_issue_zeroout() if supported?  Mostly interested in xfs/ext4
> > > myself, but ...
> > 
> > We have explicit requests from users (think initialising large VM
> > images) that FALLOC_FL_ZERO_RANGE must never fall back to writing
> > zeroes manually.
> 
> That behaviour makes a lot of sense for quite a few use cases - I wasn't
> trying to make it sound like it should not be available. Nor that
> FALLOC_FL_ZERO_RANGE should behave differently.
> 
> 
> > IOWs, while you might want FALLOC_FL_ZERO_RANGE to explicitly write
> > zeros, we have users who explicitly don't want it to do this.
> 
> Right - which is why I was asking for a variant of FALLOC_FL_ZERO_RANGE
> (jokingly named FALLOC_FL_ZERO_RANGE_BUT_REALLY in the subject), rather
> than changing the behaviour.
> 
> 
> > Perhaps we should add FALLOC_FL_CONVERT_RANGE, which tells the
> > filesystem to convert an unwritten range of zeros to a written range
> > by manually writing zeros. i.e. you do FALLOC_FL_ZERO_RANGE to zero
> > the range and fill holes using metadata manipulation, followed by
> > FALLOC_FL_WRITE_RANGE to then convert the "metadata zeros" to real
> > written zeros.
> 
> Yep, something like that would do the trick. Perhaps
> FALLOC_FL_MATERIALIZE_RANGE?

[ FWIW, I really dislike the "RANGE" part of fallocate flag names.
It's redundant (fallocate always operates on a range!) and just
makes names unnecessarily longer. ]

I used "convert range" as the name explicitly because it has
specific meaning for extent space manipulation. i.e. we "convert"
extents from one state to another. "write range" also has
explicit meaning, in that it will convert extents from unwritten to
written data.

In comparison, "materialise" is something undefined, and could be
easily thought to take something ephemeral (such as a hole) and turn
it into something real (an allocated extent). We wouldn't want this
operation to allocate space, so I think "materialise" is just too
much magic to encode into an API for an explicit, well defined
state change.

We also have people asking for ZERO_RANGE to just flip existing
extents from written to unwritten (rather than the punch/preallocate
we do now). This is also a "convert" operation, just in the other
direction (from data to zeros rather than from zeros to data).

The observation I'm making here is that these "convert" operations
will both make SEEK_HOLE/SEEK_DATA behave differently for the
underlying data. Preallocated space is considered a HOLE, written
zeros are considered DATA. So we do expose the ability to check that
a "convert" operation has actually changed the state of the
underlying extents in either direction...
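
As a rough illustration of what "testable via SEEK_HOLE/SEEK_DATA"
means from userspace, here is a minimal sketch (Python for brevity)
that walks a file and reports which byte ranges the filesystem
considers holes and which it considers data. The helper name
extent_map() is made up for this example. It assumes Linux, and on
filesystems that do not track holes the whole file is reported as a
single data range.

```python
import errno
import os


def extent_map(fd):
    """Return [(kind, start, end), ...] covering the file behind fd.

    kind is "data" or "hole", as reported by lseek(SEEK_DATA/SEEK_HOLE).
    Hypothetical helper for illustration only.
    """
    size = os.fstat(fd).st_size
    saved = os.lseek(fd, 0, os.SEEK_CUR)  # lseek() moves the offset; restore it later
    segments = []
    pos = 0
    try:
        while pos < size:
            try:
                data = os.lseek(fd, pos, os.SEEK_DATA)
            except OSError as e:
                if e.errno != errno.ENXIO:
                    raise
                # ENXIO: no data at or after pos, so the rest is one hole.
                segments.append(("hole", pos, size))
                break
            if data > pos:
                segments.append(("hole", pos, data))
            # There is always an implicit hole at EOF, so this finds an end.
            hole = os.lseek(fd, data, os.SEEK_HOLE)
            segments.append(("data", data, hole))
            pos = hole
    finally:
        os.lseek(fd, saved, os.SEEK_SET)
    return segments
```

Running this on a range before and after a conversion is the check
I mean: a range that has only been preallocated should show up as
"hole", and the same range once it holds written zeros should show
up as "data".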

CONVERT_TO_DATA/CONVERT_TO_ZERO as an operational pair whose
behaviour is visible and easily testable via SEEK_HOLE/SEEK_DATA
makes a lot more sense to me. They should also be defined to fail
fast if unwritten extents are not supported by the filesystem (i.e.
they should -never- physically write anything). That would allow
applications to fall back to ZERO_RANGE, and so explicitly write
zeros, on filesystems that don't support unwritten extents when
CONVERT_TO_ZERO fails....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx


