Re: [PATCH v7 0/5] Introduce provisioning primitives

Dave Chinner <david@xxxxxxxxxxxxx> · Sat, 3 Jun 2023 10:52:14 +1000

On Fri, Jun 02, 2023 at 11:44:27AM -0700, Sarthak Kukreti wrote:
> On Tue, May 30, 2023 at 8:28 AM Mike Snitzer <snitzer@xxxxxxxxxx> wrote:
> >
> > On Tue, May 30 2023 at 10:55P -0400,
> > Joe Thornber <thornber@xxxxxxxxxx> wrote:
> >
> > > On Tue, May 30, 2023 at 3:02 PM Mike Snitzer <snitzer@xxxxxxxxxx> wrote:
> > >
> > > >
> > > > Also Joe, for you proposed dm-thinp design where you distinquish
> > > > between "provision" and "reserve": Would it make sense for REQ_META
> > > > (e.g. all XFS metadata) with REQ_PROVISION to be treated as an
> > > > LBA-specific hard request?  Whereas REQ_PROVISION on its own provides
> > > > more freedom to just reserve the length of blocks? (e.g. for XFS
> > > > delalloc where LBA range is unknown, but dm-thinp can be asked to
> > > > reserve space to accomodate it).
> > > >
> > >
> > > My proposal only involves 'reserve'.  Provisioning will be done as part of
> > > the usual io path.
> >
> > OK, I think we'd do well to pin down the top-level block interfaces in
> > question. Because this patchset's block interface patch (2/5) header
> > says:
> >
> > "This patch also adds the capability to call fallocate() in mode 0
> > on block devices, which will send REQ_OP_PROVISION to the block
> > device for the specified range,"
> >
> > So it wires up blkdev_fallocate() to call blkdev_issue_provision(). A
> > user of XFS could then use fallocate() for user data -- which would
> > cause thinp's reserve to _not_ be used for critical metadata.

Mike, I think you might have misunderstood what I have been proposing.
Possibly unintentionally, I didn't call it REQ_OP_PROVISION but
that's what I intended - the operation does not contain data at all.
It's an operation like REQ_OP_DISCARD or REQ_OP_WRITE_ZEROS - it
contains a range of sectors that need to be provisioned (or
discarded), and nothing else. The write IOs themselves are not
tagged with anything special at all.

i.e. The proposal I made does not use REQ_PROVISION anywhere in the
metadata/data IO path; provisioned regions are created by separate
operations and must be tracked by the underlying block device, then
treat any write IO to those regions as "must not fail w/ ENOSPC"
IOs.

There seems to be a lot of fear about user data requiring
provisioning. This is unfounded - provisioning is only needed for
explicitly provisioned space via fallocate(), not every byte of
user data written to the filesystem (the model Brian is proposing).

Excessive use of fallocate() is self correcting - if users and/or
their applications provision too much, they are going to get ENOSPC
or have to pay more to expand the backing pool reserves they need.
But that's not a problem the block device should be trying to solve;
that's a problem for the sysadmin and/or bean counters to address.

> >
> > The only way to distinquish the caller (between on-behalf of user data
> > vs XFS metadata) would be REQ_META?
> >
> > So should dm-thinp have a REQ_META-based distinction? Or just treat
> > all REQ_OP_PROVISION the same?
> >
> I'm in favor of a REQ_META-based distinction.

Why? What *requirement* is driving the need for this distinction?

As the person who proposed this new REQ_OP_PROVISION architecture,
I'm dead set against it.  Allowing the block device provide a set of
poorly defined "conditional guarantees" policies instead of a
mechanism with a single ironclad guarantee defeats the entire
purpose of the proposal. 

We have a requirement from the *kernel ABI* that *user data writes*
must not fail with ENOSPC after an fallocate() operation.  That's
one of the high level policies we need to implement. The filesystem
is already capable of guaranteeing it won't give the user ENOSPC
after fallocate, we now need a guarantee from the filesystem's
backing store that it won't give ENOSPC, too.

The _other thing_ we need to implement is a method of guaranteeing
the filesystem won't shut down when the backing device goes ENOSPC
unexpected during metadata writeback.  So we also need the backing
device to guarantee the regions we write metadata to won't give
ENOSPC.

That's the whole point of REQ_OP_PROVISION: from the layers above
the block device, there is -zero- difference between the guarantee
we need for user data writes to avoid ENOSPC and for metadata writes
to avoid ENOSPC. They are one and the same.

Hence if the block device is going to say "I support provisioning"
but then give different conditional guarantees according to the
*type of data* in the IO request, then it does not provide the
functionality the higher layers actually require from it.

Indeed, what type of data the IO contains is *context dependent*.
For example, sometimes we write metadata with user data IO and but
we still need provisioning guarantees as if it was issued as
metadata IO. This is the case for mkfs initialising the file system
by writing directly to the block device.

IOWs, filesystem metadata IO issued from kernel context would be
considered metadata IO, but from userspace it would be considered
normal user data IO and hence treated differently. But the reality
is that they both need the same provisioning guarantees to be
provided by the block device.

So how do userspace tools deal with this if the block device
requires REQ_META on user data IOs to do the right thing here? And
if we provide a mechanism to allow this, how do we prevent userspace
for always using it on writes to fallocate() provisioned space?

It's just not practical for the block device to add arbitrary
constraints based on the type of IO because we then have to add
mechanisms to userspace APIs to allow them to control the IO context
so the block device will do the right thing. Especially considering
we really only need one type of guarantee regardless of where the IO
originates from or what type of data the IO contains....

> Does that imply that
> REQ_META also needs to be passed through the block/filesystem stack
> (eg. REQ_OP_PROVION + REQ_META on a loop device translates to a
> fallocate(<insert meta flag name>) to the underlying file)?

This is exactly the same case as above: the loopback device does
user data IO to the backing file. Hence we have another situation
where metadata IO is issued to fallocate()d user data ranges as user
data ranges and so would be given a lesser guarantee that would lead
to upper filesystem failure. BOth upper and lower filesystem data
and metadata need to be provided the same ENOSPC guarantees by their
backing stores....

The whole point of the REQ_OP_PROVISION proposal I made is that it
doesn't require any special handling in corner cases like this.
There are no cross-layer interactions needed to make everything work
correctly because the provisioning guarantee is not -data type
dependent*. The entire user IO path code remains untouched and
blissfully unaware of provisioned regions.

And, realistically, if we have to start handling complex corner
cases in the filesystem and IO path layers to make REQ_OP_PROVISION
work correctly because of arbitary constraints imposed by the block
layer implementations, then we've failed miserably at the design and
architecture stage.

Keep in mind that every attempt made so far to address the problems
with block device ENOSPC errors has failed because of the complexity
of the corner cases that have arisen during design and/or
implementation. It's pretty ironic that now we have a proposal that
is remarkably simple, free of corner cases and has virtually no
cross-layer coupling at all, the first thing that people want to do
is add arbitrary implementation constraints that result in complex
cross-layer corner cases that now need to be handled....

Put simply: if we restrict REQ_OP_PROVISION guarantees to just
REQ_META writes (or any other specific type of write operation) then
it's simply not worth persuing at the filesystem level because the
guarantees we actually need just aren't there and the complexity of
discovering and handling those corner cases just isn't worth the
effort.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx