On Tue, Jun 06 2023 at 10:01P -0400,
Dave Chinner <david@xxxxxxxxxxxxx> wrote:

> On Sat, Jun 03, 2023 at 11:57:48AM -0400, Mike Snitzer wrote:
> > On Fri, Jun 02 2023 at 8:52P -0400,
> > Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> >
> > > Mike, I think you might have misunderstood what I have been proposing.
> > > Possibly unintentionally, I didn't call it REQ_OP_PROVISION but
> > > that's what I intended - the operation does not contain data at all.
> > > It's an operation like REQ_OP_DISCARD or REQ_OP_WRITE_ZEROES - it
> > > contains a range of sectors that need to be provisioned (or
> > > discarded), and nothing else.
> >
> > No, I understood that.
> >
> > > The write IOs themselves are not tagged with anything special at all.
> >
> > I know, but I've been looking at how to also handle the delalloc
> > usecase (and yes I know you feel it doesn't need handling, the issue
> > is XFS does deal nicely with ensuring it has space when it tracks its
> > allocations on "thick" storage
>
> Oh, no it doesn't. It -works for most cases-, but that does not mean
> it provides any guarantees at all. We can still get ENOSPC for user
> data when delayed allocation reservations "run out".
>
> This may be news to you, but the ephemeral XFS delayed allocation
> space reservation is not accurate. It contains a "fudge factor"
> called "indirect length". This is a "wet finger in the wind"
> estimation of how much new metadata will need to be allocated to
> index the physical allocations when they are made. It assumes large
> data extents are allocated, which is good enough for most cases, but
> it is no guarantee when there are no large data extents available to
> allocate (e.g. near ENOSPC!).
>
> And therein lies the fundamental problem with ephemeral range
> reservations: at the time of reservation, we don't know how many
> individual physical LBA ranges the reserved data range is actually
> going to span.
>
> As a result, XFS delalloc reservations are a "close-but-not-quite"
> reservation backed by a global reserve pool that can be dipped into
> if we run out of delalloc reservation. If the reserve pool is then
> fully depleted before all delalloc conversion completes, we'll still
> give ENOSPC. The pool is sized such that the vast majority of
> workloads will complete delalloc conversion successfully before the
> pool is depleted.
>
> Hence XFS gives everyone the -appearance- that it deals nicely with
> ENOSPC conditions, but it never provides a -guarantee- that any
> accepted write will always succeed without ENOSPC.
>
> IMO, using this "close-but-not-quite" reservation as the basis of
> space requirements for other layers to provide "won't ENOSPC"
> guarantees is fraught with problems. We already know that it is
> insufficient in important corner cases at the filesystem level, and
> we also know that lower layers trying to do ephemeral space
> reservations will have exactly the same problems providing a
> guarantee. And these are problems we've been unable to engineer
> around in the past, so the likelihood we can engineer around them
> now or in the future is also very unlikely.

Thanks for clarifying. I wasn't aware of XFS delalloc's "wet finger in
the air" ;)

So do you think it reasonable to require applications to fallocate
their data files? I'm not sure users are aware they need to take that
extra step (rough sketch of what I mean below).

> > -- so adding coordination between XFS
> > and dm-thin layers provides comparable safety.. that safety is an
> > expected norm).
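
(To be concrete about the "extra step" above: the application would
have to preallocate its files before writing into them, along the
lines of the rough userspace sketch below. preallocate() is just an
illustrative name, nothing from this patchset; the idea being that,
with the XFS changes, this is the point where REQ_OP_PROVISION would
be sent down to dm-thin.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

/*
 * Reserve the file's space up front so the filesystem allocates
 * (and, once REQ_OP_PROVISION is plumbed through, the thin device
 * below it provisions) real space before any data is written.
 */
static int preallocate(const char *path, off_t size)
{
	int fd = open(path, O_CREAT | O_RDWR, 0644);

	if (fd < 0)
		return -1;

	/* mode 0: allocate blocks and extend the file to 'size' */
	if (fallocate(fd, 0, 0, size) < 0) {
		perror("fallocate");
		close(fd);
		return -1;
	}

	return close(fd);
}
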
> >
> > But rather than discuss in terms of data vs metadata, the distinction
> > is:
> > 1) LBA range reservation (normal case, your proposal)
> > 2) non-LBA reservation (absolute value, LBA range is known at later stage)
> >
> > But I'm clearly going off script for dwelling on wanting to handle
> > both.
>
> Right, because if we do 1) then we don't need 2). :)

Sure.

> > My looking at (ab)using REQ_META being set (use 1) vs not (use 2) was
> > a crude simplification for branching between the 2 approaches.
> >
> > And I understand I made you nervous by expanding the scope to a much
> > more muddled/shitty interface. ;)
>
> Nervous? No, I'm simply trying to make sure that everyone is on the
> same page. i.e. that if we water down the guarantee that 1) relies
> on, then it's not actually useful to filesystems at all.

Yeah, makes sense.

> > > Put simply: if we restrict REQ_OP_PROVISION guarantees to just
> > > REQ_META writes (or any other specific type of write operation) then
> > > it's simply not worth pursuing at the filesystem level because the
> > > guarantees we actually need just aren't there and the complexity of
> > > discovering and handling those corner cases just isn't worth the
> > > effort.
> >
> > Here is where I get to say: I think you misunderstood me (but it was
> > my fault for not being absolutely clear: I'm very much on the same
> > page as you and Joe; and your visions need to just be implemented
> > ASAP).
>
> OK, good that we've clarified the misunderstandings on both sides
> quickly :)

Do you think you're OK to scope out, and/or implement, the XFS changes
if you use v7 of this patchset as the starting point? (v8 should just
be v7 minus the dm-thin.c and dm-snap.c changes).

The thinp support in v7 will work well enough to allow XFS to issue
REQ_OP_PROVISION and/or fallocate (via mkfs.xfs) to dm-thin devices.
And Joe and I can make independent progress on the dm-thin.c changes
needed to ensure the REQ_OP_PROVISION guarantee you need.

Thanks,
Mike