On Fri, May 26, 2023 at 12:04:02PM +0100, Joe Thornber wrote:
> Here's my take:
>
> I don't see why the filesystem cares if thinp is doing a reservation
> or provisioning under the hood. All that matters is that a future
> write to that region will be honoured (barring device failure etc.).
>
> I agree that the reservation/force mapped status needs to be
> inherited by snapshots.
>
> One of the few strengths of thinp is the performance of taking a
> snapshot. Most snapshots created are never activated. Many other
> snapshots are only alive for a brief period, and used read-only. eg,
> blk-archive (https://github.com/jthornber/blk-archive) uses snapshots
> to do very fast incremental backups. As such I'm strongly against any
> scheme that requires provisioning as part of the snapshot operation.
>
> Hank and I are in the middle of the range tree work which requires a
> metadata change. So now is a convenient time to piggyback other
> metadata changes to support reservations.
>
> Given the above this is what I suggest:
>
> 1) We have an api (ioctl, bio flag, whatever) that lets you
> reserve/guarantee a region:
>
>     int reserve_region(dev, sector_t begin, sector_t end);

A C-based interface is not sufficient because the layer that must do
provisioning is not guaranteed to be directly under the filesystem. We
must be able to propagate the request down to the layers that need to
provision storage, and that includes hardware devices.

e.g. dm-thin would have to issue REQ_PROVISION on the LBA ranges it
allocates in its backing device to guarantee that the provisioned LBA
range it allocates is also fully provisioned by the storage below
it....
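To illustrate what I mean, here's a rough sketch of how a filesystem
(or a stacked device like dm-thin) might push a provision request down
to the device below it. This is a sketch only: REQ_OP_PROVISION and
the bio plumbing follow the proposed patchset rather than anything in
mainline, so treat every name here as provisional.

#include <linux/bio.h>
#include <linux/blkdev.h>

/*
 * Sketch only: ask the device below us to provision an LBA range.
 * Every layer that allocates backing space (dm-thin, hardware, ...)
 * would be responsible for re-issuing this downwards for the ranges
 * it maps, so the guarantee holds through the whole stack.
 */
static int provision_range(struct block_device *bdev, sector_t sector,
                           sector_t nr_sects)
{
        struct bio *bio;
        int ret;

        /* No data payload; the range is carried in the iter alone,
         * the same way the discard/zeroout helpers do it. */
        bio = bio_alloc(bdev, 0, REQ_OP_PROVISION, GFP_KERNEL);
        bio->bi_iter.bi_sector = sector;
        bio->bi_iter.bi_size = nr_sects << SECTOR_SHIFT;

        ret = submit_bio_wait(bio);
        bio_put(bio);
        return ret;
}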
> This api should be used minimally, eg, critical FS metadata only.

Keep in mind that "critical FS metadata" in this context is any
metadata which could cause the filesystem to hang or enter a global
error state if an unexpected ENOSPC error occurs during a metadata
write IO. Which, in pretty much every journalling filesystem, equates
to all metadata in the filesystem.

For a typical root filesystem, that might be in the range of 1-200MB
(depending on journal size). For larger filesystems with lots of files
in them, it will be in the range of GBs of space. Plan for having to
support tens of GBs of provisioned space in filesystems, not tens of
MBs....

[snip]

> Now this is a lot of work. As well as the kernel changes we'll need
> to update the userland tools: thin_check, thin_ls,
> thin_metadata_unpack, thin_rmap, thin_delta, thin_metadata_pack,
> thin_repair, thin_trim, thin_dump, thin_metadata_size, thin_restore.
> Are we confident that we have buy in from the FS teams that this
> will be widely adopted? Are users asking for this? I really don't
> want to do 6 months of work for nothing.

I think there's 2-3 solid days of coding to fully implement
REQ_PROVISION support in XFS, including userspace tool support. Maybe
a couple of weeks more to flush the bugs out before it's largely ready
to go.

So if there's buy in from the block layer and DM people for
REQ_PROVISION as described, then I'll definitely have XFS support
ready for you to test whenever dm-thinp is ready to go.

I can't speak for other filesystems, but I suspect the only other one
we care about is ext4. btrfs and f2fs don't need dm-thinp, and there
aren't any other filesystems that are used in production on top of
dm-thinp, so I think only XFS and ext4 matter at this point in time.

I suspect that ext4 would be fairly easy to add support for as well.
ext4 has a lot more fixed-place metadata than XFS, so much more of its
metadata is covered by mkfs-time provisioning. Limiting dynamic
metadata to specific fully provisioned block groups, and provisioning
new block groups for metadata when they are near full, would be
equivalent to how I plan to provision metadata space in XFS. Hence the
implementation for ext4 looks to be broadly similar in scope and
complexity to XFS....
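Purely to illustrate that block group policy, not an implementation -
none of these ext4 helpers exist, every name below is made up:

/*
 * Hypothetical sketch of the policy described above: dynamic metadata
 * only goes into block groups that have been fully provisioned, and
 * the next group is provisioned *before* the current ones run out, so
 * a metadata write never sees an unexpected ENOSPC from the thin pool.
 */
#define META_BG_LOW_WATERMARK   128     /* free metadata blocks */

static int get_provisioned_meta_group(struct super_block *sb,
                                      unsigned int *group)
{
        unsigned int bg;
        int err;

        /* Hypothetical: find a provisioned group with metadata space. */
        err = find_provisioned_meta_group(sb, &bg);
        if (err)
                return err;

        if (meta_group_free_blocks(sb, bg) < META_BG_LOW_WATERMARK) {
                /* Getting close to full: provision another group now,
                 * failing early at allocation time rather than at
                 * metadata writeback time. */
                err = provision_block_group(sb, bg + 1);
                if (err)
                        return err;
        }

        *group = bg;
        return 0;
}

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx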