Some questions about per-ag metadata space reservations...

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 6 Sep 2017 20:30:54 +1000

Hi folks,

I've got a bit of a problem with the per-ag reservations we are
using at the moment. The existence of them is fine, but the
implementation is problematic for something I'm working on right
now.

I've been making a couple of mods to the filesystem to separate
physical space accounting from free space accounting to allow us to
optimise the filesystem for thinly provisioned devices. That is,
the filesystem is laid out as though it is the size of the
underlying device, but then free space is artificially limited. i.e.
we have a "physical size" of the filesystem and a "logical size"
that limits the amount of data and metadata that can actually be
stored in it.

When combined with a thinly provisioned device, this enables us to
shrink the XFS filesystem simply by running fstrim to punch all the
free space out of the underlying thin device and then adjusting the
free space down appropriately. Because the thin device abstracts the
physical location of the data in the block device away from the
address space presented to the filesystem, we don't need to move any
data or metadata to free up this space - it's just an accounting
change.
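
To make the two sizes concrete, here is a toy model of the accounting
split. It is only a sketch: the class and field names are hypothetical
and are not taken from the XFS code or from the patches described here.

```python
from dataclasses import dataclass

BLOCK_SIZE = 4096  # bytes; matches the bsize=4096 used in this mail
TIB = 1 << 40


@dataclass
class ThinFs:
    """Toy model only: hypothetical names, not XFS structures."""
    physical_blocks: int  # address space the filesystem is laid out over
    logical_blocks: int   # cap on blocks usable for data and metadata
    used_blocks: int = 0

    def free_blocks(self) -> int:
        # Free space is bounded by the logical size, not the physical one.
        return self.logical_blocks - self.used_blocks

    def resize_logical(self, new_logical: int) -> None:
        # In this model, grow and shrink are purely accounting changes:
        # nothing is moved, only the limit is adjusted.
        if new_logical > self.physical_blocks:
            raise ValueError("logical size cannot exceed physical size")
        if new_logical < self.used_blocks:
            raise ValueError("cannot shrink below the blocks in use")
        self.logical_blocks = new_logical


# A 32TB physical layout with 1TB of logical space.
fs = ThinFs(physical_blocks=32 * TIB // BLOCK_SIZE,
            logical_blocks=TIB // BLOCK_SIZE)
```

The model deliberately ignores fstrim, the log and per-AG state; it
only shows that the free space limit is independent of the layout size.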

The problem arises with the per AG reservations in that they are
based on the physical size of the AG, which for a thin filesystem
will always be larger than the space available. For example, we
might allocate a 32TB thin device to give 32x1TB AGs in the
filesystem, but only start by allocating 1TB of space to the
filesystem:

# mkfs.xfs -f -m rmapbt=1,reflink=1 -d size=32t,thin=1t /dev/vdc
Default configuration sourced from package build definitions
meta-data=/dev/vdc               isize=512    agcount=32, agsize=268435455 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=0, rmapbt=1, reflink=1
data     =                       bsize=4096   blocks=268435456, imaxpct=5, thin=1
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
#

The issue is now when we mount it:

# mount /dev/vdc /mnt/scratch ; df -h /mnt/scratch/ ; sudo umount /mnt/scratch
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdc       1023G  628G  395G  62% /mnt/scratch
#

Of that 1TB of space, we immediately remove 600+GB of free space for
finobt, rmapbt and reflink metadata reservations. This is based on
the physical size and number of AGs in the filesystem, so it always
gets removed from the free block count available to the user.
This is clearly seen when I grow the filesystem to 10x the size:

# xfs_growfs -D 2684354560 /mnt/scratch
....
data blocks changed from 268435456 to 2684354560
# df -h /mnt/scratch
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdc         10T  628G  9.4T   7% /mnt/scratch
#

And also shows up on shrinking back down a chunk, too:

# xfs_growfs -D 468435456 /mnt/scratch
.....
data blocks changed from 2684354560 to 468435456
# df -h /mnt/scratch
Filesystem      Size  Used Avail Use% Mounted on
/dev/vdc        1.8T  628G  1.2T  36% /mnt/scratch
#
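
As a sanity check on the df numbers above, the block counts can be
converted by hand. This is just arithmetic on the figures quoted in this
mail; the 628G constant is the "Used" value reported by df, and the
helper name is made up for the example.

```python
BLOCK_SIZE = 4096  # bytes, from bsize=4096 in the mkfs output
GIB = 1 << 30


def blocks_to_gib(blocks: int) -> float:
    """Convert a count of filesystem blocks to GiB."""
    return blocks * BLOCK_SIZE / GIB


RESERVED_GIB = 628  # "Used" column from df on the empty filesystem

# The three data block counts seen above: mkfs, grow, shrink.
for blocks in (268435456, 2684354560, 468435456):
    size = blocks_to_gib(blocks)
    share = RESERVED_GIB / size
    print(f"{blocks:>10} blocks = {size:8.1f} GiB, reserved share = {share:.1%}")
```

The reserved share comes out at roughly 61%, 6% and 35%, close to the
Use% column df prints (df computes against a slightly smaller size and
rounds up). The reservation stays constant while the logical size
changes.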

(Oh, did I mention I have working code and that's how I came across
this problem? :P)

For a normal filesystem, there's no problem with doing this brute
force physical reservation, though it is slightly disconcerting to
see a new, empty 100TB filesystem say it's got 2TB used and only
98TB free...

The issue is that for a thin filesystem, this space reservation
comes out of the *logical* free space, not the physical free space.
With 1TB of thin space, we've got 31TB of /physical free space/ the
reservation can be taken out of without the user ever seeing it. The
question is this: how on earth do I do this?

I want the available space to match the "thin=size" value on the
mkfs command line, but I don't want metadata reservations to take
away from this space. Metadata allocations need to be accounted to
the available space, but the reservations should not be. So how
should I go about providing these reservations? Do we even need them
to be accounted against free space in this case where we control the
filesystem free blocks to be a /lot/ less than the physical space?

e.g. if I limit a thin filesystem to 95% of the underlying thin
device size, then we've always got a 5% space margin and so we don't
need to take the reservations out of the global free block counter
to ensure we always have physical space for the metadata. We still
take the per-ag reservations to ensure everything still works on the
physical side, we just don't pull the space from the free block
counter. I think this will work, but I'm not sure I've fully grokked
all the conditions the per-ag reservation is protecting against or
whether there's more accounting work needed deep in allocation code
to make it work correctly.
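
A rough sketch of that idea, looking only at the global totals. The
function names are hypothetical and this is not how the per-ag
reservation code is structured; it ignores the per-AG conditions
entirely, which is exactly the part I'm unsure about.

```python
BLOCK_SIZE = 4096
GIB = 1 << 30
TIB = 1 << 40


def user_visible_free(logical_blocks: int, used_blocks: int,
                      reserved_blocks: int, charge_reservation: bool) -> int:
    """Free blocks shown to the user under either accounting scheme."""
    free = logical_blocks - used_blocks
    if charge_reservation:
        # Current behaviour: reservations come out of the free block count.
        free -= reserved_blocks
    return free


def margin_covers_reservation(physical_blocks: int, logical_blocks: int,
                              reserved_blocks: int) -> bool:
    """True if the physical space beyond the logical limit can hold the
    whole reservation (global check only, no per-AG detail)."""
    return physical_blocks - logical_blocks >= reserved_blocks


physical = 32 * TIB // BLOCK_SIZE
reserved = 628 * GIB // BLOCK_SIZE   # the df "Used" figure from above
logical_cap = physical * 95 // 100   # limit the filesystem to 95%

# With a 5% margin on a 32TB device there is about 1.6TB of slack.
ok = margin_covers_reservation(physical, logical_cap, reserved)
```

With the figures from this mail, the 5% margin is larger than the 628GB
reservation, so on global totals alone the reservation would not need to
be charged to the user-visible free block counter.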

Thoughts, anyone?

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx