On Tue, May 24, 2022 at 08:36:50AM +0300, Amir Goldstein wrote:
> On Tue, May 24, 2022 at 1:43 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> >
> > On Mon, May 23, 2022 at 02:15:44PM +0300, Amir Goldstein wrote:
> > > On Sun, Jan 10, 2021 at 6:10 PM Chandan Babu R <chandanrlinux@xxxxxxxxx> wrote:
> > > >
> > > > XFS does not check for possible overflow of per-inode extent counter
> > > > fields when adding extents to either data or attr fork.
> > > >
> > > > For example:
> > > > 1. Insert 5 million xattrs (each having a value size of 255 bytes) and
> > > > then delete 50% of them in an alternating manner.
> > > >
> > > > 2. On a 4k block sized XFS filesystem instance, the above causes 98511
> > > > extents to be created in the attr fork of the inode.
> > > >
> > > > xfsaild/loop0 2008 [003] 1475.127209: probe:xfs_inode_to_disk: (ffffffffa43fb6b0) if_nextents=98511 i_ino=131
> > > >
> > > > 3. The incore inode fork extent counter is a signed 32-bit
> > > > quantity. However, the on-disk extent counter is an unsigned 16-bit
> > > > quantity and hence cannot hold 98511 extents.
> > > >
> > > > 4. The following incorrect value is stored in the xattr extent counter,
> > > > # xfs_db -f -c 'inode 131' -c 'print core.naextents' /dev/loop0
> > > > core.naextents = -32561
> > > >
> > > > This patchset adds a new helper function
> > > > (i.e. xfs_iext_count_may_overflow()) to check for overflow of the
> > > > per-inode data and xattr extent counters and invokes it before
> > > > starting an fs operation (e.g. creating a new directory entry). With
> > > > this patchset applied, XFS detects counter overflows and returns with
> > > > an error rather than causing a silent corruption.
> > > >
> > > > The patchset has been tested by executing xfstests with the following
> > > > mkfs.xfs options,
> > > > 1. -m crc=0 -b size=1k
> > > > 2. -m crc=0 -b size=4k
> > > > 3. -m crc=0 -b size=512
> > > > 4. -m rmapbt=1,reflink=1 -b size=1k
> > > > 5. -m rmapbt=1,reflink=1 -b size=4k
> > > >
> > > > The patches can also be obtained from
> > > > https://github.com/chandanr/linux.git at branch xfs-reserve-extent-count-v14.
> > > >
> > > > I have two patches that define the newly introduced error injection
> > > > tags in xfsprogs
> > > > (https://lore.kernel.org/linux-xfs/20201104114900.172147-1-chandanrlinux@xxxxxxxxx/).
> > > >
> > > > I have also written tests
> > > > (https://github.com/chandanr/xfstests/commits/extent-overflow-tests)
> > > > for verifying the checks introduced in the kernel.
> > > >
> > > Hi Chandan and XFS folks,
> > >
> > > As you may have heard, I am working on producing a series of
> > > xfs patches for stable v5.10.y.
> > >
> > > My patch selection is documented at [1].
> > > I am in the process of testing the backport patches against the 5.10.y
> > > baseline using Luis' kdevops [2] fstests runner.
> > >
> > > The configurations that we are testing are:
> > > 1. -m rmapbt=0,reflink=1 -b size=4k (default)
> > > 2. -m crc=0 -b size=4k
> > > 3. -m crc=0 -b size=512
> > > 4. -m rmapbt=1,reflink=1 -b size=1k
> > > 5. -m rmapbt=1,reflink=1 -b size=4k
> > >
> > > This patch set is the only largish series that I selected, because:
> > > - It applies cleanly to 5.10.y
> > > - I evaluated it as low risk and high value
> >
> > What value does it provide LTS users?
> >
> Cloud providers deploy a large number of VMs/containers
> and they may use reflink. So I think this could be an issue.
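For concreteness, the -32561 that xfs_db reports in the cover letter
quoted above is simply the 98511 incore extent count truncated into
the unsigned 16-bit on-disk attr fork counter and read back as a
signed value. A minimal illustrative userspace snippet (not from the
patchset; assumes two's-complement wrapping):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint32_t if_nextents = 98511;	/* incore extent count from the trace above */

	/* only the low 16 bits fit in the on-disk attr fork extent counter */
	uint16_t ondisk = if_nextents & 0xffff;

	/* xfs_db reports the field as a signed 16-bit quantity */
	int16_t reported = (int16_t)ondisk;

	printf("stored=%u reported=%d\n", (unsigned)ondisk, (int)reported);
	/* prints: stored=32975 reported=-32561 */
	return 0;
}

The xfs_iext_count_may_overflow() check the patchset adds exists
precisely so the operation fails with an error before this truncation
can ever be written to disk.
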
Cloud providers are not deploying multi-TB VM images on XFS without
also using some mechanism for avoiding worst-case fragmentation. They
know all about the problems that manifest when extent counts get into
the tens of millions, let alone billions....

e.g. first access to a file pulls the entire extent list into memory,
so for a file with 4 billion extents this will take hours to pull into
memory (single threaded, synchronous read IO of millions of filesystem
blocks) and consume >100GB of RAM for the in-memory extent list.
Having VM startup get delayed by hours and put a massive load on the
cloud storage infrastructure for that entire length of time isn't
desirable behaviour...

For multi-TB VM image deployment - especially with reflink on the
image file - extent size hints are needed to mitigate worst case
fragmentation. Reflink copies can run at up to about 100,000
extents/s, so if you reflink a file with 4 billion extents in it, not
only do you need another 100GB RAM, you also need to wait several
hours for the reflink to run. And while that reflink is running,
nothing else has access to the data in that VM image: your VM is
*down* for *hours* while you snapshot it. Typical mitigation is extent
size hints in the MB range to reduce worst case fragmentation by two
orders of magnitude (i.e. limit to tens of millions of extents, not
billions), which brings snapshot times down to a minute or two.

IOWs, it's obviously not practical to scale VM images out to billions
of extents, even though we support extent counts in the billions.

> > This series adds almost no value to normal users - extent count
> > overflows are just something that doesn't happen in production
> > systems at this point in time. The largest data extent count I've
> > ever seen is still an order of magnitude of extents away from
> > overflowing (i.e. 400 million extents seen, 4 billion to overflow),
> > and nobody is using the attribute fork sufficiently hard to overflow
> > 65536 extents (typically a couple of million xattrs per inode).
> >
> > i.e. this series is groundwork for upcoming internal filesystem
> > functionality that requires much larger attribute forks (parent
> > pointers and fsverity merkle tree storage) to be supported, and
> > allow scope for much larger, massively fragmented VM image files
> > (beyond 16TB on 4kB block size fs for worst case
> > fragmentation/reflink).
>
> I am not sure I follow this argument.
> Users can create large attributes, can they not?

Sure. But *nobody does*, and there are good reasons we don't see
people doing this. The reality is that apps don't use xattrs heavily
because filesystems are traditionally very bad at storing even
moderate numbers of xattrs. XFS is the exception to the rule. Hence
nobody is trying to use a few million xattrs per inode right now, and
it's unlikely anyone will unless they specifically target XFS. In
which case, they are going to want the large extent count stuff that
just got merged into the for-next tree, and this whole discussion is
moot....

> And users can create massive fragmented/reflinked images, can they not?

Yes, and they will hit scalability problems long before they get
anywhere near 4 billion extents.

> If we have learned anything, it is that if users can do something (i.e. on stable),
> users will do that, so it may still be worth protecting this workflow?

If I have learned anything, it's that huge extent counts are highly
impractical for most workloads for one reason or another.
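As a concrete illustration of the mitigation mentioned above, an
extent size hint can be set from userspace with the FS_IOC_FSSETXATTR
ioctl (or, equivalently, with xfs_io -c "extsize 1m" <file>). A rough
sketch, illustrative only - it assumes the hint is applied to a
freshly created file before any extents have been allocated:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
	struct fsxattr fsx;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <new file>\n", argv[0]);
		return 1;
	}

	/* create a fresh file so the hint applies before any extents exist */
	fd = open(argv[1], O_CREAT | O_EXCL | O_RDWR, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0) {
		perror("FS_IOC_FSGETXATTR");
		return 1;
	}

	/* ask for 1MiB allocation granularity instead of per-block extents */
	fsx.fsx_extsize = 1024 * 1024;
	fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;

	if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0) {
		perror("FS_IOC_FSSETXATTR");
		return 1;
	}

	close(fd);
	return 0;
}

With a hint in the MB range, worst-case fragmentation drops by roughly
two orders of magnitude compared to per-block (4k) extents, which is
where the "minutes instead of hours" snapshot times above come from.
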
We are a long way from enabling practical use of extent counts in the
billions. Demand paging the extent list is the bare minimum we need,
but then there's the sheer scale of modifications reflink and unlink
need to make (billions of transactions to share/free billions of
individual extents) and there's no magic solution to that.

> I argue that the reason you have not seen those constructs in the wild yet
> is the time it takes until users format new xfs filesystems with mkfs.

It really has nothing to do with filesystem formats and everything to
do with the *cost* of creating, accessing, indexing and managing
billions of extents.

Have you ever tried to create a file with 4 billion extents in it?
Even using fallocate to do it as fast as possible (no data IO!), I ran
out of RAM on my 128GB test machine after 6 days of doing nothing but
running fallocate() on a single inode. The kernel died a horrible OOM
killer death at around 2.5 billion extents because the extent list
cannot be reclaimed from memory while the inode is in use, and the
kernel ran out of all other memory it could reclaim as the extent list
grew.

The only way to fix that is to make the extent lists reclaimable
(i.e. demand paging of the in-memory extent list) and that's a big
chunk of work that isn't on anyone's radar right now.

> Given your inputs, I am not sure that the fix has high value, but I must
> say I didn't fully understand your argument.
> It sounded like
> "We don't need the fix because we did not see the problem yet",
> but I may have misunderstood you.

I hope you now realise that there are much bigger practical
scalability limitations with extent lists and reflink that will
manifest in production systems long before we get anywhere near
billions of extents per inode.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
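For anyone curious what the fallocate() experiment described above
looks like in practice, here is a rough sketch of the approach
(hypothetical test code, not the actual reproducer): allocate every
other filesystem block so that no two allocations are ever adjacent,
yielding one unwritten extent per call.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	off_t blksz = 4096;		/* assumes a 4k block size filesystem */
	long long nr, i;
	int fd;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <file> <nr_extents>\n", argv[0]);
		return 1;
	}

	nr = atoll(argv[2]);
	fd = open(argv[1], O_CREAT | O_RDWR, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/*
	 * Allocate every other block so adjacent allocations can never be
	 * merged - each call adds one more extent to the data fork.
	 */
	for (i = 0; i < nr; i++) {
		if (fallocate(fd, 0, i * 2 * blksz, blksz) < 0) {
			perror("fallocate");
			return 1;
		}
	}

	close(fd);
	return 0;
}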