On Wed, May 25, 2022 at 10:33 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Tue, May 24, 2022 at 08:36:50AM +0300, Amir Goldstein wrote:
> > On Tue, May 24, 2022 at 1:43 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > >
> > > On Mon, May 23, 2022 at 02:15:44PM +0300, Amir Goldstein wrote:
> > > > On Sun, Jan 10, 2021 at 6:10 PM Chandan Babu R <chandanrlinux@xxxxxxxxx> wrote:
> > > > >
> > > > > XFS does not check for possible overflow of per-inode extent counter
> > > > > fields when adding extents to either data or attr fork.
> > > > >
> > > > > For example,
> > > > > 1. Insert 5 million xattrs (each having a value size of 255 bytes) and
> > > > >    then delete 50% of them in an alternating manner.
> > > > >
> > > > > 2. On a 4k block sized XFS filesystem instance, the above causes 98511
> > > > >    extents to be created in the attr fork of the inode.
> > > > >
> > > > >    xfsaild/loop0 2008 [003] 1475.127209: probe:xfs_inode_to_disk: (ffffffffa43fb6b0) if_nextents=98511 i_ino=131
> > > > >
> > > > > 3. The incore inode fork extent counter is a signed 32-bit
> > > > >    quantity. However, the on-disk extent counter is an unsigned 16-bit
> > > > >    quantity and hence cannot hold 98511 extents.
> > > > >
> > > > > 4. The following incorrect value is stored in the xattr extent counter,
> > > > >    # xfs_db -f -c 'inode 131' -c 'print core.naextents' /dev/loop0
> > > > >    core.naextents = -32561
> > > > >
> > > > > This patchset adds a new helper function
> > > > > (i.e. xfs_iext_count_may_overflow()) to check for overflow of the
> > > > > per-inode data and xattr extent counters and invokes it before
> > > > > starting an fs operation (e.g. creating a new directory entry). With
> > > > > this patchset applied, XFS detects counter overflows and returns with
> > > > > an error rather than causing a silent corruption.
> > > > >
> > > > > The patchset has been tested by executing xfstests with the following
> > > > > mkfs.xfs options,
> > > > > 1. -m crc=0 -b size=1k
> > > > > 2. -m crc=0 -b size=4k
> > > > > 3. -m crc=0 -b size=512
> > > > > 4. -m rmapbt=1,reflink=1 -b size=1k
> > > > > 5. -m rmapbt=1,reflink=1 -b size=4k
> > > > >
> > > > > The patches can also be obtained from
> > > > > https://github.com/chandanr/linux.git at branch xfs-reserve-extent-count-v14.
> > > > >
> > > > > I have two patches that define the newly introduced error injection
> > > > > tags in xfsprogs
> > > > > (https://lore.kernel.org/linux-xfs/20201104114900.172147-1-chandanrlinux@xxxxxxxxx/).
> > > > >
> > > > > I have also written tests
> > > > > (https://github.com/chandanr/xfstests/commits/extent-overflow-tests)
> > > > > for verifying the checks introduced in the kernel.
> > > >
> > > > Hi Chandan and XFS folks,
> > > >
> > > > As you may have heard, I am working on producing a series of
> > > > xfs patches for stable v5.10.y.
> > > >
> > > > My patch selection is documented at [1].
> > > > I am in the process of testing the backport patches against the 5.10.y
> > > > baseline using Luis' kdevops [2] fstests runner.
> > > >
> > > > The configurations that we are testing are:
> > > > 1. -m rmapbt=0,reflink=1 -b size=4k (default)
> > > > 2. -m crc=0 -b size=4k
> > > > 3. -m crc=0 -b size=512
> > > > 4. -m rmapbt=1,reflink=1 -b size=1k
> > > > 5. -m rmapbt=1,reflink=1 -b size=4k
> > > >
> > > > This patch set is the only largish series that I selected, because:
> > > > - It applies cleanly to 5.10.y
> > > > - I evaluated it as low risk and high value
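
[Side note on Chandan's example above, for anyone reading along: the
odd-looking core.naextents = -32561 is plain integer truncation. The
32-bit incore count of 98511 attr extents is stored into the 16-bit
on-disk counter and then read back as a signed value. A standalone
sketch of the arithmetic (not XFS code; the names are mine and it
assumes the usual two's-complement wraparound):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint32_t incore = 98511;            /* if_nextents from the trace above */
	uint16_t ondisk = (uint16_t)incore; /* 16-bit on-disk field: 98511 % 65536 = 32975 */
	int16_t printed = (int16_t)ondisk;  /* read back as signed, as xfs_db prints it */

	/* prints: 98511 -> 32975 -> -32561 */
	printf("%lu -> %lu -> %ld\n",
	       (unsigned long)incore, (unsigned long)ondisk, (long)printed);
	return 0;
}

That is exactly the silent corruption the new overflow checks are meant
to turn into an error before it reaches the disk.]
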
> > > What value does it provide LTS users?
> >
> > Cloud providers deploy a large number of VMs/containers
> > and they may use reflink. So I think this could be an issue.
>
> Cloud providers are not deploying multi-TB VM images on XFS without
> also using some mechanism for avoiding worst-case fragmentation.
> They know all about the problems that manifest when extent
> counts get into the tens of millions, let alone billions....
>
> e.g. first access to a file pulls the entire extent list into
> memory, so for a file with 4 billion extents this will take hours to
> pull into memory (single threaded, synchronous read IO of millions
> of filesystem blocks) and consume >100GB of RAM for the
> in-memory extent list. Having VM startup get delayed by hours and
> put a massive load on the cloud storage infrastructure for that
> entire length of time isn't desirable behaviour...
>
> For multi-TB VM image deployment - especially with reflink on the
> image file - extent size hints are needed to mitigate worst case
> fragmentation. Reflink copies can run at up to about 100,000
> extents/s, so if you reflink a file with 4 billion extents in it,
> not only do you need another 100GB RAM, you also need to wait
> several hours for the reflink to run. And while that reflink is
> running, nothing else has access to the data in that VM image: your
> VM is *down* for *hours* while you snapshot it.
>
> Typical mitigation is extent size hints in the MB ranges to reduce
> worst case fragmentation by two orders of magnitude (i.e. limit to
> tens of millions of extents, not billions) which brings snapshot
> times down to a minute or two.
>
> IOWs, it's obviously not practical to scale VM images out to
> billions of extents, even though we support extent counts in the
> billions.
>
> > > This series adds almost no value to normal users - extent count
> > > overflows are just something that doesn't happen in production
> > > systems at this point in time. The largest data extent count I've
> > > ever seen is still an order of magnitude of extents away from
> > > overflowing (i.e. 400 million extents seen, 4 billion to overflow),
> > > and nobody is using the attribute fork sufficiently hard to overflow
> > > 65536 extents (typically a couple of million xattrs per inode).
> > >
> > > i.e. this series is ground work for upcoming internal filesystem
> > > functionality that requires much larger attribute forks (parent
> > > pointers and fsverity merkle tree storage) to be supported, and
> > > allows scope for much larger, massively fragmented VM image files
> > > (beyond 16TB on 4kB block size fs for worst case
> > > fragmentation/reflink).
> >
> > I am not sure I follow this argument.
> > Users can create large attributes, can they not?
>
> Sure. But *nobody does*, and there are good reasons we don't see
> people doing this.
>
> The reality is that apps don't use xattrs heavily because
> filesystems are traditionally very bad at storing even moderate
> numbers of xattrs. XFS is the exception to the rule. Hence nobody is
> trying to use a few million xattrs per inode right now, and it's
> unlikely anyone will unless they specifically target XFS. In which
> case, they are going to want the large extent count stuff that just
> got merged into the for-next tree, and this whole discussion is
> moot....

With all the barriers to large extent counts that you mentioned, I
wonder how the large extent counters feature mitigates them, but that
is irrelevant to the question at hand.
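
(Sanity-checking the numbers above with my own rough assumptions: at
~100,000 extents/s, reflinking a file with 4 billion extents works out
to 4,000,000,000 / 100,000 = 40,000 seconds, i.e. roughly 11 hours with
the image effectively offline, and at a few tens of bytes per in-core
extent record the extent list alone is on the order of 100GB of RAM,
which matches the ">100GB" figure. With extent size hints capping
things at tens of millions of extents, the same arithmetic drops to
minutes. So the practical wall is hit long before the 32-bit counter
can overflow.)
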
> > And users can create massive fragmented/reflinked images, can they not?
>
> Yes, and they will hit scalability problems long before they get
> anywhere near 4 billion extents.
>
> > If we have learned anything, it is that if users can do something (i.e. on stable),
> > users will do that, so it may still be worth protecting this workflow?
>
> If I have learned anything, it's that huge extent counts are highly
> impractical for most workloads for one reason or another. We are a
> long way from enabling practical use of extent counts in the
> billions. Demand paging the extent list is the bare minimum we need,
> but then there's the sheer scale of modifications reflink and unlink
> need to make (billions of transactions to share/free billions of
> individual extents) and there's no magic solution to that.
>
> > I argue that the reason you did not see those constructs in the wild yet
> > is the time it takes until users format new xfs filesystems with mkfs
>
> It really has nothing to do with filesystem formats and everything
> to do with the *cost* of creating, accessing, indexing and managing
> billions of extents.
>
> Have you ever tried to create a file with 4 billion extents in it?
> Even using fallocate to do it as fast as possible (no data IO!), I
> ran out of RAM on my 128GB test machine after 6 days of doing
> nothing but running fallocate() on a single inode. The kernel died a
> horrible OOM killer death at around 2.5 billion extents because the
> extent list cannot be reclaimed from memory while the inode is in
> use and the kernel ran out of all other memory it could reclaim as
> the extent list grew.
>
> The only way to fix that is to make the extent lists reclaimable
> (i.e. demand paging of the in-memory extent list) and that's a big
> chunk of work that isn't on anyone's radar right now.
>
> > Given your inputs, I am not sure that the fix has high value, but I must
> > say I didn't fully understand your argument.
> > It sounded like
> > "We don't need the fix because we did not see the problem yet",
> > but I may have misunderstood you.
>
> I hope you now realise that there are much bigger practical
> scalability limitations with extent lists and reflink that will
> manifest in production systems long before we get anywhere near
> billions of extents per inode.
>

I do! And I *really* appreciate the time that you took to explain it
to me (and to everyone).

I'm dropping this series from my xfs-5.10.y queue.

Thanks,
Amir.