On Tue, May 24, 2022 at 07:05:07PM +0300, Amir Goldstein wrote:
> On Tue, May 24, 2022 at 8:36 AM Amir Goldstein <amir73il@xxxxxxxxx> wrote:
> > Allow me to rephrase that using a less hypothetical use case.
>
> Our team is working on an out-of-band dedupe tool, much like
> https://markfasheh.github.io/duperemove/duperemove.html
> but for larger scale filesystems and testing focus is on xfs.

dedupe is nothing new. It's being done in production systems and has
been for a while now. e.g. Veeam has a production server back end for
their reflink/dedupe based backup software that is hosted on XFS.

The only scalability issues we've seen with those systems managing
tens of TB of heavily cross-linked files so far have been limited to
how long unlink of those large files takes. Dedupe/reflink speeds up
ingest for backup farms, but it slows down removal/garbage collection
of backups that are no longer needed.

The big reflink/dedupe backup farms I've seen problems with are
generally dealing with extent counts per file in the tens of
millions, which is still very manageable. Maybe we'll see more
problems as data sets grow, but it's also likely that the crosslinked
data sets the applications build will scale out (more base files)
instead of up (larger base files). This will mean they remain at the
"tens of millions of extents per file" level and won't stress the
filesystem any more than they already do.

> In certain settings, such as containers, the tool does not control the
> running kernel and *if* we require a new kernel, the newest we can
> require in this setting is 5.10.y.

*If* you have a customer that creates a billion extents in a single
file, then you could consider backporting this. But until managing
billions of extents per file is an actual issue for production
filesystems, it's unnecessary to backport these changes.

> How would the tool know that it can safely create millions of dups
> that may get fragmented?

Millions of shared extents in a single file aren't a problem at all.
Millions of references to a single shared block aren't a problem at
all, either. But there are limits to how much you can share a single
block, and those limits are *highly variable* because they are
dependent on free space being available to record references.

e.g. XFS can share a single block a maximum of 2^32 - 1 times. If a
user turns on rmapbt, that max share limit drops way down to however
many individual rmap records can be stored in the rmap btree before
the AG runs out of space. If the AGs are small and/or full of other
data, that could limit sharing of a single block to a few hundred
references.

IOWs, applications creating shared extents must expect the operation
to fail at any time, without warning. And dedupe applications need to
be able to index multiple replicas of the same block so that they
aren't limited to deduping that data to a single block that has
arbitrary limits on how many times it can be shared.

> Does anyone *object* to including this series in the stable kernel
> after it passes the tests?

If you end up having a customer that hits a billion extents in a
single file, then you can backport these patches to the 5.10.y
series. But without any obvious production need for these patches,
they don't fit the criteria for stable backports...

Don't change what ain't broke.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
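
To make that last point concrete, here is a minimal sketch of how a
dedupe tool might drive the FIDEDUPERANGE ioctl and treat a refused
share as "try a different replica of this data" rather than a fatal
error. The helper name and the fallback policy are assumptions about
the tool, not behaviour guaranteed by XFS or any other filesystem:

	/*
	 * Hypothetical helper: dedupe one range of dst_fd against src_fd.
	 * Returns 1 if the range was deduped, 0 if the caller should retry
	 * against a different replica of the same data, -1 on hard error.
	 */
	#include <stdlib.h>
	#include <sys/ioctl.h>
	#include <sys/types.h>
	#include <linux/fs.h>	/* FIDEDUPERANGE, struct file_dedupe_range */

	static int try_dedupe_one(int src_fd, off_t src_off,
				  int dst_fd, off_t dst_off, size_t len)
	{
		struct file_dedupe_range *req;
		int ret;

		req = calloc(1, sizeof(*req) +
				sizeof(struct file_dedupe_range_info));
		if (!req)
			return -1;

		req->src_offset = src_off;
		req->src_length = len;
		req->dest_count = 1;
		req->info[0].dest_fd = dst_fd;
		req->info[0].dest_offset = dst_off;

		/* FIDEDUPERANGE is issued on the source file descriptor. */
		ret = ioctl(src_fd, FIDEDUPERANGE, req);
		if (ret < 0) {
			/* whole-call failure: bad args, fs without dedupe, ... */
			free(req);
			return -1;
		}

		if (req->info[0].status == FILE_DEDUPE_RANGE_SAME) {
			/* destination range now shares the source's blocks */
			ret = 1;
		} else {
			/*
			 * FILE_DEDUPE_RANGE_DIFFERS, or a negative errno when
			 * the filesystem cannot record another reference to
			 * the source block (the exact errno is filesystem-
			 * dependent). Either way, fall back to a different
			 * replica of the same content instead of aborting
			 * the whole run.
			 */
			ret = 0;
		}
		free(req);
		return ret;
	}

The point of the sketch is only that the tool keeps multiple candidate
source blocks for the same content indexed, so a refused share against
one of them simply redirects the request to another.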