On Tue, May 24, 2022 at 07:05:07PM +0300, Amir Goldstein wrote:
> On Tue, May 24, 2022 at 8:36 AM Amir Goldstein <amir73il@xxxxxxxxx> wrote:
> > Allow me to rephrase that using a less hypothetical use case.
>
> Our team is working on an out-of-band dedupe tool, much like
> https://markfasheh.github.io/duperemove/duperemove.html
> but for larger scale filesystems and testing focus is on xfs.

dedupe is nothing new. It's being done in production systems and has
been for a while now. e.g. Veeam has a production server back end for
their reflink/dedupe based backup software that is hosted on XFS.

The only scalability issues we've seen with those systems managing
tens of TB of heavily cross-linked files so far have been limited to
how long unlink of those large files takes. Dedupe/reflink speeds up
ingest for backup farms, but it slows down removal/garbage collection
of backups that are no longer needed.

The big reflink/dedupe backup farms I've seen problems with are
generally dealing with extent counts per file in the tens of
millions, which is still very manageable. Maybe we'll see more
problems as data sets grow, but it's also likely that the crosslinked
data sets the applications build will scale out (more base files)
instead of up (larger base files). This will mean they remain at the
"tens of millions of extents per file" level and won't stress the
filesystem any more than they already do.

> In certain settings, such as containers, the tool does not control the
> running kernel and *if* we require a new kernel, the newest we can
> require in this setting is 5.10.y.

*If* you have a customer that creates a billion extents in a single
file, then you could consider backporting this. But until managing
billions of extents per file is an actual issue for production
filesystems, it's unnecessary to backport these changes.

> How would the tool know that it can safely create millions of dups
> that may get fragmented?

Millions of shared extents in a single file aren't a problem at all.
Millions of references to a single shared block aren't a problem at
all, either. But there are limits to how much you can share a single
block, and those limits are *highly variable* because they are
dependent on free space being available to record references.

e.g. XFS can share a single block a maximum of 2^32 - 1 times. If a
user turns on rmapbt, that max share limit drops way down to however
many individual rmap records can be stored in the rmap btree before
the AG runs out of space. If the AGs are small and/or full of other
data, that could limit sharing of a single block to a few hundred
references.

IOWs, applications creating shared extents must expect the operation
to fail at any time, without warning. And dedupe applications need to
be able to index multiple replicas of the same block so that they
aren't limited to deduping that data to a single block that has
arbitrary limits on how many times it can be shared.

> Does anyone *object* to including this series in the stable kernel
> after it passes the tests?

If you end up having a customer that hits a billion extents in a
single file, then you can backport these patches to the 5.10.y
series. But without any obvious production need for these patches,
they don't fit the criteria for stable backports...

Don't change what ain't broke.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
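
To make that last point concrete, here is a minimal sketch of how a
dedupe tool might drive the FIDEDUPERANGE ioctl and treat a refused
share as "try a different replica of this data" rather than a fatal
error. The helper name and the fallback policy are assumptions about
the tool, not behaviour guaranteed by XFS or any other filesystem:

	/*
	 * Hypothetical helper: dedupe one range of dst_fd against src_fd.
	 * Returns 1 if the range was deduped, 0 if the caller should retry
	 * against a different replica of the same data, -1 on hard error.
	 */
	#include <stdlib.h>
	#include <sys/ioctl.h>
	#include <sys/types.h>
	#include <linux/fs.h>	/* FIDEDUPERANGE, struct file_dedupe_range */

	static int try_dedupe_one(int src_fd, off_t src_off,
				  int dst_fd, off_t dst_off, size_t len)
	{
		struct file_dedupe_range *req;
		int ret;

		req = calloc(1, sizeof(*req) +
				sizeof(struct file_dedupe_range_info));
		if (!req)
			return -1;

		req->src_offset = src_off;
		req->src_length = len;
		req->dest_count = 1;
		req->info[0].dest_fd = dst_fd;
		req->info[0].dest_offset = dst_off;

		/* FIDEDUPERANGE is issued on the source file descriptor. */
		ret = ioctl(src_fd, FIDEDUPERANGE, req);
		if (ret < 0) {
			/* whole-call failure: bad args, fs without dedupe, ... */
			free(req);
			return -1;
		}

		if (req->info[0].status == FILE_DEDUPE_RANGE_SAME) {
			/* destination range now shares the source's blocks */
			ret = 1;
		} else {
			/*
			 * FILE_DEDUPE_RANGE_DIFFERS, or a negative errno when
			 * the filesystem cannot record another reference to
			 * the source block (the exact errno is filesystem-
			 * dependent). Either way, fall back to a different
			 * replica of the same content instead of aborting
			 * the whole run.
			 */
			ret = 0;
		}
		free(req);
		return ret;
	}

The point of the sketch is only that the tool keeps multiple candidate
source blocks for the same content indexed, so a refused share against
one of them simply redirects the request to another.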