On Mon, Oct 14, 2024 at 12:34:37PM -0400, Brian Foster wrote:
> On Mon, Oct 14, 2024 at 03:55:24PM +0800, kernel test robot wrote:
> >
> > Hello,
> >
> > kernel test robot noticed a -98.4% regression of stress-ng.metamix.ops_per_sec on:
> >
> > commit: c5c810b94cfd818fc2f58c96feee58a9e5ead96d ("iomap: fix handling of dirty folios over unwritten extents")
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >
> > testcase: stress-ng
> > config: x86_64-rhel-8.3
> > compiler: gcc-12
> > test machine: 64 threads 2 sockets Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz (Ice Lake) with 256G memory
> > parameters:
> >
> > 	nr_threads: 100%
> > 	disk: 1HDD
> > 	testtime: 60s
> > 	fs: xfs
> > 	test: metamix
> > 	cpufreq_governor: performance
> >
> > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > the same patch/commit), kindly add following tags
> > | Reported-by: kernel test robot <oliver.sang@xxxxxxxxx>
> > | Closes: https://lore.kernel.org/oe-lkp/202410141536.1167190b-oliver.sang@xxxxxxxxx
> >
> > Details are as below:
> > -------------------------------------------------------------------------------------------------->
> >
> > The kernel config and materials to reproduce are available at:
> > https://download.01.org/0day-ci/archive/20241014/202410141536.1167190b-oliver.sang@xxxxxxxxx
> >
>
> So I basically just run this on a >64xcpu guest and reproduce the delta:
>
>   stress-ng --timeout 60 --times --verify --metrics --no-rand-seed --metamix 64
>
> The short of it is that with tracing enabled, I see a very large number
> of extending writes across unwritten mappings, which basically means XFS
> eof zeroing is calling zero range and hitting the newly introduced
> flush. This is all pretty much expected given the patch.

Ouch.

The conditions required to cause this regression are that we either
first use fallocate() to preallocate beyond EOF, or buffered writes
trigger speculative delalloc beyond EOF and that delalloc gets converted
to unwritten extents beyond EOF through background writeback or fsync
operations. Both of these lead to unwritten extents beyond EOF that
extending writes will fall into.

All we need now is for the extending writes to be slightly
non-sequential, so that they do not land at EOF but at some distance
beyond it. At this point, we trigger the new flush code. Unfortunately,
this is actually a fairly common workload pattern.

For example, experience tells me that NFS server processing of async
sequential write requests from a client will -always- end up with
slightly out of order extending writes because the incoming async write
requests are processed concurrently. Hence they always race to extend
the file, and slightly out of order file extension happens quite
frequently.

Further, the NFS client will also periodically send a write commit
request (i.e. server side fsync), so NFS server writeback will convert
the speculative delalloc that extends beyond EOF into unwritten extents
beyond EOF whilst extending write requests are still arriving from the
client.

Hence I think that there are common workloads (e.g. large sequential
writes on an NFS client) that set up the exact conditions and IO patterns
necessary to trigger this performance regression in production
systems...

> I ran a quick experiment to skip the flush on sub-4k ranges in favor of
> doing explicit folio zeroing. The idea with that is that the range is
> likely restricted to single folio and since it's dirty, we can assume
> unwritten conversion is imminent and just explicitly zero the range. I
> still see a decent number of flushes from larger ranges in that
> experiment, but that still seems to get things pretty close to my
> baseline test (on a 6.10 distro kernel).
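
To make the "slightly non-sequential extending writes" condition concrete, here is a toy userspace model in Python. It is only a sketch, not kernel code: it assumes everything beyond EOF is already unwritten preallocation, and it simply counts the writes that start past the current EOF, since those are the ones that leave a gap for EOF zeroing to cover. The function names and the "swap each adjacent pair" reordering are hypothetical, chosen just to mimic concurrently processed async writes.

```python
def count_gap_writes(offsets, write_size):
    """Return (gap_writes, final_eof) for a sequence of write start offsets.

    A write is counted as a "gap write" when it starts beyond the current
    EOF, i.e. the range [eof, offset) would have to be zeroed first.
    Assumption: all space beyond EOF is unwritten preallocation.
    """
    eof = 0
    gap_writes = 0
    for off in offsets:
        if off > eof:
            gap_writes += 1
        # An extending write moves EOF forward; others leave it alone.
        eof = max(eof, off + write_size)
    return gap_writes, eof


def swap_adjacent_pairs(n_writes, write_size):
    """Hypothetical 'slightly out of order' pattern: swap each adjacent pair."""
    offsets = [i * write_size for i in range(n_writes)]
    for i in range(0, n_writes - 1, 2):
        offsets[i], offsets[i + 1] = offsets[i + 1], offsets[i]
    return offsets


WRITE_SIZE = 4096
sequential = [i * WRITE_SIZE for i in range(8)]
reordered = swap_adjacent_pairs(8, WRITE_SIZE)

print(count_gap_writes(sequential, WRITE_SIZE))  # (0, 32768)
print(count_gap_writes(reordered, WRITE_SIZE))   # (4, 32768)
```

In this model a strictly sequential stream never starts a write beyond EOF, while swapping adjacent pairs makes every second write do so, even though both streams write exactly the same data and end at the same file size.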

What filesystems other than XFS actually need this iomap bandaid right
now? If there are none (which I think is the case), then we should just
revert this change until a more performant fix is available for XFS.

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx