On Sat, Jan 01, 2022 at 05:39:45PM +0000, Trond Myklebust wrote:
> On Sat, 2022-01-01 at 14:55 +1100, Dave Chinner wrote:
> > As it is, if you are getting soft lockups in this location, that's
> > an indication that the ioend chain that is being built by XFS is
> > way, way too long. IOWs, the completion latency problem is caused by
> > a lack of submit side ioend chain length bounding in combination
> > with unbound completion side merging in xfs_end_bio - it's not a
> > problem with the generic iomap code....
> >
> > Let's try to address this in the XFS code, rather than hack
> > unnecessary band-aids over the problem in the generic code...
> >
> > Cheers,
> >
> > Dave.
>
> Fair enough. As long as someone is working on a solution, then I'm
> happy. Just a couple of things:
>
> Firstly, we've verified that the cond_resched() in the bio loop does
> suffice to resolve the issue with XFS, which would tend to confirm what
> you're saying above about the underlying issue being the ioend chain
> length.
>
> Secondly, note that we've tested this issue with a variety of older
> kernels, including 4.18.x, 5.1.x and 5.15.x, so please bear in mind
> that it would be useful for any fix to be backward portable through the
> stable mechanism.

The infrastructure hasn't changed that much, so whatever the result
is it should be backportable.

As it is, is there a specific workload that triggers this issue? Or
a specific machine config (e.g. large memory, slow storage). Are
there large fragmented files in use (e.g. randomly written VM image
files)? There are a few factors that can exacerbate the ioend chain
lengths, so it would be handy to have some idea of what is actually
triggering this behaviour...

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx