Re: [PATCH] iomap: Address soft lockup in iomap_finish_ioend()

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 4 Jan 2022 09:03:10 +1100

On Sat, Jan 01, 2022 at 05:39:45PM +0000, Trond Myklebust wrote:
> On Sat, 2022-01-01 at 14:55 +1100, Dave Chinner wrote:
> > As it is, if you are getting soft lockups in this location, that's
> > an indication that the ioend chain that is being built by XFS is
> > way, way too long. IOWs, the completion latency problem is caused by
> > a lack of submit side ioend chain length bounding in combination
> > with unbound completion side merging in xfs_end_bio - it's not a
> > problem with the generic iomap code....
> > 
> > Let's try to address this in the XFS code, rather than hack
> > unnecessary band-aids over the problem in the generic code...
> > 
> > Cheers,
> > 
> > Dave.
> 
> Fair enough. As long as someone is working on a solution, then I'm
> happy. Just a couple of things:
> 
> Firstly, we've verified that the cond_resched() in the bio loop does
> suffice to resolve the issue with XFS, which would tend to confirm what
> you're saying above about the underlying issue being the ioend chain
> length.
> 
> Secondly, note that we've tested this issue with a variety of older
> kernels, including 4.18.x, 5.1.x and 5.15.x, so please bear in mind
> that it would be useful for any fix to be backward portable through the
> stable mechanism.

The infrastructure hasn't changed that much, so whatever the result
is it should be backportable.

As it is, is there a specific workload that triggers this issue? Or
a specific machine config (e.g. large memory, slow storage)? Are
there large fragmented files in use (e.g. randomly written VM image
files)? There are a few factors that can exacerbate the ioend chain
lengths, so it would be handy to have some idea of what is actually
triggering this behaviour...
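
To make the distinction between the two approaches concrete, here is a
rough userspace model. It is not kernel code, and every name and number
in it is made up for illustration. bound_at_submit() caps how long any
one chain can get, so no single completion runs for longer than the cap.
yield_at_completion() leaves the chain unbounded and only counts how
often a cond_resched()-style break would be taken while walking it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: one "unit" stands for one page of completion work. */
struct completion_cost {
	size_t chains;      /* number of separate completions */
	size_t longest_run; /* most units handled by a single completion */
};

/*
 * Submit-side bounding: a run of `total` units is cut into chains of
 * at most `max_chain` units, so no single completion exceeds the bound.
 */
static struct completion_cost bound_at_submit(size_t total, size_t max_chain)
{
	struct completion_cost c = { 0, 0 };

	assert(max_chain > 0);
	while (total > 0) {
		size_t len = total < max_chain ? total : max_chain;

		c.chains++;
		if (len > c.longest_run)
			c.longest_run = len;
		total -= len;
	}
	return c;
}

/*
 * Completion-side yielding: one chain of `total` units with a
 * reschedule point after every `batch` units (the cond_resched()
 * analogue). Returns how many reschedule points were taken; none is
 * taken after the final unit because the loop is about to exit anyway.
 */
static size_t yield_at_completion(size_t total, size_t batch)
{
	size_t done, yields = 0;

	assert(batch > 0);
	for (done = 1; done <= total; done++) {
		if (done % batch == 0 && done < total)
			yields++;
	}
	return yields;
}
```

In this model the completion-side version still has to walk the whole
chain in one go, it just gives the scheduler a chance to run something
else along the way. The submit-side version never builds the long chain
in the first place.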

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx