Re: [PATCH] iomap: Address soft lockup in iomap_finish_ioend()

On Tue, 2022-01-04 at 09:03 +1100, Dave Chinner wrote:
> On Sat, Jan 01, 2022 at 05:39:45PM +0000, Trond Myklebust wrote:
> > On Sat, 2022-01-01 at 14:55 +1100, Dave Chinner wrote:
> > > As it is, if you are getting soft lockups in this location, that's
> > > an indication that the ioend chain that is being built by XFS is
> > > way, way too long. IOWs, the completion latency problem is caused by
> > > a lack of submit side ioend chain length bounding in combination with
> > > unbound completion side merging in xfs_end_bio - it's not a problem
> > > with the generic iomap code....
> > > 
> > > Let's try to address this in the XFS code, rather than hack
> > > unnecessary band-aids over the problem in the generic code...
> > > 
> > > Cheers,
> > > 
> > > Dave.
> > 
> > Fair enough. As long as someone is working on a solution, then I'm
> > happy. Just a couple of things:
> > 
> > Firstly, we've verified that the cond_resched() in the bio loop does
> > suffice to resolve the issue with XFS, which would tend to confirm
> > what you're saying above about the underlying issue being the ioend
> > chain length.
> > 
> > Secondly, note that we've tested this issue with a variety of older
> > kernels, including 4.18.x, 5.1.x and 5.15.x, so please bear in mind
> > that it would be useful for any fix to be backward portable through
> > the stable mechanism.
> 
> The infrastructure hasn't changed that much, so whatever the result
> is it should be backportable.
> 
> As it is, is there a specific workload that triggers this issue? Or
> a specific machine config (e.g. large memory, slow storage). Are
> there large fragmented files in use (e.g. randomly written VM image
> files)? There are a few factors that can exacerbate the ioend chain
> lengths, so it would be handy to have some idea of what is actually
> triggering this behaviour...
> 
> Cheers,
> 
> Dave.
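
(For reference, the cond_resched() mentioned above is only about giving
the scheduler a chance during a long completion run. Whether it sits in
the per-bio walk inside iomap_finish_ioend() or in the per-ioend loop
sketched below is a detail; this is illustrative only, not the exact
diff that was posted:)

/*
 * Sketch: yield between merged ioends so that completing a very long
 * chain cannot hog the CPU and trip the soft lockup watchdog.
 */
void iomap_finish_ioends(struct iomap_ioend *ioend, int error)
{
        struct list_head tmp;

        list_replace_init(&ioend->io_list, &tmp);
        iomap_finish_ioend(ioend, error);

        while (!list_empty(&tmp)) {
                ioend = list_first_entry(&tmp, struct iomap_ioend, io_list);
                list_del_init(&ioend->io_list);
                iomap_finish_ioend(ioend, error);
                cond_resched();         /* the band-aid under discussion */
        }
}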

We have different reproducers. The common feature appears to be the
need for a decently fast box with fairly large memory (128GB in one
case, 400GB in the other). It has been reproduced with HDDs, SSDs and
NVMe devices.

On the 128GB box, we had it set up with 10+ disks in a JBOD
configuration and were running the AJA system tests.

On the 400GB box, we were just serially creating large (> 6GB) files
using fio, and that was occasionally triggering the issue. However,
running that workload under strace, with the trace output written to
disk, reproduced the problem faster :-).

So really, it seems as if the trigger is 'lots of data in the page
cache' which then gets flushed out.
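
(To make that concrete, the shape of the trigger is roughly the
following; this is a self-contained sketch with a made-up path and
sizes, not our actual fio job:)

/*
 * Illustrative only: dirty a lot of page cache with large buffered
 * sequential writes, then flush it all out in one go.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        const size_t chunk = 1024 * 1024;                  /* 1 MiB writes */
        const long long total = 6LL * 1024 * 1024 * 1024;  /* ~6 GiB file */
        char *buf = malloc(chunk);
        long long done;
        int fd;

        if (!buf)
                return 1;
        memset(buf, 0xab, chunk);

        fd = open("/mnt/scratch/bigfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* Build up a large amount of dirty data in the page cache... */
        for (done = 0; done < total; done += chunk) {
                if (write(fd, buf, chunk) != (ssize_t)chunk) {
                        perror("write");
                        return 1;
                }
        }

        /* ...then flush it out, producing one large burst of writeback. */
        fsync(fd);
        close(fd);
        free(buf);
        return 0;
}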

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx
