RE: writeback completion soft lockup BUG in folio_wake_bit()

> -----Original Message-----
> From: Williams, Dan J <dan.j.williams@xxxxxxxxx>
> Sent: Monday, October 24, 2022 3:36 PM
> To: Torvalds, Linus <torvalds@xxxxxxxxxxxxxxxxxxxx>; Williams, Dan J
> <dan.j.williams@xxxxxxxxx>
> Cc: Matthew Wilcox <willy@xxxxxxxxxxxxx>; Brian Foster
> <bfoster@xxxxxxxxxx>; Linux-MM <linux-mm@xxxxxxxxx>; linux-fsdevel
> <linux-fsdevel@xxxxxxxxxxxxxxx>; linux-xfs <linux-xfs@xxxxxxxxxxxxxxx>;
> Hugh Dickins <hughd@xxxxxxxxxx>; Arechiga Lopez, Jesus A
> <jesus.a.arechiga.lopez@xxxxxxxxx>; tim.c.chen@xxxxxxxxxxxxxxx
> Subject: Re: writeback completion soft lockup BUG in folio_wake_bit()
> 
> Linus Torvalds wrote:
> > On Mon, Oct 24, 2022 at 1:13 PM Dan Williams <dan.j.williams@xxxxxxxxx>
> wrote:
> > >
> > > Arechiga reports that his test case that failed "fast" before now
> > > ran for 28 hours without a soft lockup report with the proposed
> > > patches applied. So, I would consider those:
> > >
> > > Tested-by: Jesus Arechiga Lopez <jesus.a.arechiga.lopez@xxxxxxxxx>
> >
> > Ok, great.
> >
> > I really like that patch myself (and obviously liked it back when it
> > was originally proposed), but I think it was always held back by the
> > fact that we didn't really have any hard data for it.
> >
> > It does sound like we now very much have hard data for "the page
> > waitlist complexity is now a bigger problem than the historical
> > problem it tried to solve".
> >
> > So I'll happily apply it. The only question is whether it's a "let's
> > do this for 6.2", or if it's something that we'd want to back-port
> > anyway, and might as well apply sooner rather than later as a fix.
> >
> > I think that in turn then depends on just how artificial the test case
> > was. If the test case was triggered by somebody seeing problems in
> > real life loads, that would make the urgency a lot higher. But if it
> > was purely a synthetic test case with no accompanying "this is what
> > made us look at this" problem, it might be a 6.2 thing.
> >
> > Arechiga?
> 
> I will let Arechiga reply as well, but my sense is that this is more in the latter
> camp of not urgent because the test case is trying to generate platform
> stress (success!), not necessarily trying to get real work done.

Yes, as Dan mentioned, the test is trying to generate platform stress. We've been seeing the soft lockup events on test targets (2-socket systems with high-core-count CPUs and a lot of RAM).

The workload stresses every core/CPU thread in various ways and logs its results to a shared log file, with every core writing to the same file. We found that this shared log file was related to the soft lockups; a rough sketch of the pattern is below.
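
For reference, the contention pattern looks roughly like the following. This is only a minimal, hypothetical sketch, not our actual test case; the file name, thread count, and write sizes are made up. It just illustrates "one appending logger per online CPU, all hitting the same file":

/*
 * Hypothetical sketch of the load described above: one logger thread
 * per online CPU, all appending to a single shared log file.  Not the
 * real test case, just the general shape of the stress.
 *
 * Build with: cc -O2 -pthread sketch.c -o sketch
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NWRITES 100000		/* arbitrary per-thread write count */

static int log_fd;

static void *logger(void *arg)
{
	long id = (long)arg;
	char line[64];
	int len = snprintf(line, sizeof(line),
			   "thread %ld: work item done\n", id);

	for (int i = 0; i < NWRITES; i++) {
		/* O_APPEND: every thread appends to the same file */
		if (write(log_fd, line, len) < 0) {
			perror("write");
			break;
		}
	}
	return NULL;
}

int main(void)
{
	long nthreads = sysconf(_SC_NPROCESSORS_ONLN);
	pthread_t *tids = calloc(nthreads, sizeof(*tids));

	log_fd = open("shared.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
	if (log_fd < 0 || !tids) {
		perror("setup");
		return 1;
	}

	/* one logger per online CPU, all writing to the shared log */
	for (long i = 0; i < nthreads; i++)
		pthread_create(&tids[i], NULL, logger, (void *)i);
	for (long i = 0; i < nthreads; i++)
		pthread_join(tids[i], NULL);

	close(log_fd);
	free(tids);
	return 0;
}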

With this change applied to 5.19, it appears the soft lockups no longer occur with this workload and configuration.



