Re: writeback completion soft lockup BUG in folio_wake_bit()

Dan Williams <dan.j.williams@xxxxxxxxx> · Wed, 19 Oct 2022 18:35:44 -0700

Linus Torvalds wrote:
> On Fri, Mar 18, 2022 at 7:45 AM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> >
> > Excellent!  I'm going to propose these two patches for -rc1 (I don't
> > think we want to be playing with this after -rc8)
> 
> Ack. I think your commit message may be a bit too optimistic (who
> knows if other loads can trigger the over-long page locking wait-queue
> latencies), but since I don't see any other ways to really check this
> than just trying it, let's do it.
> 
>                  Linus

A report from a tester with this call trace:

 watchdog: BUG: soft lockup - CPU#127 stuck for 134s! [ksoftirqd/127:782]
 RIP: 0010:_raw_spin_unlock_irqrestore+0x19/0x40
 [..]
 Call Trace:
  <TASK>
  folio_wake_bit+0x8a/0x110
  folio_end_writeback+0x37/0x80
  ext4_finish_bio+0x19a/0x270
  ext4_end_bio+0x47/0x140
  blk_update_request+0x112/0x410

...led me to this thread. This was after I had them force all softirqs
to run in ksoftirqd context, and run with rq_affinity == 2 to force
I/O completion work to throttle new submissions.

Willy, are these headed upstream:

https://lore.kernel.org/all/YjSbHp6B9a1G3tuQ@xxxxxxxxxxxxxxxxxxxx

...or am I missing an alternate solution posted elsewhere?