On Sun, 2022-10-23 at 15:38 -0700, Linus Torvalds wrote:
> On Wed, Oct 19, 2022 at 6:35 PM Dan Williams
> <dan.j.williams@xxxxxxxxx> wrote:
> >
> > A report from a tester with this call trace:
> >
> >  watchdog: BUG: soft lockup - CPU#127 stuck for 134s! [ksoftirqd/127:782]
> >  RIP: 0010:_raw_spin_unlock_irqrestore+0x19/0x40 [..]
>
> Whee.
>
> > ...lead me to this thread. This was after I had them force all softirqs
> > to run in ksoftirqd context, and run with rq_affinity == 2 to force
> > I/O completion work to throttle new submissions.
> >
> > Willy, are these headed upstream:
> >
> > https://lore.kernel.org/all/YjSbHp6B9a1G3tuQ@xxxxxxxxxxxxxxxxxxxx
> >
> > ...or I am missing an alternate solution posted elsewhere?
>
> Can your reporter test that patch? I think it should still apply
> pretty much as-is.. And if we actually had somebody who had a
> test-case that was literally fixed by getting rid of the old bookmark
> code, that would make applying that patch a no-brainer.
>
> The problem is that the original load that caused us to do that thing
> in the first place isn't repeatable because it was special production
> code - so removing that bookmark code because we _think_ it now hurts
> more than it helps is kind of a big hurdle.
>
> But if we had some hard confirmation from somebody that "yes, the
> bookmark code is now hurting", that would make it a lot more palatable
> to just remove the code that we just _think_ that probably isn't
> needed any more..
>

I do think that the original locked-page-on-migration problem was fixed
by commit 9a1ea439b16b. Unfortunately, the customer did not respond
when we asked them to test their workload after that patch went into
mainline.

I have no objection to Matthew's fix removing the bookmark code, now
that it is causing problems in this scenario, which I didn't anticipate
in my original code.

Tim