On Thu, Oct 03, 2024 at 08:50:37PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 03, 2024 at 09:04:30AM -0700, Paul E. McKenney wrote:
> > On Thu, Oct 03, 2024 at 04:22:40PM +0200, Peter Zijlstra wrote:
> > > On Thu, Oct 03, 2024 at 05:45:47AM -0700, Paul E. McKenney wrote:
> > >
> > > > I ran 100*TREE03 for 18 hours each, and got 23 instances of *something*
> > > > happening (and I need to suppress stalls on the repeat). One of the
> > > > earlier bugs happened early, but sadly not this one.
> > >
> > > Damn, I don't have the amount of CPU hours available you mention in your
> > > later email. I'll just go up the rounds to 20 minutes and see if
> > > something wants to go bang before I have to shut down the noise
> > > pollution for the day...
> >
> > Indeed, this was one reason I was soliciting debug patches. ;-)
>
> Sooo... I was contemplating if something like the below might perhaps
> help some. It's a bit of a mess (I'll try and clean up if/when it
> actually proves to work), but it compiles and survives a hand full of 1m
> runs.

And here is the ftrace dump from one of the failures in the past 18-hour
run. Idiot here re-enabled RCU CPU stall warnings after doing
ftrace_dump(), forgetting the asynchronous nature of new-age printk(),
so I don't have the CPU number that the failure happened on.

Off to test your new patch...

							Thanx, Paul