On Sat, Jun 27, 2020 at 08:32:54AM +1000, Dave Chinner wrote:
> Observation from the outside:
>
> "However I'm having trouble convincing myself that's actually
> possible on x86_64.... "

Using the weaker rules of LKMM (as relevant to Power) I could in fact
make it happen, the 'problem' is that it's being observed on the much
stronger x86_64.

So possibly I did overlook a more 'sensible' scenario, but I'm pretty
confident the problem holds as it fully explains the failure mode.

> This scheduler code has fallen off a really high ledge on the memory
> barrier cliff, hasn't it?

Just a wee bit.. I did need pen and paper and a fair amount of
scribbling for this one.

> Having looked at this code over the past 24 hours and the recent
> history, I know that understanding it - let alone debugging and
> fixing problem in it - is way beyond my capabilities. And I say
> that as an experienced kernel developer with a pretty good grasp of
> concurrent programming and a record of implementing a fair number of
> non-trivial lockless algorithms over the years....

All in the name of making it go fast, I suppose. It used to be much
simpler... like much of the kernel.

The biggest problem I had with this thing was that the reproduction case
we had (Paul's rcutorture) wouldn't readily trigger on my machines
(although it did, but at a much lower rate, just twice in a week's worth
of runtime).

Also; I'm sure you can spot a problem in the I/O layer much faster than
I possibly could :-)

Anyway, let me know if you still observe any problems.