On Thu, Jul 23, 2020 at 5:47 PM Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Thu, Jul 23, 2020 at 5:07 PM Hugh Dickins <hughd@xxxxxxxxxx> wrote:
> >
> > I say that for full disclosure, so you don't wrack your brains
> > too much, when it may still turn out to be a screwup on my part.
>
> Sounds unlikely.
>
> If that patch applied even reasonably closely, I don't see how you'd
> see a list corruption that wasn't due to the patch.
>
> You'd have had to use the wrong spinlock by mistake due to munging it,
> or something crazy like that.
>
> The main list-handling changes are
>
>  (a) open-coding of that finish_wait()
>
>  (b) slightly different heuristics for removal in the wakeup function
>
> where (a) was because my original version of finishing the wait needed
> to do that return code checking.
>
> So a normal "finish_wait()" just does
>
>         list_del_init(&wait->entry);
>
> whereas my open-coded one replaced that with
>
>         if (!list_empty(&wait->entry)) {
>                 list_del(&wait->entry);
>                 ret = -EINTR;
>         }
>
> and apart from that "set return to -EINTR because nobody woke us up",
> it also uses just a regular "list_del()" rather than a
> "list_del_init()". That causes the next/prev fields to be poisoned
> rather than re-initialized. But that second change was because the
> list entry is on the stack, and we're not touching it any more and are
> about to return, so I removed the "init" part.
>
> Anyway, (a) really looks harmless. Unless the poisoning now triggers
> some racy debug test that had been hidden by the "init". Hmm.
>
> In contrast, (b) means that the likely access patterns of irqs
> removing the wait entry from the list might be very different from
> before. The old "autoremove" semantics would only remove the entry
> from the list when try_to_wake_up() actually woke things up. Now, a
> successful bit state test _always_ removes it, which was kind of the
> point. But it might cause very different list handling patterns.
>
> All the actual list handling looks "obviously safe" because it's
> protected by the spinlock, though...
>
> If you do get oopses with the new patch too, try to send me a copy,
> and maybe I'll stare at exactly where it happens and the register
> contents and go "aah".

This new version is doing much better: many hours to go, but all the
machines have got beyond the danger point where yesterday's version
was crashing - phew!

Hugh
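
For context, here is a minimal userspace sketch of the list_del() vs
list_del_init() semantics being discussed above: list_del() unlinks the
entry and poisons its next/prev pointers so that any later use of the
stale entry faults loudly, while list_del_init() unlinks it and leaves
it as a valid empty list that can still be tested with list_empty().
The poison constants mirror the kernel's include/linux/poison.h
defaults (assuming a POISON_POINTER_DELTA of 0); everything else is
simplified for illustration and is not the kernel's actual
implementation.

        #include <assert.h>
        #include <stdbool.h>

        struct list_head {
                struct list_head *next, *prev;
        };

        /* Poison values matching include/linux/poison.h defaults. */
        #define LIST_POISON1 ((struct list_head *)0x100)
        #define LIST_POISON2 ((struct list_head *)0x122)

        static void INIT_LIST_HEAD(struct list_head *list)
        {
                list->next = list;
                list->prev = list;
        }

        static bool list_empty(const struct list_head *head)
        {
                return head->next == head;
        }

        static void list_add(struct list_head *new, struct list_head *head)
        {
                new->next = head->next;
                new->prev = head;
                head->next->prev = new;
                head->next = new;
        }

        /* Unlink and poison: later traversal through the entry oopses. */
        static void list_del(struct list_head *entry)
        {
                entry->prev->next = entry->next;
                entry->next->prev = entry->prev;
                entry->next = LIST_POISON1;
                entry->prev = LIST_POISON2;
        }

        /* Unlink and re-initialize: entry is a valid empty list after. */
        static void list_del_init(struct list_head *entry)
        {
                entry->prev->next = entry->next;
                entry->next->prev = entry->prev;
                INIT_LIST_HEAD(entry);
        }

        int main(void)
        {
                struct list_head head, a, b;

                INIT_LIST_HEAD(&head);
                list_add(&a, &head);
                list_add(&b, &head);

                /* finish_wait() style: entry stays safe to inspect. */
                list_del_init(&a);
                assert(list_empty(&a));

                /* Open-coded style: poisoned, must not be touched again. */
                list_del(&b);
                assert(b.next == LIST_POISON1 && b.prev == LIST_POISON2);

                return 0;
        }

This is why dropping the "init" is defensible in the open-coded path:
the wait entry lives on the stack and is never touched again before the
function returns, whereas list_del_init() is what lets autoremove-style
waiters safely re-test list_empty(&wait->entry) afterwards.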