On Thu, 23 Jul 2020, Linus Torvalds wrote:
> On Thu, Jul 23, 2020 at 4:11 PM Hugh Dickins <hughd@xxxxxxxxxx> wrote:
> > On Thu, 23 Jul 2020, Linus Torvalds wrote:
> > >
> > > I'll send a new version after I actually test it.
> >
> > I'll give it a try when you're happy with it.
>
> Ok, what I described is what I've been running for a while now. But I
> don't put much stress on my system with my normal workload, so..
>
> > I did try yesterday's
> > with my swapping loads on home machines (3 of 4 survived 16 hours),
> > and with some google stresstests on work machines (0 of 10 survived).
> >
> > I've not spent long analyzing the crashes, all of them in or below
> > __wake_up_common() called from __wake_up_locked_key_bookmark():
> > sometimes gets to run the curr->func() and crashes on something
> > inside there (often list_del's lib/list_debug.c:53!), sometimes
> > cannot get that far. Looks like the wait queue entries on the list
> > were not entirely safe with that patch.
>
> Hmm. The bug Oleg pointed out should be pretty theoretical. But I
> think the new approach with WQ_FLAG_WOKEN was much better anyway,
> despite me missing that one spot in the first version of the patch.
>
> So here's two patches - the first one does that wake_page_function()
> conversion, and the second one just does the memory ordering cleanup I
> mentioned.
>
> I don't think the second one shouldn't matter on x86, but who knows.
>
> I don't enable list debugging, but I find list corruption surprising.
> All of _that_ should be inside the page waiqueue lock, the only
> unlocked part was the "list_empty_careful()" part.
>
> But I'll walk over my patch mentally one more time. Here's the current
> version, anyway.

Thanks, I'll start some tests going shortly.

I do have to "port" these patches to a different kernel, and my first
assumption on seeing crashes was that I'd screwed that up; but that
seemed much less likely once the home test on top of v5.8-rc5 crashed
in much the same way.
The latter was not a list_del() crash, but on curr->func itself; but
I take them all as just indicating that the wait queue entry can in
rare cases be freed and reused.

(And the amount of "port"ing was close to nil here: our trees did
differ on an "unlikely" that one end had added or removed, plus I
did start off by reverting two of my three patches. But perhaps I'm
missing a subtle dependence on differences elsewhere in the tree.)

I say that for full disclosure, so you don't wrack your brains
too much, when it may still turn out to be a screwup on my part.

Hugh