On Sat, 23 Oct 2021, Mel Gorman wrote:
> On Fri, Oct 22, 2021 at 10:26:30PM +1100, NeilBrown wrote:
> > On Fri, 22 Oct 2021, Mel Gorman wrote:
> > > On Fri, Oct 22, 2021 at 12:15:10PM +1100, NeilBrown wrote:
> > > >
> > > > In general, I still don't like the use of wake_up_all(), though it won't
> > > > cause incorrect behaviour.
> > > >
> > >
> > > Removing wake_up_all would be tricky.
> >
> > I think there is a misunderstanding.  Removing wake_up_all() is as
> > simple as
> >    s/wake_up_all/wake_up/
> >
> > If you used prepare_to_wait_exclusive(), then wake_up() would only wake
> > one waiter, while wake_up_all() would wake all of them.
> > As you use prepare_to_wait(), wake_up() will wake all waiters - as will
> > wake_up_all().
> >
>
> Ok, yes, there was a misunderstanding. I thought you were suggesting a
> move to exclusive wakeups. I felt that the wake_up_all was explicit in
> terms of intent and that I really meant for all tasks to wake instead of
> one at a time.

Fair enough.  Thanks for changing it :-)

But this prompts me to wonder if exclusive wakeups would be a good idea
- which is a useful springboard for trying to understand the code better.

For VMSCAN_THROTTLE_ISOLATED they probably are.
One pattern for reliable exclusive wakeups is for any thread that
received a wake-up to then consider sending the next wake-up.

Two places receive VMSCAN_THROTTLE_ISOLATED wakeups, and both then call
too_many_isolated(), which - on success - sends another wakeup - before
the caller has had a chance to isolate anything.  If, instead, the
wakeup were sent some time later, after pages had been isolated but
before the caller (isolate_migratepages_block() or
shrink_inactive_list()) returned, then we would get an orderly
progression of threads running through that code.

For VMSCAN_THROTTLE_WRITEBACK, things are a little less straightforward.
There are two different places that wait for the wakeup, and a wake_up
is sent to all waiters after a time proportional to the number of
waiters.  Might it make sense to wake one thread per time unit?  That
might work well for do_writepages - every SWAP_CLUSTER_MAX writes
triggers one wakeup.  I'm less sure that it would work for
shrink_node().  Maybe the shrink_node() waiters could be non-exclusive,
so they get woken as soon as SWAP_CLUSTER_MAX writes complete, while
the do_writepages() waiters are exclusive and get woken one at a time.

For VMSCAN_THROTTLE_NOPROGRESS .... I don't understand.
If one zone isn't making "enough" progress, we throttle before moving
on to the next zone.  So we delay processing of the next zone, and only
indirectly delay re-processing of the current congested zone.  Maybe it
makes sense, but I don't see it yet.  I note that the commit message
says "it's messy".  I can't argue with that!

I'll follow up with patches to clarify what I am thinking about the
first two.  I'm not proposing the patches, just presenting them as part
of improving my understanding.

Thanks,
NeilBrown
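
P.S. To make the wake_up()/wake_up_all() distinction concrete (for my
own benefit as much as anyone's), here is a minimal, untested sketch.
demo_wqh and the cond() callback are made-up names, not anything in
mm/; the only difference between the two helpers is which
prepare_to_wait*() they call:

#include <linux/wait.h>
#include <linux/sched.h>

static DECLARE_WAIT_QUEUE_HEAD(demo_wqh);	/* illustrative only */

/* Non-exclusive wait, as the throttling code does today: every
 * waiter is woken by wake_up() and wake_up_all() alike.
 */
static void demo_wait(bool (*cond)(void))
{
	DEFINE_WAIT(wait);

	prepare_to_wait(&demo_wqh, &wait, TASK_UNINTERRUPTIBLE);
	if (!cond())
		schedule();
	finish_wait(&demo_wqh, &wait);
}

/* Exclusive wait: wake_up() now wakes only the first queued
 * exclusive waiter; wake_up_all() still wakes everyone.
 */
static void demo_wait_exclusive(bool (*cond)(void))
{
	DEFINE_WAIT(wait);

	prepare_to_wait_exclusive(&demo_wqh, &wait, TASK_UNINTERRUPTIBLE);
	if (!cond())
		schedule();
	finish_wait(&demo_wqh, &wait);
}

So with the second form, s/wake_up_all/wake_up/ changes "wake everyone"
into "wake the head of the queue", which is the substitution I was
alluding to above.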
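
For VMSCAN_THROTTLE_ISOLATED, the "wakee passes the baton" ordering I
describe above would look roughly like this.  This is hand-waved:
too_many_isolated() here stands in for the real helpers in mm/vmscan.c
and mm/compaction.c, and the timeout handling of the real code is
omitted:

static bool too_many_isolated(void);	/* stand-in for the mm/ helpers */

static void demo_isolate_pages(void)
{
	DEFINE_WAIT(wait);

	/* Queue exclusively, so wake_up() releases one thread at a time. */
	for (;;) {
		prepare_to_wait_exclusive(&demo_wqh, &wait,
					  TASK_UNINTERRUPTIBLE);
		if (!too_many_isolated())
			break;
		schedule();
	}
	finish_wait(&demo_wqh, &wait);

	/* ... isolate up to SWAP_CLUSTER_MAX pages here ... */

	/*
	 * Pass the baton only now, after this thread has isolated its
	 * pages, rather than from inside too_many_isolated() before any
	 * isolation has happened.
	 */
	if (!too_many_isolated())
		wake_up(&demo_wqh);
}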
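
And for the VMSCAN_THROTTLE_WRITEBACK idea, a mixed queue falls out
naturally from the waitqueue semantics: wake_up() wakes one exclusive
waiter plus all non-exclusive waiters, so do_writepages() callers could
queue exclusively and shrink_node() callers non-exclusively on the same
head.  A sketch of the write-completion side - again with made-up
names, and glossing over where the counter would actually live
(presumably per-pgdat):

#include <linux/atomic.h>
#include <linux/swap.h>		/* SWAP_CLUSTER_MAX */

static atomic_t demo_writes_done = ATOMIC_INIT(0);

static void demo_note_write_completed(void)
{
	/*
	 * Every SWAP_CLUSTER_MAX completed writes, release one
	 * exclusive (do_writepages) waiter; any non-exclusive
	 * (shrink_node) waiters on the same queue are all woken
	 * at the same time.
	 */
	if (atomic_inc_return(&demo_writes_done) % SWAP_CLUSTER_MAX == 0)
		wake_up(&demo_wqh);
}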