On Sat, 23 Oct 2021, Mel Gorman wrote:
> On Fri, Oct 22, 2021 at 10:26:30PM +1100, NeilBrown wrote:
> > On Fri, 22 Oct 2021, Mel Gorman wrote:
> > > On Fri, Oct 22, 2021 at 12:15:10PM +1100, NeilBrown wrote:
> > > >
> > > > In general, I still don't like the use of wake_up_all(), though it won't
> > > > cause incorrect behaviour.
> > > >
> > >
> > > Removing wake_up_all would be tricky.
> >
> > I think there is a misunderstanding.  Removing wake_up_all() is as
> > simple as
> >    s/wake_up_all/wake_up/
> >
> > If you used prepare_to_wait_exclusive(), then wake_up() would only wake
> > one waiter, while wake_up_all() would wake all of them.
> > As you use prepare_to_wait(), wake_up() will wake all waiters - as will
> > wake_up_all().
> >
>
> Ok, yes, there was a misunderstanding. I thought you were suggesting a
> move to exclusive wakeups. I felt that the wake_up_all was explicit in
> terms of intent and that I really meant for all tasks to wake instead of
> one at a time.

Fair enough.  Thanks for changing it :-)

But this prompts me to wonder if exclusive wakeups would be a good idea
- which is a useful springboard for trying to understand the code better.

For VMSCAN_THROTTLE_ISOLATED they probably are.
One pattern for reliable exclusive wakeups is for any thread that
received a wake-up to then consider sending the next wake-up.

Two places receive VMSCAN_THROTTLE_ISOLATED wakeups, and both then call
too_many_isolated(), which - on success - sends another wakeup - before
the caller has had a chance to isolate anything.  If, instead, the
wakeup were sent some time later, after pages had been isolated but
before the caller (isolate_migratepages_block() or
shrink_inactive_list()) returned, then we would get an orderly
progression of threads running through that code.

For VMSCAN_THROTTLE_WRITEBACK, things are a little less straightforward.
There are two different places that wait for the wakeup, and a wake_up
is sent to all waiters after a time proportional to the number of
waiters.  Might it make sense to wake one thread per time unit?  That
might work well for do_writepages - every SWAP_CLUSTER_MAX writes
triggers one wakeup.  I'm less sure that it would work for
shrink_node().  Maybe the shrink_node() waiters could be non-exclusive,
so they get woken as soon as SWAP_CLUSTER_MAX writes complete, while
the do_writepages() waiters are exclusive and get woken one at a time.

For VMSCAN_THROTTLE_NOPROGRESS .... I don't understand.
If one zone isn't making "enough" progress, we throttle before moving
on to the next zone.  So we delay processing of the next zone, and only
indirectly delay re-processing of the current congested zone.  Maybe it
makes sense, but I don't see it yet.  I note that the commit message
says "it's messy".  I can't argue with that!

I'll follow up with patches to clarify what I am thinking about the
first two.  I'm not proposing the patches, just presenting them as part
of improving my understanding.

Thanks,
NeilBrown
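
P.S. To make the wake_up()/wake_up_all() distinction concrete (for my
own benefit as much as anyone's), here is a minimal, untested sketch.
demo_wqh and the cond() callback are made-up names, not anything in
mm/; the only difference between the two helpers is which
prepare_to_wait*() they call:

#include <linux/wait.h>
#include <linux/sched.h>

static DECLARE_WAIT_QUEUE_HEAD(demo_wqh);	/* illustrative only */

/* Non-exclusive wait, as the throttling code does today: every
 * waiter is woken by wake_up() and wake_up_all() alike.
 */
static void demo_wait(bool (*cond)(void))
{
	DEFINE_WAIT(wait);

	prepare_to_wait(&demo_wqh, &wait, TASK_UNINTERRUPTIBLE);
	if (!cond())
		schedule();
	finish_wait(&demo_wqh, &wait);
}

/* Exclusive wait: wake_up() now wakes only the first queued
 * exclusive waiter; wake_up_all() still wakes everyone.
 */
static void demo_wait_exclusive(bool (*cond)(void))
{
	DEFINE_WAIT(wait);

	prepare_to_wait_exclusive(&demo_wqh, &wait, TASK_UNINTERRUPTIBLE);
	if (!cond())
		schedule();
	finish_wait(&demo_wqh, &wait);
}

So with the second form, s/wake_up_all/wake_up/ changes "wake everyone"
into "wake the head of the queue", which is the substitution I was
alluding to above.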
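
For VMSCAN_THROTTLE_ISOLATED, the "wakee passes the baton" ordering I
describe above would look roughly like this.  This is hand-waved:
too_many_isolated() here stands in for the real helpers in mm/vmscan.c
and mm/compaction.c, and the timeout handling of the real code is
omitted:

static bool too_many_isolated(void);	/* stand-in for the mm/ helpers */

static void demo_isolate_pages(void)
{
	DEFINE_WAIT(wait);

	/* Queue exclusively, so wake_up() releases one thread at a time. */
	for (;;) {
		prepare_to_wait_exclusive(&demo_wqh, &wait,
					  TASK_UNINTERRUPTIBLE);
		if (!too_many_isolated())
			break;
		schedule();
	}
	finish_wait(&demo_wqh, &wait);

	/* ... isolate up to SWAP_CLUSTER_MAX pages here ... */

	/*
	 * Pass the baton only now, after this thread has isolated its
	 * pages, rather than from inside too_many_isolated() before any
	 * isolation has happened.
	 */
	if (!too_many_isolated())
		wake_up(&demo_wqh);
}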
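
And for the VMSCAN_THROTTLE_WRITEBACK idea, a mixed queue falls out
naturally from the waitqueue semantics: wake_up() wakes one exclusive
waiter plus all non-exclusive waiters, so do_writepages() callers could
queue exclusively and shrink_node() callers non-exclusively on the same
head.  A sketch of the write-completion side - again with made-up
names, and glossing over where the counter would actually live
(presumably per-pgdat):

#include <linux/atomic.h>
#include <linux/swap.h>		/* SWAP_CLUSTER_MAX */

static atomic_t demo_writes_done = ATOMIC_INIT(0);

static void demo_note_write_completed(void)
{
	/*
	 * Every SWAP_CLUSTER_MAX completed writes, release one
	 * exclusive (do_writepages) waiter; any non-exclusive
	 * (shrink_node) waiters on the same queue are all woken
	 * at the same time.
	 */
	if (atomic_inc_return(&demo_writes_done) % SWAP_CLUSTER_MAX == 0)
		wake_up(&demo_wqh);
}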