Re: [RFC PATCH] migrate_pages: Never block waiting for the page lock

Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> · Thu, 20 Apr 2023 11:23:04 +0100

On Tue, Apr 18, 2023 at 11:17:23AM +0200, Vlastimil Babka wrote:
> > Actually, the more I think about it the more I think the right answer
> > is to keep kcompactd as using MIGRATE_SYNC_LIGHT and make
> > MIGRATE_SYNC_LIGHT not block on the folio lock. kcompactd can accept
> > some blocking but we don't want long / unbounded blocking. Reading the
> > comments for MIGRATE_SYNC_LIGHT, this also seems like it fits pretty
> > well. MIGRATE_SYNC_LIGHT says that the stall time of writepage() is
> > too much. It's entirely plausible that someone else holding the lock
> > is doing something as slow as writepage() and thus waiting on the lock
> > can be just as bad for latency.
> 
> +Cc Mel for potential insights. Sounds like a good compromise at first
> glance, but it's a tricky area.
> Also there are other callers of migration than compaction, and we should
> make sure we are not breaking them unexpectedly.
> 

It's tricky because part of the point of SYNC_LIGHT was to block on
some operations but not for too long. writepage was considered to be an
exception because it can be very slow for a variety of reasons. I think
At the time that writeback from reclaim context was possible and it was
very inefficient, more more inefficient than reads.  Storage can also be
very slow (e.g. USB stick plugged in) and there can be large differences
between read/write performance (SMR, some ssd etc). Pages read were generally
clean and could be migrated, pages for write may be redirtied etc. It was
assumed that while both read/write could lock the page for a long time,
write had worse hold times and most other users of lock page were transient.

A compromise for SYNC_LIGHT or even SYNC on lock page would be to try
locking with a timeout. I don't think there is a specific helper but it
should be possible to trylock, wait on the folio_waitqueue and attempt
once to get the lock again. I didn't look very closely but it would
doing something similar to folio_wait_bit_common() with
io_schedule_timeout instead of io_schedule. This will have false
positives because the folio waitqueue may be woken for unrelated pages
and obviously it can race with other wait queues.

kcompactd is an out-of-line wait and can afford to wait for a long time
without user-visible impact but 120 seconds or any potentially unbounded
length of time is ridiculous and unhelpful. I would still be wary about
adding new sync modes or making major modifications to kcompactd because
detecting application stalls due to a kcompactd modification is difficult.

There is another approach -- kcompactd or proactive under heavy memory
pressure is probably a waste of CPU time and resources and should
avoid or minimise effort when under pressure. While direct compaction
can capture a page for immediate use, kcompactd and proactive reclaim
are just shuffling memory around for *potential* users and may be making
the overall memory pressure even worse. If memory pressure detection was
better and proactive/kcompactd reclaim bailed then the unbounded time to
lock a page is mitigated or completely avoided.

The two options are ortogonal. trylock_page_timeout() would mitigate
worst case behaviour while memory pressure detection and bailing on SYNC*
compaction side-steps the issue in extreme cases. Both have value as
pressure detection would be best-effort and trylock_page_timeout() would
limit worst-case behaviour.

> > I'll try to send out a v2 with this approach today and we can see what
> > people think.
> 
> Please Cc Mel and myself for further versions.
> 

Yes please.

-- 
Mel Gorman
SUSE Labs