Hi,

On Thu, Apr 20, 2023 at 3:23 AM Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
>
> On Tue, Apr 18, 2023 at 11:17:23AM +0200, Vlastimil Babka wrote:
> > > Actually, the more I think about it the more I think the right answer
> > > is to keep kcompactd as using MIGRATE_SYNC_LIGHT and make
> > > MIGRATE_SYNC_LIGHT not block on the folio lock. kcompactd can accept
> > > some blocking but we don't want long / unbounded blocking. Reading the
> > > comments for MIGRATE_SYNC_LIGHT, this also seems like it fits pretty
> > > well. MIGRATE_SYNC_LIGHT says that the stall time of writepage() is
> > > too much. It's entirely plausible that someone else holding the lock
> > > is doing something as slow as writepage() and thus waiting on the lock
> > > can be just as bad for latency.
> >
> > +Cc Mel for potential insights. Sounds like a good compromise at first
> > glance, but it's a tricky area.
> > Also there are other callers of migration than compaction, and we should
> > make sure we are not breaking them unexpectedly.
> >
>
> It's tricky because part of the point of SYNC_LIGHT was to block on
> some operations but not for too long. writepage was considered to be an
> exception because it can be very slow for a variety of reasons. I think
> at the time that writeback from reclaim context was possible and it was
> very inefficient, more inefficient than reads. Storage can also be
> very slow (e.g. USB stick plugged in) and there can be large differences
> between read/write performance (SMR, some ssd etc). Pages read were generally
> clean and could be migrated, pages for write may be redirtied etc. It was
> assumed that while both read/write could lock the page for a long time,
> write had worse hold times and most other users of lock page were transient.

I think some of the slowness gets into the complex ways that systems
like ChromeOS are currently working.
As mentioned in the commit message of my RFC patch, ChromeOS currently
runs Android programs out of a 128K-block, zlib-compressed squashfs
disk. That squashfs disk is actually a loopback mounted file on the
main ext4 filesystem which is stored on something like eMMC.

If I understand everything correctly, if we get a page fault on memory
backed by this squashfs filesystem, then we can end up holding a
page/folio lock and then trying to read a pile of pages (enough to
decompress the whole 128K block). ...but we don't read them directly,
we instead block on ext4 which might need to allocate memory and then
finally blocks on the block driver completing the task. This whole
sequence of things is not necessarily fast. I think this is
responsible for some of the very large numbers that were part of my
original patch description.

Without the above squashfs setup, we can still run into slow times but
not quite as bad. I tried running a ChromeOS "memory pressure" test
against a mainline kernel plus _just_ the browser (Android disabled).
The test eventually opened over 90 tabs on my 4GB system and the
system got pretty janky, but still barely usable. While running the
test, I saw dozens of cases of folio_lock() taking over 10 ms and
quite a few (~10?) of it taking over 100 ms. The peak I saw was
~380 ms. I also happened to time buffer locking. That was much less
bad with the worst case topping out at ~70 ms. I'm not sure what
timeout you'd consider to be bad. 10 ms? 20 ms?

Also as a side note: I ran the same memory pressure test but _with_
Android running (though it doesn't actually stress Android, it's just
running in the background). To do this I had to run a downstream
kernel. Here it was easy to see a ~1.7 second wait on the page lock
without any ridiculous amount of stressing. ...and a ~1.5 second wait
for the buffer lock, too.

> A compromise for SYNC_LIGHT or even SYNC on lock page would be to try
> locking with a timeout.
> I don't think there is a specific helper but it
> should be possible to trylock, wait on the folio_waitqueue and attempt
> once to get the lock again. I didn't look very closely but it would be
> doing something similar to folio_wait_bit_common() with
> io_schedule_timeout instead of io_schedule. This will have false
> positives because the folio waitqueue may be woken for unrelated pages
> and obviously it can race with other wait queues.
>
> kcompactd is an out-of-line wait and can afford to wait for a long time
> without user-visible impact but 120 seconds or any potentially unbounded
> length of time is ridiculous and unhelpful. I would still be wary about
> adding new sync modes or making major modifications to kcompactd because
> detecting application stalls due to a kcompactd modification is difficult.

OK, I'll give this a shot. It doesn't look too hard, but we'll see.

> There is another approach -- kcompactd or proactive under heavy memory
> pressure is probably a waste of CPU time and resources and should
> avoid or minimise effort when under pressure. While direct compaction
> can capture a page for immediate use, kcompactd and proactive reclaim
> are just shuffling memory around for *potential* users and may be making
> the overall memory pressure even worse. If memory pressure detection was
> better and proactive/kcompactd reclaim bailed then the unbounded time to
> lock a page is mitigated or completely avoided.

I probably won't try to take this on, though it does sound like a good
idea for future research.
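
For reference, here is a rough userspace sketch of the "trylock, wait
once with a timeout, then try exactly once more" shape discussed above.
It is only an analogy and makes several assumptions: struct demo_lock,
demo_trylock(), demo_unlock() and demo_lock_timeout() are made-up names,
a pthread condition variable stands in for the folio waitqueue, and
pthread_cond_timedwait() stands in for sleeping with
io_schedule_timeout(). It is not kernel code and not necessarily what an
eventual patch would look like.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical userspace stand-in for a folio lock plus its waitqueue. */
struct demo_lock {
	pthread_mutex_t mtx;	/* protects 'locked' */
	pthread_cond_t wq;	/* plays the role of the waitqueue */
	bool locked;
};

#define DEMO_LOCK_INIT \
	{ PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false }

/* Take the lock only if it is free right now; never sleeps on the lock. */
static bool demo_trylock(struct demo_lock *l)
{
	bool got;

	pthread_mutex_lock(&l->mtx);
	got = !l->locked;
	if (got)
		l->locked = true;
	pthread_mutex_unlock(&l->mtx);
	return got;
}

/* Release the lock and wake every waiter. */
static void demo_unlock(struct demo_lock *l)
{
	pthread_mutex_lock(&l->mtx);
	l->locked = false;
	pthread_cond_broadcast(&l->wq);
	pthread_mutex_unlock(&l->mtx);
}

/*
 * Trylock; if that fails, sleep at most once for up to timeout_ms and
 * then make a single further attempt. The wait is deliberately not
 * looped, so a spurious or unrelated wakeup can make us give up early.
 */
static bool demo_lock_timeout(struct demo_lock *l, long timeout_ms)
{
	struct timespec deadline;
	bool got;

	assert(timeout_ms >= 0);

	if (demo_trylock(l))
		return true;

	/* pthread_cond_timedwait() takes an absolute CLOCK_REALTIME time. */
	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += timeout_ms / 1000;
	deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
	if (deadline.tv_nsec >= 1000000000L) {
		deadline.tv_sec++;
		deadline.tv_nsec -= 1000000000L;
	}

	pthread_mutex_lock(&l->mtx);
	if (l->locked)
		pthread_cond_timedwait(&l->wq, &l->mtx, &deadline);
	got = !l->locked;	/* the one and only retry */
	if (got)
		l->locked = true;
	pthread_mutex_unlock(&l->mtx);
	return got;
}
```

A caller in the kcompactd role would treat a false return from
demo_lock_timeout() as "skip this page and move on" instead of blocking
indefinitely. The single un-looped wait mirrors the false-positive
caveat above: being woken does not guarantee the lock is free.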