On 06/01/2018 01:30 AM, Hugh Dickins wrote:
> On Fri, 1 Jun 2018, Ivan Kalvachev wrote:
>> On 5/31/18, Greg Thelen <gthelen@xxxxxxxxxx> wrote:
>>>
>>> This looks like yesterday's https://lkml.org/lkml/2018/5/30/1158
>>>
>>
>> Yes, it seems to be the same problem.
>> It also have better technical description.
>
> Well, your paragraph above on "Big memory consumers" gives a much
> better user viewpoint, and a more urgent case for the patch to go in,
> to stable if it does not make 4.17.0.
>
> But I am surprised: the change is in a block of code only used in
> one of the modes of compaction (not in reclaim itself), and I thought
> it was a mode which gives up quite easily, rather than visibly blocking.
>
> So I wonder if there's another issue to be improved here,
> and the mistreatment of the ex-swap pages just exposed it somehow.
> Cc'ing Vlastimil and David in case it triggers any insight from them.

My guess is that compaction fails because of the isolation failures,
causing further reclaim/compaction attempts with higher priority, in
the context of non-costly and thus non-failing allocations.

Initially I thought that compaction at increased priority would
eventually become synchronous and thus not go via this block of code
anymore. But (see isolate_migratepages()) only MIGRATE_SYNC compaction
mode drops the ISOLATE_ASYNC_MIGRATE isolate_mode flag. And MIGRATE_SYNC
is only used for compaction triggered via /proc - direct compaction
stops at MIGRATE_SYNC_LIGHT. Maybe that could be changed? Mel had
reasons to limit it to SYNC_LIGHT, I guess...

If the above is correct, it means that even with gigabytes of free
memory you can fail an order-3 (max non-costly order) allocation if
compaction doesn't work properly. That's a bit surprising, but not
impossible I guess...

Vlastimil

>>
>> Such let down.
>> It took me so much time to bisect the issue...
>
> Thank you for all your work on it, odd how we found it at the same
> time: I was just porting Mel's patch into another tree, had to make
> a change near there, and suddenly noticed that the test was wrong.
>
> Hugh
>
>>
>> Well, I hope that the fix will get into 4.17 release in time.
>
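
For readers following the reasoning in the reply above, here is a toy model of the mode/flag relationship it describes. This is not kernel code: the Python classes and the helper names (`isolate_flags`, `takes_async_isolation_path`, `DIRECT_COMPACTION_MODES`, `PROC_COMPACTION_MODE`) are hypothetical, and the two rules encoded here are taken only from the statements in the mail (only MIGRATE_SYNC drops ISOLATE_ASYNC_MIGRATE; direct compaction stops at MIGRATE_SYNC_LIGHT). It is a sketch under those assumptions and should be checked against the actual sources.

```python
from enum import Enum, IntFlag


class MigrateMode(Enum):
    """Compaction migrate modes named in the mail (values are arbitrary)."""
    MIGRATE_ASYNC = 0
    MIGRATE_SYNC_LIGHT = 1
    MIGRATE_SYNC = 2


class IsolateFlag(IntFlag):
    """Only the one isolate_mode flag discussed in the mail is modeled."""
    NONE = 0
    ISOLATE_ASYNC_MIGRATE = 1


# Assumption from the mail: direct compaction escalates no further than
# MIGRATE_SYNC_LIGHT, while /proc-triggered compaction uses MIGRATE_SYNC.
DIRECT_COMPACTION_MODES = (
    MigrateMode.MIGRATE_ASYNC,
    MigrateMode.MIGRATE_SYNC_LIGHT,
)
PROC_COMPACTION_MODE = MigrateMode.MIGRATE_SYNC


def isolate_flags(mode: MigrateMode) -> IsolateFlag:
    """Assumption from the mail: only MIGRATE_SYNC drops the async flag."""
    if mode is MigrateMode.MIGRATE_SYNC:
        return IsolateFlag.NONE
    return IsolateFlag.ISOLATE_ASYNC_MIGRATE


def takes_async_isolation_path(mode: MigrateMode) -> bool:
    """True if isolation in this mode still goes via the async-only check."""
    return bool(isolate_flags(mode) & IsolateFlag.ISOLATE_ASYNC_MIGRATE)


if __name__ == "__main__":
    for mode in (*DIRECT_COMPACTION_MODES, PROC_COMPACTION_MODE):
        print(mode.name, takes_async_isolation_path(mode))
```

In this model every mode reachable from direct compaction still passes through the async isolation check, which is why raising the compaction priority alone would not route around a faulty test in that block.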