Yang Shi <shy828301@xxxxxxxxx> writes:

> On Thu, Aug 20, 2020 at 8:22 AM Dave Hansen <dave.hansen@xxxxxxxxx> wrote:
>>
>> On 8/20/20 1:06 AM, Huang, Ying wrote:
>> >> +	/* Migrate pages selected for demotion */
>> >> +	nr_reclaimed += demote_page_list(&ret_pages, &demote_pages, pgdat, sc);
>> >> +
>> >> 	pgactivate = stat->nr_activate[0] + stat->nr_activate[1];
>> >>
>> >> 	mem_cgroup_uncharge_list(&free_pages);
>> >> _
>> > Generally, it's good to batch the page migration.  But one side effect
>> > is that, if the pages fail to be migrated, they will be placed back on
>> > the LRU list instead of falling back to be reclaimed really.  This may
>> > cause an issue in some situations.  For example, if there's not enough
>> > space in the PMEM (slow) node, the page migration fails, and OOM may
>> > be triggered, because direct reclaim on the DRAM (fast) node may make
>> > no progress, while before it could really reclaim some pages.
>>
>> Yes, agreed.
>
> Kind of.  But I think that should be transient and very rare.  The
> kswapd on pmem nodes will be woken up to drop pages when we try to
> allocate migration target pages.  It should be very rare that there is
> no reclaimable page on pmem nodes.
>
>>
>> There are a couple of ways we could fix this.  Instead of splicing
>> 'demote_pages' back into 'ret_pages', we could try to get them back on
>> 'page_list' and goto the beginning of shrink_page_list().  This will
>> probably yield the best behavior, but might be a bit ugly.
>>
>> We could also add a field to 'struct scan_control' and just stop trying
>> to migrate after it has failed one or more times.  The trick will be
>> picking a threshold that doesn't mess with either the normal reclaim
>> rate or the migration rate.
>
> In my patchset I implemented a fallback mechanism via adding a new
> PGDAT_CONTENDED node flag.  Please check this out:
> https://patchwork.kernel.org/patch/10993839/.
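For readers following the thread, a rough userspace sketch of the flag-based fallback idea described above.  The helper names (`demote_or_reclaim`, `migrate_pages_stub`) and the bit manipulation here are illustrative stand-ins, not the actual patch; the real code lives in the kernel's vmscan/pgdat machinery:

```c
#include <errno.h>
#include <stdbool.h>

/* Illustrative stand-in for the kernel's pgdat flags (cf. PGDAT_WRITEBACK, PGDAT_DIRTY). */
enum { PGDAT_CONTENDED = 0 };

struct pgdat {
	unsigned long flags;
};

/* Stubbed migrate_pages(): pretend the target pmem node is full when asked. */
static bool target_full;
static int migrate_pages_stub(void)
{
	return target_full ? -ENOMEM : 0;
}

static void set_contended(struct pgdat *p)   { p->flags |=  (1UL << PGDAT_CONTENDED); }
static void clear_contended(struct pgdat *p) { p->flags &= ~(1UL << PGDAT_CONTENDED); }
static bool is_contended(const struct pgdat *p)
{
	return p->flags & (1UL << PGDAT_CONTENDED);
}

/*
 * Sketch of the proposed policy: while PGDAT_CONTENDED is set on the
 * DRAM node, skip demotion and reclaim the pages directly; otherwise
 * try to migrate, and mark the node contended when migration hits
 * -ENOMEM.  Returns true if the pages took the regular reclaim path.
 */
static bool demote_or_reclaim(struct pgdat *dram)
{
	if (is_contended(dram))
		return true;			/* fall back to regular reclaim */
	if (migrate_pages_stub() == -ENOMEM) {
		set_contended(dram);		/* target pmem node under pressure */
		return true;			/* reclaim these pages for real */
	}
	return false;				/* demotion succeeded */
}
```

The flag stays set until something (clear_pgdat_congested() in the real proposal) observes the pressure is gone, so subsequent reclaim passes stop wasting effort on migrations that would fail anyway.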
>
> Basically the PGDAT_CONTENDED flag will be set once migrate_pages()
> returns -ENOMEM, which indicates the target pmem node is under memory
> pressure, then it would fall back to the regular reclaim path.  The
> flag would be cleared by clear_pgdat_congested() once the pmem node's
> memory pressure is gone.

There may be some races between the flag set and clear.  For example,

- try to migrate some pages from the DRAM node to the PMEM node
- not enough free pages on the PMEM node, so wake up kswapd
- kswapd on the PMEM node reclaims some pages, then tries to clear
  PGDAT_CONTENDED on the DRAM node
- set PGDAT_CONTENDED on the DRAM node

This may be resolvable.  But I still prefer to fall back to real page
reclaim directly for the pages that failed to be migrated.  That looks
more robust.

Best Regards,
Huang, Ying

> We already use node flags to indicate the state of the node in reclaim
> code, i.e. PGDAT_WRITEBACK, PGDAT_DIRTY, etc.  So, adding a new flag
> sounds more straightforward to me IMHO.
>
>>
>> This is on my list to fix up next.
>>
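The direct-fallback preference stated above can be sketched in userspace as well.  This toy model (the `struct page` list, `splice`, and the odd-id failure rule are all made up for illustration; the kernel would splice `struct list_head`s inside shrink_page_list()) shows the key point: migration failures are fed back into `page_list` and reclaimed on the same pass instead of being returned to the LRU:

```c
#include <stddef.h>

/* Toy page list; the kernel uses struct list_head splicing instead. */
struct page {
	int id;
	struct page *next;
};

/* Move every page from *src onto *dst, emptying *src. */
static void splice(struct page **dst, struct page **src)
{
	while (*src) {
		struct page *p = *src;
		*src = p->next;
		p->next = *dst;
		*dst = p;
	}
}

/*
 * Stubbed demotion: pages with odd ids "fail" to migrate and land on
 * the failed list; even ids migrate successfully.  Returns the number
 * of pages migrated.
 */
static int demote_page_list_stub(struct page **demote, struct page **failed)
{
	int migrated = 0;
	struct page *p;

	while ((p = *demote)) {
		*demote = p->next;
		if (p->id & 1) {
			p->next = *failed;
			*failed = p;
		} else {
			migrated++;	/* page left this node */
		}
	}
	return migrated;
}

/*
 * The fallback: instead of putting failures back on the LRU, splice
 * them back onto page_list so this pass reclaims them for real,
 * guaranteeing forward progress even when the pmem node is full.
 */
static int shrink_with_fallback(struct page **page_list, int *reclaimed)
{
	struct page *failed = NULL;
	int migrated = demote_page_list_stub(page_list, &failed);

	splice(page_list, &failed);	/* retry failures as regular reclaim */
	while (*page_list) {		/* reclaim whatever is left */
		struct page *p = *page_list;
		*page_list = p->next;
		(*reclaimed)++;
	}
	return migrated;
}
```

Under this scheme no flag or cross-node synchronization is needed, which is why it avoids the set/clear race described above: the decision is made per page, per pass.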