On Thu, Dec 05, 2024 at 05:31:26PM -0700, Yu Zhao wrote:
> With the aging feedback no longer considering the distribution of
> folios in each generation, rework workingset protection to better
> distribute folios across MAX_NR_GENS. This is achieved by reusing
> PG_workingset and PG_referenced/LRU_REFS_FLAGS in a slightly different
> way.
>
> For folios accessed multiple times through file descriptors, make
> lru_gen_inc_refs() set additional bits of LRU_REFS_WIDTH in
> folio->flags after PG_referenced, then PG_workingset after
> LRU_REFS_WIDTH. After all its bits are set, i.e.,
> LRU_REFS_FLAGS|BIT(PG_workingset), a folio is lazily promoted into the
> second oldest generation in the eviction path. And when
> folio_inc_gen() does that, it clears LRU_REFS_FLAGS so that
> lru_gen_inc_refs() can start over. For this case, LRU_REFS_MASK is
> only valid when PG_referenced is set.
>
> For folios accessed multiple times through page tables,
> folio_update_gen() from a page table walk or lru_gen_set_refs() from a
> rmap walk sets PG_referenced after the accessed bit is cleared for the
> first time. Thereafter, those two paths set PG_workingset and promote
> folios to the youngest generation. Like folio_inc_gen(), when
> folio_update_gen() does that, it also clears PG_referenced. For this
> case, LRU_REFS_MASK is not used.
>
> For both of the cases, after PG_workingset is set on a folio, it
> remains until this folio is either reclaimed, or "deactivated" by
> lru_gen_clear_refs(). It can be set again if lru_gen_test_recent()
> returns true upon a refault.
>
> When adding folios to the LRU lists, lru_gen_distance() distributes
> them as follows:
>   +---------------------------------+---------------------------------+
>   |  Accessed thru page tables      |  Accessed thru file descriptors |
>   +---------------------------------+---------------------------------+
>   | PG_active (set while isolated)  |                                 |
>   +----------------+----------------+----------------+----------------+
>   |  PG_workingset | PG_referenced  |  PG_workingset | LRU_REFS_FLAGS |
>   +---------------------------------+---------------------------------+
>   |<--------- MIN_NR_GENS --------->|                                 |
>   |<-------------------------- MAX_NR_GENS -------------------------->|
>
> After this patch, some typical client and server workloads showed
> improvements under heavy memory pressure. For example, Python TPC-C,
> which was used to benchmark a different approach [1] to better detect
> refault distances, showed a significant decrease in total refaults:
>                             Before      After      Change
>   Time (seconds)            10801       10801      0%
>   Executed (transactions)   41472       43663      +5%
>   workingset_nodes          109070      120244     +10%
>   workingset_refault_anon   5019627     7281831    +45%
>   workingset_refault_file   1294678786  554855564  -57%
>   workingset_refault_total  1299698413  562137395  -57%
>
> [1] https://lore.kernel.org/20230920190244.16839-1-ryncsn@xxxxxxxxx/
>
> Reported-by: Kairui Song <kasong@xxxxxxxxxxx>
> Closes: https://lore.kernel.org/CAOUHufahuWcKf5f1Sg3emnqX+cODuR=2TQo7T4Gr-QYLujn4RA@xxxxxxxxxxxxxx/
> Signed-off-by: Yu Zhao <yuzhao@xxxxxxxxxx>
> Tested-by: Kalesh Singh <kaleshsingh@xxxxxxxxxx>
> ---
>  include/linux/mm_inline.h |  94 +++++++++++++------------
>  include/linux/mmzone.h    |  82 +++++++++++++---------
>  mm/swap.c                 |  23 +++---
>  mm/vmscan.c               | 142 +++++++++++++++++++++++---------------
>  mm/workingset.c           |  29 ++++----
>  5 files changed, 209 insertions(+), 161 deletions(-)
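To make sure I read the file-descriptor path above correctly, here is a
toy userspace model of the lru_gen_inc_refs() progression (the bit
positions, the 2-bit width, and the inc_refs() helper below are made up
for illustration and do not match the real folio->flags layout):

#include <stdio.h>

#define PG_REFERENCED	(1UL << 0)
#define LRU_REFS_SHIFT	1
#define LRU_REFS_WIDTH	2	/* the real width is config-dependent */
#define LRU_REFS_MASK	(((1UL << LRU_REFS_WIDTH) - 1) << LRU_REFS_SHIFT)
#define PG_WORKINGSET	(1UL << (LRU_REFS_SHIFT + LRU_REFS_WIDTH))
#define LRU_REFS_FLAGS	(PG_REFERENCED | LRU_REFS_MASK)

/* One call per access through a file descriptor: set PG_referenced
 * first, then fill the LRU_REFS_MASK counter, then set PG_workingset;
 * returns 1 once LRU_REFS_FLAGS|PG_WORKINGSET is complete, i.e., the
 * folio is due for lazy promotion in the eviction path. */
static int inc_refs(unsigned long *flags)
{
	if (!(*flags & PG_REFERENCED)) {
		*flags |= PG_REFERENCED;
		return 0;
	}
	if ((*flags & LRU_REFS_MASK) != LRU_REFS_MASK) {
		*flags += 1UL << LRU_REFS_SHIFT;	/* bump the counter */
		return 0;
	}
	*flags |= PG_WORKINGSET;	/* counter saturated */
	return 1;
}

int main(void)
{
	unsigned long flags = 0;
	int i;

	for (i = 1; i <= 5; i++) {
		int promote = inc_refs(&flags);

		printf("access %d: promote=%d flags=%#lx\n", i, promote, flags);
	}
	return 0;
}

With a 2-bit counter, the fifth access is the one that completes
LRU_REFS_FLAGS|BIT(PG_workingset); after the eviction path then
promotes the folio, folio_inc_gen() clears LRU_REFS_FLAGS and the
cycle starts over, if I followed the description correctly.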
Some outlier results from LULESH (Livermore Unstructured Lagrangian
Explicit Shock Hydrodynamics) [1] caught my eye. The following fix made
the benchmark a lot happier (128GB DRAM + Optane swap):

                          Before   After   Change
  Average (z/s)           6894     7574    +10%
  Deviation (10 samples)  12.96%   1.76%   -86%

[1] https://asc.llnl.gov/codes/proxy-apps/lulesh

Andrew, can you please fold it in? Thanks!

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 90bbc2b3be8b..5e03a61c894f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -916,8 +916,7 @@ static enum folio_references folio_check_references(struct folio *folio,
 		if (!referenced_ptes)
 			return FOLIOREF_RECLAIM;
 
-		lru_gen_set_refs(folio);
-		return FOLIOREF_ACTIVATE;
+		return lru_gen_set_refs(folio) ? FOLIOREF_ACTIVATE : FOLIOREF_KEEP;
 	}
 
 	referenced_folio = folio_test_clear_referenced(folio);
@@ -4173,11 +4172,7 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 			old_gen = folio_update_gen(folio, new_gen);
 			if (old_gen >= 0 && old_gen != new_gen)
 				update_batch_size(walk, folio, old_gen, new_gen);
-
-			continue;
-		}
-
-		if (lru_gen_set_refs(folio)) {
+		} else if (lru_gen_set_refs(folio)) {
 			old_gen = folio_lru_gen(folio);
 			if (old_gen >= 0 && old_gen != new_gen)
 				folio_activate(folio);
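For the archives, my reading of why the first hunk matters: per the
description above, lru_gen_set_refs() only sets PG_referenced the first
time the accessed bit is cleared and, as the second hunk also suggests,
returns false in that case, so such a folio should merely be kept in
place; unconditionally returning FOLIOREF_ACTIVATE promoted
once-referenced folios right away and over-protected them. A minimal
standalone sketch of the fixed tail (set_refs_ok stands in for the
return value of lru_gen_set_refs(); enum values abbreviated):

enum folio_references {
	FOLIOREF_RECLAIM,
	FOLIOREF_KEEP,
	FOLIOREF_ACTIVATE,
};

static enum folio_references check_references(int referenced_ptes,
					      int set_refs_ok)
{
	if (!referenced_ptes)
		return FOLIOREF_RECLAIM;
	/*
	 * Before the fix this point returned FOLIOREF_ACTIVATE
	 * unconditionally, promoting folios on their very first
	 * reference.
	 */
	return set_refs_ok ? FOLIOREF_ACTIVATE : FOLIOREF_KEEP;
}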