On Wed, 6 Apr 2022 21:15:22 -0600 Yu Zhao <yuzhao@xxxxxxxxxx> wrote:

> Add /sys/kernel/mm/lru_gen/enabled as a kill switch. Components that
> can be disabled include:
>   0x0001: the multi-gen LRU core
>   0x0002: walking page table, when arch_has_hw_pte_young() returns
>           true
>   0x0004: clearing the accessed bit in non-leaf PMD entries, when
>           CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y
>   [yYnN]: apply to all the components above
> E.g.,
>   echo y >/sys/kernel/mm/lru_gen/enabled
>   cat /sys/kernel/mm/lru_gen/enabled
>   0x0007
>   echo 5 >/sys/kernel/mm/lru_gen/enabled
>   cat /sys/kernel/mm/lru_gen/enabled
>   0x0005

I'm shocked that this actually works.

How does it work?  Existing pages & folios are drained over time or
synchronously?  Supporting structures remain allocated, available for
reenablement?

Why is it thought necessary to have this?  Is it expected to be
permanent?

> NB: the page table walks happen on the scale of seconds under heavy
> memory pressure, in which case the mmap_lock contention is a lesser
> concern, compared with the LRU lock contention and the I/O congestion.
> So far the only well-known case of the mmap_lock contention happens on
> Android, due to Scudo [1] which allocates several thousand VMAs for
> merely a few hundred MBs. The SPF and the Maple Tree also have
> provided their own assessments [2][3]. However, if walking page tables
> does worsen the mmap_lock contention, the kill switch can be used to
> disable it. In this case the multi-gen LRU will suffer a minor
> performance degradation, as shown previously.
>
> Clearing the accessed bit in non-leaf PMD entries can also be
> disabled, since this behavior was not tested on x86 varieties other
> than Intel and AMD.
>
> ...
>
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -432,6 +432,18 @@ static inline void cgroup_put(struct cgroup *cgrp)
>  	css_put(&cgrp->self);
>  }
>
> +extern struct mutex cgroup_mutex;
> +
> +static inline void cgroup_lock(void)
> +{
> +	mutex_lock(&cgroup_mutex);
> +}
> +
> +static inline void cgroup_unlock(void)
> +{
> +	mutex_unlock(&cgroup_mutex);
> +}

It's a tad rude to export cgroup_mutex like this without (apparently)
informing its owner (Tejun).

And if we're going to wrap its operations via helper functions then

- presumably all cgroup_mutex operations should be wrapped and

- existing open-coded operations on this mutex should be converted
  (a conversion sketch appears further below).

>
> ...
>
> +static bool drain_evictable(struct lruvec *lruvec)
> +{
> +	int gen, type, zone;
> +	int remaining = MAX_LRU_BATCH;
> +
> +	for_each_gen_type_zone(gen, type, zone) {
> +		struct list_head *head = &lruvec->lrugen.lists[gen][type][zone];
> +
> +		while (!list_empty(head)) {
> +			bool success;
> +			struct folio *folio = lru_to_folio(head);
> +
> +			VM_BUG_ON_FOLIO(folio_test_unevictable(folio), folio);
> +			VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
> +			VM_BUG_ON_FOLIO(folio_is_file_lru(folio) != type, folio);
> +			VM_BUG_ON_FOLIO(folio_zonenum(folio) != zone, folio);

So many new BUG_ONs to upset Linus :(  (a possible softening is sketched
right after this quoted function)

> +			success = lru_gen_del_folio(lruvec, folio, false);
> +			VM_BUG_ON(!success);
> +			lruvec_add_folio(lruvec, folio);
> +
> +			if (!--remaining)
> +				return false;
> +		}
> +	}
> +
> +	return true;
> +}
> +
>
> ...
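If the BUG_ONs are the sticking point, one option is to downgrade the
invariant checks to one-shot warnings. This is only a sketch (untested,
and not something the patch itself proposes); it assumes the
VM_WARN_ON_ONCE_FOLIO and VM_WARN_ON_ONCE macros from
include/linux/mmdebug.h, which compile away when CONFIG_DEBUG_VM=n:

	while (!list_empty(head)) {
		bool success;
		struct folio *folio = lru_to_folio(head);

		/* Warn once per boot instead of panicking the box. */
		VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio);
		VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio);
		VM_WARN_ON_ONCE_FOLIO(folio_is_file_lru(folio) != type, folio);
		VM_WARN_ON_ONCE_FOLIO(folio_zonenum(folio) != zone, folio);

		success = lru_gen_del_folio(lruvec, folio, false);
		/* Deletion from the lrugen lists should not fail. */
		VM_WARN_ON_ONCE(!success);
		lruvec_add_folio(lruvec, folio);

		if (!--remaining)
			return false;
	}

The trade-off is the usual one: a warning keeps the machine alive, at
the cost of possibly letting a corrupted LRU list limp along.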
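On the cgroup_lock()/cgroup_unlock() wrappers: if they do go in, the
conversion of existing callers is mechanical. A hypothetical
before/after (do_something_locked() is a stand-in, not a real caller in
the tree):

	/* Before: open-coded locking of cgroup_mutex. */
	mutex_lock(&cgroup_mutex);
	err = do_something_locked();
	mutex_unlock(&cgroup_mutex);

	/* After: the same critical section via the new helpers. */
	cgroup_lock();
	err = do_something_locked();
	cgroup_unlock();

The benefit of doing this tree-wide is a single choke point: if
cgroup's locking scheme ever changes, only the helpers need updating.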
> +static ssize_t store_enable(struct kobject *kobj, struct kobj_attribute *attr,
> +			     const char *buf, size_t len)
> +{
> +	int i;
> +	unsigned int caps;
> +
> +	if (tolower(*buf) == 'n')
> +		caps = 0;
> +	else if (tolower(*buf) == 'y')
> +		caps = -1;
> +	else if (kstrtouint(buf, 0, &caps))
> +		return -EINVAL;

See kstrtobool() (a sketch follows the quoted function below).

> +	for (i = 0; i < NR_LRU_GEN_CAPS; i++) {
> +		bool enable = caps & BIT(i);
> +
> +		if (i == LRU_GEN_CORE)
> +			lru_gen_change_state(enable);
> +		else if (enable)
> +			static_branch_enable(&lru_gen_caps[i]);
> +		else
> +			static_branch_disable(&lru_gen_caps[i]);
> +	}
> +
> +	return len;
> +}
>
> ...
>
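To make the kstrtobool() suggestion concrete: kstrtobool() already
accepts [yYnN], 0/1 and "on"/"off", so the hand-rolled tolower() checks
could collapse into one call. A sketch only (untested), reusing the
names from the quoted patch:

static ssize_t store_enable(struct kobject *kobj, struct kobj_attribute *attr,
			    const char *buf, size_t len)
{
	int i;
	bool on;
	unsigned int caps;

	/* kstrtobool() understands [yYnN10] and "on"/"off". */
	if (!kstrtobool(buf, &on))
		caps = on ? -1 : 0;
	/* Otherwise treat the input as a raw capability bitmask. */
	else if (kstrtouint(buf, 0, &caps))
		return -EINVAL;

	for (i = 0; i < NR_LRU_GEN_CAPS; i++) {
		bool enable = caps & BIT(i);

		if (i == LRU_GEN_CORE)
			lru_gen_change_state(enable);
		else if (enable)
			static_branch_enable(&lru_gen_caps[i]);
		else
			static_branch_disable(&lru_gen_caps[i]);
	}

	return len;
}

One wrinkle: kstrtobool() also consumes "0" and "1", so "echo 1" would
now enable every capability rather than just bit 0 (the multi-gen LRU
core). Whether that is an improvement or a regression of the interface
is a judgment call.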