On Wed, 4 Sep 2024 09:27:40 -0700 Davidlohr Bueso <dave@xxxxxxxxxxxx> wrote:

> This adds support for allowing proactive reclaim in general on a
> NUMA system. A per-node interface extends support for beyond a
> memcg-specific interface, respecting the current semantics of
> memory.reclaim: respecting aging LRU and not supporting
> artificially triggering eviction on nodes belonging to non-bottom
> tiers.
>
> This patch allows userspace to do:
>
>     echo 512M swappiness=10 > /sys/devices/system/node/nodeX/reclaim

One value per sysfs file is a rule.

> One of the premises for this is to semantically align as best as
> possible with memory.reclaim. During a brief time memcg did
> support nodemask until 55ab834a86a9 (Revert "mm: add nodes=
> arg to memory.reclaim"), for which semantics around reclaim
> (eviction) vs demotion were not clear, rendering charging
> expectations to be broken.
>
> With this approach:
>
> 1. Users who do not use memcg can benefit from proactive reclaim.
>
> 2. Proactive reclaim on top tiers will trigger demotion, for which
> memory is still byte-addressable. Reclaiming on the bottom nodes
> will trigger evicting to swap (the traditional sense of reclaim).
> This follows the semantics of what is today part of the aging process
> on tiered memory, mirroring what every other form of reclaim does
> (reactive and memcg proactive reclaim). Furthermore per-node proactive
> reclaim is not as susceptible to the memcg charging problem mentioned
> above.
>
> 3. Unlike memcg, there should be no surprises of callers expecting
> reclaim but instead got a demotion. Essentially relying on behavior
> of shrink_folio_list() after 6b426d071419 (mm: disable top-tier
> fallback to reclaim on proactive reclaim), without the expectations
> of try_to_free_mem_cgroup_pages().
>
> 4. Unlike the nodes= arg, this interface avoids confusing semantics,
> such as what exactly the user wants when mixing top-tier and low-tier
> nodes in the nodemask.
> Further per-node interface is less exposed to
> "free up memory in my container" usecases, where eviction is intended.
>
> 5. Users that *really* want to free up memory can use proactive reclaim
> on nodes knowingly to be on the bottom tiers to force eviction in a
> natural way - higher access latencies are still better than swap.
> If compelled, while no guarantees and perhaps not worth the effort,
> users could also also potentially follow a ladder-like approach to
> eventually free up the memory. Alternatively, perhaps an 'evict' option
> could be added to the parameters for both memory.reclaim and per-node
> interfaces to force this action unconditionally.
>
> ...
>
> --- a/Documentation/ABI/stable/sysfs-devices-node
> +++ b/Documentation/ABI/stable/sysfs-devices-node
> @@ -221,3 +221,14 @@ Contact:	Jiaqi Yan <jiaqiyan@xxxxxxxxxx>
>  Description:
>  		Of the raw poisoned pages on a NUMA node, how many pages are
>  		recovered by memory error recovery attempt.
> +
> +What:		/sys/devices/system/node/nodeX/reclaim
> +Date:		September 2024
> +Contact:	Linux Memory Management list <linux-mm@xxxxxxxxx>
> +Description:
> +		This is write-only nested-keyed file which accepts the number of

"is a write-only".  What does "nested keyed" mean?

> +		bytes to reclaim as well as the swappiness for this particular
> +		operation. Write the amount of bytes to induce memory reclaim in
> +		this node. When it completes successfully, the specified amount
> +		or more memory will have been reclaimed, and -EAGAIN if less
> +		bytes are reclaimed than the specified amount.

Could be that this feature would benefit from a more expansive
treatment under Documentation/somewhere.

>
> ...
>
> +#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
> +
> +enum {
> +	MEMORY_RECLAIM_SWAPPINESS = 0,
> +	MEMORY_RECLAIM_NULL,
> +};
> +
> +static const match_table_t tokens = {
> +	{ MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"},
> +	{ MEMORY_RECLAIM_NULL, NULL },
> +};
> +
> +static ssize_t reclaim_store(struct device *dev,
> +			     struct device_attribute *attr,
> +			     const char *buf, size_t count)
> +{
> +	int nid = dev->id;
> +	gfp_t gfp_mask = GFP_KERNEL;
> +	struct pglist_data *pgdat = NODE_DATA(nid);
> +	unsigned long nr_to_reclaim, nr_reclaimed = 0;
> +	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> +	int swappiness = -1;
> +	char *old_buf, *start;
> +	substring_t args[MAX_OPT_ARGS];
> +	struct scan_control sc = {
> +		.gfp_mask = current_gfp_context(gfp_mask),
> +		.reclaim_idx = gfp_zone(gfp_mask),
> +		.priority = DEF_PRIORITY,
> +		.may_writepage = !laptop_mode,
> +		.may_unmap = 1,
> +		.may_swap = 1,
> +		.proactive = 1,
> +	};
> +
> +	buf = strstrip((char *)buf);
> +
> +	old_buf = (char *)buf;
> +	nr_to_reclaim = memparse(buf, (char **)&buf) / PAGE_SIZE;
> +	if (buf == old_buf)
> +		return -EINVAL;
> +
> +	buf = strstrip((char *)buf);
> +
> +	while ((start = strsep((char **)&buf, " ")) != NULL) {
> +		if (!strlen(start))
> +			continue;
> +		switch (match_token(start, tokens, args)) {
> +		case MEMORY_RECLAIM_SWAPPINESS:
> +			if (match_int(&args[0], &swappiness))
> +				return -EINVAL;
> +			if (swappiness < MIN_SWAPPINESS || swappiness > MAX_SWAPPINESS)
> +				return -EINVAL;

Code forgot to use local `swappiness' for any purpose?

> +			break;
> +		default:
> +			return -EINVAL;
> +		}
> +	}
> +
>
> ...
>
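
For reference, a minimal userspace sketch of the input grammar that the
quoted reclaim_store() appears to accept: a size with an optional K/M/G
suffix, followed by an optional swappiness=N token. This is not kernel
code. The names parse_size(), parse_reclaim_request() and struct
reclaim_req are hypothetical, and the 0..200 swappiness bounds are an
assumption, since the quoted hunk only names MIN_SWAPPINESS and
MAX_SWAPPINESS. The sketch hands the parsed swappiness back in the
result so that a caller has something to act on.

```c
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Assumed bounds; the quoted hunk only names the two macros. */
#define SKETCH_MIN_SWAPPINESS 0
#define SKETCH_MAX_SWAPPINESS 200

/* Hypothetical result type for this sketch. */
struct reclaim_req {
	unsigned long long bytes;	/* amount to reclaim, in bytes */
	int swappiness;			/* -1 means "not given" */
};

/* Parse a number with an optional K/M/G suffix, loosely like memparse(). */
static unsigned long long parse_size(const char *s, char **end)
{
	unsigned long long v = strtoull(s, end, 0);

	switch (**end) {
	case 'G': case 'g':
		v <<= 10;	/* fall through */
	case 'M': case 'm':
		v <<= 10;	/* fall through */
	case 'K': case 'k':
		v <<= 10;
		(*end)++;
		break;
	default:
		break;
	}
	return v;
}

/* Returns 0 on success, -EINVAL on malformed input. */
static int parse_reclaim_request(const char *input, struct reclaim_req *req)
{
	static const char key[] = "swappiness=";
	char buf[128];
	char *p, *tok, *end;
	long val;

	if (strlen(input) >= sizeof(buf))
		return -EINVAL;
	strcpy(buf, input);

	/* Skip leading whitespace; a size must come first. */
	p = buf;
	while (isspace((unsigned char)*p))
		p++;
	if (!isdigit((unsigned char)*p))
		return -EINVAL;

	req->bytes = parse_size(p, &p);
	req->swappiness = -1;

	/* Walk the remaining whitespace-separated tokens. */
	for (;;) {
		while (isspace((unsigned char)*p))
			p++;
		if (*p == '\0')
			break;
		tok = p;
		while (*p != '\0' && !isspace((unsigned char)*p))
			p++;
		if (*p != '\0')
			*p++ = '\0';

		/* Only "swappiness=<int>" is recognised; anything else is an error. */
		if (strncmp(tok, key, strlen(key)) != 0)
			return -EINVAL;
		tok += strlen(key);
		if (*tok == '\0')
			return -EINVAL;
		errno = 0;
		val = strtol(tok, &end, 10);
		if (errno != 0 || *end != '\0')
			return -EINVAL;
		if (val < SKETCH_MIN_SWAPPINESS || val > SKETCH_MAX_SWAPPINESS)
			return -EINVAL;
		req->swappiness = (int)val;
	}
	return 0;
}
```

Calling parse_reclaim_request() on the string from the changelog example,
"512M swappiness=10", fills in bytes = 512 MiB and swappiness = 10. A
bare size leaves swappiness at -1, which a caller could treat as "use the
default".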