On Tue, 7 Jul 2020, David Rientjes wrote:

> Another use case would be motivated by exactly the MemAvailable use case:
> when bound to a memcg hierarchy, how much memory is available without
> substantial swap or risk of oom for starting a new process or service?
> This would not trigger any memory.low or PSI notification but is a
> heuristic that can be used to determine what can and cannot be started
> without incurring substantial memory reclaim.
>
> I'm indifferent to whether this would be a "reclaimable" or "available"
> metric, with a slight preference toward making it as similar in
> calculation to MemAvailable as possible, so I think the question is
> whether this is something the user should be deriving themselves based on
> memcg stats that are exported or whether we should solidify this based on
> how the kernel handles reclaim as a metric that will carry over across
> kernel versions?
>

To try to get more discussion on the subject, consider a malloc
implementation, like tcmalloc, that does MADV_DONTNEED to free memory back
to the system, and consider how this freed memory is then described to
userspace depending on the kernel implementation.

[ For the sake of this discussion, assume we have precise memcg stats
  available to us, although the actual implementation allows for some
  variance (MEMCG_CHARGE_BATCH). ]

With a 64MB heap backed by thp on x86, for example, the vma starts with an
rss of 64MB, all of which is anon and backed by hugepages.  Imagine some
aggressive MADV_DONTNEED freeing that leaves only a single 4KB page mapped
in each 2MB aligned range.  The rss is now 32 * 4KB = 128KB.

Before freeing, anon, anon_thp, and active_anon in memory.stat would all
be the same for this vma (64MB).  64MB would also be charged to
memory.current.  That's all working as intended and to the expectation of
userspace.

After freeing, however, we hit the kernel implementation specific detail
of how huge pmd splitting is handled (rss) in contrast to the underlying
split of the compound page (deferred split queue).

The huge pmd is always split synchronously after MADV_DONTNEED so, as
mentioned, the rss is 128KB for this vma and none of it is backed by thp.
What is charged to the memcg (memory.current) and what is accounted as
active_anon is unchanged, however, because the underlying compound pages
are still charged to the memcg.  The amounts of anon and anon_thp are
decreased to match the splitting of the page tables, however.

So after freeing, for this vma: anon = 128KB, anon_thp = 0, active_anon =
64MB, memory.current = 64MB.

In this case, because of the deferred split queue, which is a kernel
implementation detail, userspace may be unclear on what is actually
reclaimable -- yet this memory *is* reclaimable under memory pressure.
For the motivation of MemAvailable (how much memory is available for
starting new work), userspace *could* derive this from the aforementioned
active_anon - anon (or some combination of memory.current - anon - file -
slab), but I think it's a fair point that userspace's view of reclaimable
memory should stay consistent across kernel versions even as the
implementation changes.  Otherwise, an earlier implementation, before
deferred split queues, could have safely assumed that active_anon was
unreclaimable unless swap were enabled; userspace has no foresight into
future kernel implementation details to reconcile what the amount of
reclaimable memory actually is.
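To make the scenario concrete, a minimal (hypothetical) reproducer for
the above might look like the following -- this only illustrates the
MADV_DONTNEED pattern described, it is not code from tcmalloc or any
real allocator:

	#include <string.h>
	#include <unistd.h>
	#include <sys/mman.h>

	#define HEAP_SIZE	(64UL << 20)	/* 64MB heap */
	#define HPAGE_SIZE	(2UL << 20)	/* 2MB hugepage */
	#define SMALL_PAGE	4096UL		/* 4KB base page */

	int main(void)
	{
		unsigned long off;
		char *heap = mmap(NULL, HEAP_SIZE, PROT_READ | PROT_WRITE,
				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (heap == MAP_FAILED)
			return 1;

		/* Request thp backing (subject to system thp settings). */
		madvise(heap, HEAP_SIZE, MADV_HUGEPAGE);

		/* Fault everything in: rss = anon = anon_thp = 64MB. */
		memset(heap, 1, HEAP_SIZE);

		/*
		 * Free all but the first 4KB page of each 2MB aligned
		 * range.  The huge pmds are split synchronously, so rss
		 * drops to 32 * 4KB = 128KB, but the compound pages sit
		 * on the deferred split queue and memory.current stays
		 * at 64MB until memory pressure splits and reclaims them.
		 */
		for (off = 0; off < HEAP_SIZE; off += HPAGE_SIZE)
			madvise(heap + off + SMALL_PAGE,
				HPAGE_SIZE - SMALL_PAGE, MADV_DONTNEED);

		pause();	/* inspect memory.stat/memory.current now */
		return 0;
	}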
The same discussion could happen for lazy free memory, which is anon but
now appears in the file lru stats rather than the anon lru stats: it's
easily reclaimable under memory pressure, but you need to reconcile the
difference between the anon metric and what is revealed in the anon lru
stats.

That led to my original thought of a si_mem_available()-like calculation
("avail"):

	free = memory.high - memory.current
	lazyfree = file - (active_file + inactive_file)
	deferred = active_anon - anon

	avail = free + lazyfree + deferred +
		(active_file + inactive_file + slab_reclaimable) / 2

We then have the ability to change this formula as kernel implementation
details evolve.  The idea is to provide a consistent field that userspace
can use to determine the rough amount of reclaimable memory in a
MemAvailable-like way.
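As a sketch of what the consumer side looks like today (and what a
kernel-provided field would replace), assume cgroup v2 and a hypothetical
memcg at /sys/fs/cgroup/foo with a concrete memory.high set (not "max");
read_ll() and stat_val() are illustrative helpers, and the subtractions
are done in signed arithmetic but otherwise taken exactly as written
above:

	#include <stdio.h>
	#include <string.h>

	/* Read a single number from a flat file like memory.current. */
	static long long read_ll(const char *path)
	{
		long long val = 0;
		FILE *f = fopen(path, "r");

		if (f) {
			fscanf(f, "%lld", &val);
			fclose(f);
		}
		return val;
	}

	/* Find "key <value>" in a keyed file like memory.stat. */
	static long long stat_val(const char *path, const char *key)
	{
		char name[64];
		long long v, val = 0;
		FILE *f = fopen(path, "r");

		if (!f)
			return 0;
		while (fscanf(f, "%63s %lld", name, &v) == 2) {
			if (!strcmp(name, key)) {
				val = v;
				break;
			}
		}
		fclose(f);
		return val;
	}

	int main(void)
	{
		/* Hypothetical memcg; adjust for the hierarchy at hand. */
		const char *stat = "/sys/fs/cgroup/foo/memory.stat";
		long long high, current, file, active_file, inactive_file;
		long long active_anon, anon, slab_reclaimable;
		long long free, lazyfree, deferred, avail;

		high = read_ll("/sys/fs/cgroup/foo/memory.high");
		current = read_ll("/sys/fs/cgroup/foo/memory.current");
		file = stat_val(stat, "file");
		active_file = stat_val(stat, "active_file");
		inactive_file = stat_val(stat, "inactive_file");
		active_anon = stat_val(stat, "active_anon");
		anon = stat_val(stat, "anon");
		slab_reclaimable = stat_val(stat, "slab_reclaimable");

		free = high - current;
		lazyfree = file - (active_file + inactive_file);
		deferred = active_anon - anon;
		avail = free + lazyfree + deferred +
			(active_file + inactive_file + slab_reclaimable) / 2;

		printf("avail: %lld bytes\n", avail);
		return 0;
	}

The point is less this particular program than the fact that, without a
kernel-provided "avail", every consumer has to hard-code these
implementation-dependent subtractions and revisit them on every kernel
change.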