On Fri, 10 Jul 2020, Yang Shi wrote:

> > To try to get more discussion on the subject, consider a malloc
> > implementation, like tcmalloc, that does MADV_DONTNEED to free memory back
> > to the system and how this freed memory is then described to userspace
> > depending on the kernel implementation.
> >
> > [ For the sake of this discussion, consider we have precise memcg stats
> > available to us although the actual implementation allows for some
> > variance (MEMCG_CHARGE_BATCH). ]
> >
> > With a 64MB heap backed by thp on x86, for example, the vma starts with an
> > rss of 64MB, all of which is anon and backed by hugepages.  Imagine some
> > aggressive MADV_DONTNEED freeing that ends up with only a single 4KB page
> > mapped in each 2MB aligned range.  The rss is now 32 * 4KB = 128KB.
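To make that sequence concrete, here is a minimal, hypothetical sketch
(untested, not part of the original mail) with mmap standing in for a
real malloc arena; it assumes an x86 system with thp enabled:

/*
 * Hypothetical sketch: reproduce the 64MB thp heap scenario above.
 * Assumes x86 with thp enabled ("always" or "madvise").
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define HEAP_SIZE	(64UL << 20)	/* 64MB heap */
#define PMD_SIZE	(2UL << 20)	/* huge page size on x86 */
#define PAGE_SZ		4096UL

int main(void)
{
	/* Over-allocate so we can pick a 2MB aligned start for thp. */
	char *map = mmap(NULL, HEAP_SIZE + PMD_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	char *heap = (char *)(((unsigned long)map + PMD_SIZE - 1) &
			      ~(PMD_SIZE - 1));

	madvise(heap, HEAP_SIZE, MADV_HUGEPAGE);
	memset(heap, 1, HEAP_SIZE);	/* fault in 32 hugepages */

	/*
	 * Free everything except a single 4KB page in each 2MB aligned
	 * range.  The huge pmd is split synchronously; the underlying
	 * compound pages sit on the deferred split queue and remain
	 * charged to the memcg.
	 */
	for (unsigned long off = 0; off < HEAP_SIZE; off += PMD_SIZE)
		madvise(heap + off + PAGE_SZ, PMD_SIZE - PAGE_SZ,
			MADV_DONTNEED);

	/* rss is now 32 * 4KB = 128KB; memory.current still shows 64MB. */
	getchar();	/* pause here to inspect smaps / memory.stat */
	return 0;
}

Back to the accounting: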
> >
> > Before freeing, anon, anon_thp, and active_anon in memory.stat would all
> > be the same for this vma (64MB).  64MB would also be charged to
> > memory.current.  That's all working as intended and to the expectation of
> > userspace.
> >
> > After freeing, however, we have the kernel implementation specific detail
> > of how huge pmd splitting is handled (rss) in comparison to the underlying
> > split of the compound page (deferred split queue).  The huge pmd is always
> > split synchronously after MADV_DONTNEED so, as mentioned, the rss is 128KB
> > for this vma and none of it is backed by thp.
> >
> > What is charged to the memcg (memory.current) and what is on active_anon
> > is unchanged, however, because the underlying compound pages are still
> > charged to the memcg.  The amounts of anon and anon_thp, though, are
> > decreased to reflect the splitting of the page tables.
> >
> > So after freeing, for this vma: anon = 128KB, anon_thp = 0,
> > active_anon = 64MB, memory.current = 64MB.
> >
> > In this case, because of the deferred split queue, which is a kernel
> > implementation detail, userspace may be unclear on what is actually
> > reclaimable -- and this memory is reclaimable under memory pressure.  For
> > the motivation of MemAvailable (what amount of memory is available for
> > starting new work), userspace *could* determine this through the
> > aforementioned active_anon - anon (or some combination of
> > memory.current - anon - file - slab), but I think it's a fair point that
> > userspace's view of reclaimable memory, as the kernel implementation
> > changes, is something that can and should remain consistent between
> > versions.
> >
> > Otherwise, an earlier implementation before deferred split queues could
> > have safely assumed that active_anon was unreclaimable unless swap were
> > enabled.  It doesn't have the foresight based on future kernel
> > implementation detail to reconcile what the amount of reclaimable memory
> > actually is.
> >
> > The same discussion could happen for lazy free memory, which is anon but
> > now appears in the file lru stats and not the anon lru stats: it's easily
> > reclaimable under memory pressure but you need to reconcile the difference
> > between the anon metric and what is revealed in the anon lru stats.
> >
> > That gave way to my original thought of a si_mem_available()-like
> > calculation ("avail") by doing
> >
> > 	free = memory.high - memory.current
> 
> I'm wondering what if high or max is set to max limit. Don't you end
> up seeing a super large memavail?

Hi Yang,

Yes, this would be the same as seeing a super large limit :)

I'm indifferent to whether this is described as an available amount of
memory (almost identical to MemAvailable) or a best guess of the
reclaimable amount of memory from the memory that is currently charged.
The concept is to provide userspace with this best guess, like we do for
system memory through MemAvailable, because it (a) depends on
implementation details in the kernel and (b) is the only way to maintain
consistency from version to version.

> > 	lazyfree = file - (active_file + inactive_file)
> 
> Isn't it (active_file + inactive_file) - file ? It looks MADV_FREE
> just updates inactive lru size.

Yes, you're right, this would be

	lazyfree = (active_file + inactive_file) - file

from memory.stat.  Lazy free memory consists of clean anon pages on the
inactive file lru, but we must compare active_file + inactive_file
against "file" for the total amount of lazy free memory.  Another side
effect of this is that we'd need anon - lazyfree swap space available
for this workload to be swapped.

The overall point I'm trying to highlight is that the amount of memory
that can be freed under memory pressure, either lazy free or on the
deferred split queues, can be substantial.  I'd like to discuss the
feasibility of adding this as a kernel maintained stat in memory.stat
rather than userspace attempting to derive it on its own.

> > 	deferred = active_anon - anon
> >
> > 	avail = free + lazyfree + deferred +
> > 		(active_file + inactive_file + slab_reclaimable) / 2
> >
> > And we have the ability to change this formula based on kernel
> > implementation details as they evolve.  The idea is to provide a
> > consistent field that userspace can use to determine the rough amount
> > of reclaimable memory in a MemAvailable-like way.
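For reference, the derivation userspace would have to do today can be
sketched as below.  This is hypothetical and untested, not part of the
original mail; the cgroup2 mount point, the argv handling, and the
clamping of negative intermediate values are my assumptions, while the
memory.stat field names and the formula itself are as quoted above.

/*
 * Hypothetical sketch: compute "avail" in userspace from the quoted
 * formula.  Assumes cgroup2 mounted at /sys/fs/cgroup and a cgroup
 * path passed as argv[1].
 */
#include <limits.h>
#include <stdio.h>
#include <string.h>

/* Return the value of "key" from a flat keyed file like memory.stat. */
static long long stat_field(const char *path, const char *key)
{
	char line[256], name[64];
	long long val = 0, v;
	FILE *f = fopen(path, "r");

	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "%63s %lld", name, &v) == 2 &&
		    !strcmp(name, key))
			val = v;
	fclose(f);
	return val;
}

/* Read a single-value file such as memory.current or memory.high. */
static long long single_value(const char *path)
{
	char buf[32] = "";
	long long val = LLONG_MAX;	/* "max" acts as a huge limit */
	FILE *f = fopen(path, "r");

	if (f) {
		if (fscanf(f, "%31s", buf) == 1 && strcmp(buf, "max"))
			sscanf(buf, "%lld", &val);
		fclose(f);
	}
	return val;
}

int main(int argc, char **argv)
{
	char stat[512], cur[512], high[512];

	if (argc != 2) {
		fprintf(stderr, "usage: %s <cgroup>\n", argv[0]);
		return 1;
	}
	snprintf(stat, sizeof(stat), "/sys/fs/cgroup/%s/memory.stat", argv[1]);
	snprintf(cur, sizeof(cur), "/sys/fs/cgroup/%s/memory.current", argv[1]);
	snprintf(high, sizeof(high), "/sys/fs/cgroup/%s/memory.high", argv[1]);

	long long anon = stat_field(stat, "anon");
	long long file = stat_field(stat, "file");
	long long active_anon = stat_field(stat, "active_anon");
	long long active_file = stat_field(stat, "active_file");
	long long inactive_file = stat_field(stat, "inactive_file");
	long long slab_rec = stat_field(stat, "slab_reclaimable");

	/* memory.high of "max" yields a super large avail, as noted above. */
	long long free = single_value(high) - single_value(cur);

	/* Clamp at zero: the stats are sampled, not atomic snapshots. */
	long long lazyfree = active_file + inactive_file - file;
	if (lazyfree < 0)
		lazyfree = 0;
	long long deferred = active_anon - anon;
	if (deferred < 0)
		deferred = 0;

	long long avail = free + lazyfree + deferred +
			  (active_file + inactive_file + slab_rec) / 2;

	printf("avail: %lld bytes\n", avail);
	return 0;
}

Every kernel implementation detail this encodes (the deferred split
queue, lazy free pages on the file lru) is exactly what a kernel
maintained stat would insulate userspace from.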