On Fri, Jul 10, 2020 at 12:49 PM David Rientjes <rientjes@xxxxxxxxxx> wrote:
>
> On Tue, 7 Jul 2020, David Rientjes wrote:
>
> > Another use case would be motivated by exactly the MemAvailable use case:
> > when bound to a memcg hierarchy, how much memory is available without
> > substantial swap or risk of oom for starting a new process or service?
> > This would not trigger any memory.low or PSI notification but is a
> > heuristic that can be used to determine what can and cannot be started
> > without incurring substantial memory reclaim.
> >
> > I'm indifferent to whether this would be a "reclaimable" or "available"
> > metric, with a slight preference toward making it as similar in
> > calculation to MemAvailable as possible, so I think the question is
> > whether this is something the user should be deriving themselves based on
> > memcg stats that are exported or whether we should solidify this based on
> > how the kernel handles reclaim as a metric that will carry over across
> > kernel versions?
> >
>
> To try to get more discussion on the subject, consider a malloc
> implementation, like tcmalloc, that does MADV_DONTNEED to free memory back
> to the system and how this freed memory is then described to userspace
> depending on the kernel implementation.
>
> [ For the sake of this discussion, consider we have precise memcg stats
>   available to us, although the actual implementation allows for some
>   variance (MEMCG_CHARGE_BATCH). ]
>
> With a 64MB heap backed by thp on x86, for example, the vma starts with an
> rss of 64MB, all of which is anon and backed by hugepages. Imagine some
> aggressive MADV_DONTNEED freeing that ends up with only a single 4KB page
> mapped in each 2MB aligned range. The rss is now 32 * 4KB = 128KB.
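
To make that concrete, this is roughly the pattern I understand you to be
describing (a hypothetical sketch, not actual tcmalloc code; it assumes the
mapping ends up 2MB aligned and actually gets thp backing):

#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define HEAP_SZ		(64UL << 20)	/* 64MB heap */
#define HPAGE_SZ	(2UL << 20)	/* thp size on x86 */
#define PAGE_SZ		4096UL

int main(void)
{
	char *heap = mmap(NULL, HEAP_SZ, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (heap == MAP_FAILED)
		return 1;
	madvise(heap, HEAP_SZ, MADV_HUGEPAGE);	/* ask for thp backing */
	memset(heap, 1, HEAP_SZ);	/* fault in: rss = 64MB, anon_thp */

	/*
	 * Free all but the first 4KB of each 2MB aligned range.  Each call
	 * splits the huge pmd synchronously; the compound page itself goes
	 * on the deferred split queue.
	 */
	for (unsigned long off = 0; off < HEAP_SZ; off += HPAGE_SZ)
		madvise(heap + off + PAGE_SZ, HPAGE_SZ - PAGE_SZ,
			MADV_DONTNEED);

	/* rss is now 32 * 4KB = 128KB; memory.current is still 64MB */
	pause();
	return 0;
}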
>
> Before freeing, anon, anon_thp, and active_anon in memory.stat would all
> be the same for this vma (64MB). 64MB would also be charged to
> memory.current. That's all working as intended and to the expectation of
> userspace.
>
> After freeing, however, we have the kernel implementation-specific detail
> of how huge pmd splitting is handled (rss) in comparison to the underlying
> split of the compound page (deferred split queue). The huge pmd is always
> split synchronously after MADV_DONTNEED so, as mentioned, the rss is 128KB
> for this vma and none of it is backed by thp.
>
> What is charged to the memcg (memory.current) and what is on active_anon
> is unchanged, however, because the underlying compound pages are still
> charged to the memcg. The anon and anon_thp amounts, though, are decreased
> to reflect the splitting of the page tables.
>
> So after freeing, for this vma: anon = 128KB, anon_thp = 0,
> active_anon = 64MB, memory.current = 64MB.
>
> In this case, because of the deferred split queue, which is a kernel
> implementation detail, userspace may be unclear on what is actually
> reclaimable -- and this memory is reclaimable under memory pressure. For
> the motivation of MemAvailable (what amount of memory is available for
> starting new work), userspace *could* determine this through the
> aforementioned active_anon - anon (or some combination of
> memory.current - anon - file - slab), but I think it's a fair point that
> userspace's view of reclaimable memory as the kernel implementation
> changes is something that can and should remain consistent between
> versions.
>
> Otherwise, an earlier implementation before deferred split queues could
> have safely assumed that active_anon was unreclaimable unless swap were
> enabled. It doesn't have the foresight based on future kernel
> implementation detail to reconcile what the amount of reclaimable memory
> actually is.
>
> The same discussion could happen for lazy free memory, which is anon but
> now appears in the file lru stats and not the anon lru stats: it's easily
> reclaimable under memory pressure but you need to reconcile the difference
> between the anon metric and what is revealed in the anon lru stats.
>
> That gave way to my original thought of a si_mem_available()-like
> calculation ("avail") by doing
>
> free = memory.high - memory.current

I'm wondering: what if high or max is set to the max limit? Don't you end
up seeing a super large memavail?

> lazyfree = file - (active_file + inactive_file)

Isn't it (active_file + inactive_file) - file? It looks like MADV_FREE
just updates the inactive lru size.

> deferred = active_anon - anon
>
> avail = free + lazyfree + deferred +
>         (active_file + inactive_file + slab_reclaimable) / 2
>
> And we have the ability to change this formula based on kernel
> implementation details as they evolve. The idea is to provide a consistent
> field that userspace can use to determine the rough amount of reclaimable
> memory in a MemAvailable-like way.
>
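For reference, this is roughly how I'd expect userspace to derive that
today from the cgroup v2 files (a rough sketch only -- the cgroup path is
hypothetical, there's no error or underflow handling, and I've written
lazyfree with the ordering I suggested above):

#include <stdio.h>
#include <string.h>

#define CG "/sys/fs/cgroup/test"	/* hypothetical cgroup of interest */

/* return the value for "key" from memory.stat, 0 if absent */
static unsigned long long stat_val(const char *key)
{
	char name[64];
	unsigned long long v, ret = 0;
	FILE *f = fopen(CG "/memory.stat", "r");

	if (!f)
		return 0;
	while (fscanf(f, "%63s %llu", name, &v) == 2)
		if (!strcmp(name, key)) {
			ret = v;
			break;
		}
	fclose(f);
	return ret;
}

/* read memory.current / memory.high; "max" means no limit at all */
static unsigned long long file_val(const char *path)
{
	char buf[32] = "";
	unsigned long long v = 0;
	FILE *f = fopen(path, "r");

	if (!f)
		return 0;
	fscanf(f, "%31s", buf);
	fclose(f);
	if (!strcmp(buf, "max"))	/* the unbounded case I asked about */
		return ~0ULL;
	sscanf(buf, "%llu", &v);
	return v;
}

int main(void)
{
	unsigned long long current = file_val(CG "/memory.current");
	unsigned long long high = file_val(CG "/memory.high");
	unsigned long long anon = stat_val("anon");
	unsigned long long file = stat_val("file");
	unsigned long long active_anon = stat_val("active_anon");
	unsigned long long active_file = stat_val("active_file");
	unsigned long long inactive_file = stat_val("inactive_file");
	unsigned long long slab_reclaimable = stat_val("slab_reclaimable");

	unsigned long long free = high - current;
	/* the lru counts include lazy free pages that "file" does not */
	unsigned long long lazyfree = (active_file + inactive_file) - file;
	unsigned long long deferred = active_anon - anon;
	unsigned long long avail = free + lazyfree + deferred +
		(active_file + inactive_file + slab_reclaimable) / 2;

	printf("avail: %llu\n", avail);
	return 0;
}

With memory.high unset ("max"), the free term dominates and avail gets
super large, per my question above.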