On 8/12/19 2:34 AM, Michal Hocko wrote:
On Fri 09-08-19 16:54:43, Yang Shi wrote:
On 8/9/19 11:26 AM, Yang Shi wrote:
On 8/9/19 11:02 AM, Michal Hocko wrote:
[...]
I have to study the code some more but is there any reason why those
pages are not accounted as proper THPs anymore? Sure they are partially
unmapped but they are still THPs, so why can't we keep them accounted
like that? Having a new counter to reflect that sounds like papering
over the problem to me. But as I've said I might be missing something
important here.
I think we could keep those pages accounted in NR_ANON_THPS, since they
are still THPs although they are unmapped as you mentioned, if we just
want to fix the improper accounting.
I double checked what NR_ANON_THPS really means:
Documentation/filesystems/proc.txt says "Non-file backed huge pages mapped
into userspace page tables". So it makes some sense to decrement
NR_ANON_THPS when removing the rmap even though they are still THPs.
I don't think we would like to change the definition, so a new counter
may make more sense.
Yes, changing the NR_ANON_THPS semantics sounds like a bad idea. Let
me check whether I understand the problem. So we have some THPs in
limbo waiting to be split and for their unmapped parts to be freed,
right? I can see that page_remove_anon_compound_rmap does correctly
decrement NR_ANON_MAPPED for sub pages that are no longer mapped by
anybody. LRU pages seem to be accounted properly as well. As you've
said NR_ANON_THPS reflects the number of THPs mapped and that should be
reflecting the reality already IIUC.
So the only problem seems to be that deferred THPs might aggregate a lot
of immediately freeable memory (if none of the subpages are mapped) and
that can confuse MemAvailable because it doesn't know about them.
Has a skewed counter resulted in any user observable behavior/failures?
No. But the skewed counter may make a big difference for a large-scale
cluster. MemAvailable is an important factor for the cluster scheduler
when it determines capacity.
Even if the scheduler could place only one more small container on a
node thanks to the extra available memory, it would make a big
difference for a cluster with thousands of nodes.
I can see that memcg rss size was the primary problem David was looking
at. But MemAvailable will not help with that, right? Moreover is
Yes, but David actually would like to have a memcg MemAvailable (a
counter like the global one), which should be computed the same way as
the global one and should account per-memcg deferred split THPs properly.
accounting the full THP correct? What if subpages are still mapped?
"Deferred split" definitely doesn't mean they are free. When memory
pressure is hit, they would be split, and then the unmapped normal pages
would be freed. So, when calculating MemAvailable, they are not counted
at 100%, but like "available += lazyfree - min(lazyfree / 2,
wmark_low)", just like how page cache is accounted.
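
To make the estimate above concrete, here is a minimal userspace sketch of
that one line of arithmetic. The function names are made up for
illustration and are not kernel helpers; the inputs are plain page counts,
where lazyfree stands for the pages sitting on the deferred split queue and
wmark_low for the low watermark.

```c
#include <stdint.h>

/* Return the smaller of two page counts. */
static uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/*
 * Hypothetical helper (not a kernel function): how many of the lazily
 * freeable pages to add to the "available" estimate.
 *
 * At least half of the pages are always counted, and the amount held
 * back is capped at wmark_low, mirroring
 *     available += lazyfree - min(lazyfree / 2, wmark_low)
 *
 * The subtraction cannot underflow because the discount is never larger
 * than lazyfree / 2.
 */
static uint64_t lazyfree_available(uint64_t lazyfree, uint64_t wmark_low)
{
	return lazyfree - min_u64(lazyfree / 2, wmark_low);
}
```

For example, with lazyfree = 1000 pages and wmark_low = 200 pages the
contribution is 800 pages; with lazyfree = 100 and the same watermark it is
50 pages, i.e. only half is trusted when the queue is small.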
We could get a more accurate count, i.e. by checking each sub page's
mapcount when accounting, but it may change before the shrinker starts
scanning. So, just use the ballpark estimate, trading some accuracy for
less complexity.
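
For comparison, the more exact alternative mentioned above would look
roughly like the sketch below. This is a simplified userspace model, not
kernel code: the array of per-subpage mapping counts, the function name and
the 512 subpages per THP (2 MiB THP with 4 KiB base pages) are all
assumptions for illustration.

```c
#include <stddef.h>

/* Assumed THP geometry: 2 MiB huge page made of 4 KiB base pages. */
#define SUBPAGES_PER_THP 512

/*
 * Hypothetical model: mapcount[i] is the number of page table mappings
 * of subpage i. Returns how many subpages are mapped by nobody, i.e. how
 * many pages a split could free right now.
 */
static size_t count_unmapped_subpages(const int mapcount[SUBPAGES_PER_THP])
{
	size_t unmapped = 0;

	for (size_t i = 0; i < SUBPAGES_PER_THP; i++) {
		if (mapcount[i] == 0)
			unmapped++;
	}
	return unmapped;
}
```

The result is only a snapshot: any of the counts can change between this
walk and the time the shrinker actually scans the queue, which is why the
cheaper estimate is used instead.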