On Tue, Oct 07, 2014 at 03:59:50PM +0200, Michal Hocko wrote:
> On Mon 29-09-14 13:57:00, Johannes Weiner wrote:
> > Every change we make is a trade-off and bears a certain risk. THP is a trade-off, it's pretty pointless to ignore the upsides and ride around on the downsides. Of course there are downsides. This patch makes THP work properly inside memcg, which invites both the upsides as well as the downsides of THP into memcg. But they are well known and we can deal with them.
>
> I do not see any evaluation nor discussion of the upsides and downsides in the changelog. You are selling this as a net win which I cannot agree with.

I'm not sure why you want me to regurgitate the pros and cons of transparent huge pages here. They have been a well-established default feature for a while now, and they are currently not working properly inside memcg, which this patch addresses.

The only valid argument against merging this patch at this point would be that THP inside memcg leads to distinct issues that do not exist on the global level. So let's find out if there are any, okay?

> I am completely missing any notes about potential excessive swapouts or longer reclaim stalls which are a natural side effect of direct reclaim with a larger target (or is this something we do not agree on?).

Yes, we disagree here. Why is reclaiming 2MB once worse than entering reclaim 16 times to reclaim SWAP_CLUSTER_MAX each time? There is no inherent difference between reclaiming one big chunk and reclaiming many small chunks that add up to the same size.

It's only different if you don't actually use the full 2MB pages, but then the issue simply boils down to increased memory consumption. That is easy to deal with, and I offered two solutions in my changelog.

> What is an admin/user supposed to do when one of the above happens? Disable THP globally?

I already wrote all that. It would be easier if you read my entire line of reasoning instead of attacking fragments in isolation, so that we can make forward progress on this.

> I still remember when THP was introduced and we have seen boatload of reclaim related bugs. These were exactly those long stalls, excessive swapouts and reclaim.

THP certainly had a bumpy introduction, and I can understand that you want to prevent the same from happening to memcg. But it's important to evaluate which THP costs actually translate to memcg.

The worst problems we had came *not* from faulting in bigger steps, but from creating physically contiguous pages: aggressive lumpy reclaim, (synchronous) migration, reclaim beyond the allocation size to create breathing room for compaction, etc. This is a massive amount of work ON TOP of the bigger fault granularity. Memcg only has to reclaim the allocation size in individual 4k pages. Our only risk from THP is internal fragmentation from users not fully utilizing the entire 2MB regions, but the work we MIGHT waste there is negligible compared to the work we are DEFINITELY wasting right now by failing to charge already allocated THPs.

> > Why is THP inside memcg special?
>
> For one thing the global case is hitting its limit (watermarks) much more slowly and gracefully because it has kswapd working on the background before we are getting into troubles. Memcg will just hit the wall and rely solely on the direct reclaim so everything we do will end up latency sensitive.

THP allocations do not wake up kswapd, they go into direct reclaim.
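(For reference, the THP fault path allocates with GFP_TRANSHUGE, which - at least in kernels of this vintage - carries __GFP_NO_KSWAPD; I'm quoting the define from memory, so the exact flag combination may differ slightly between releases:

    #define GFP_TRANSHUGE	(GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
				 __GFP_NOMEMALLOC | __GFP_NORETRY | \
				 __GFP_NOWARN | __GFP_NO_KSWAPD)

So a THP fault that can't be satisfied from the free lists goes straight to direct reclaim and compaction rather than kicking the background reclaimer.)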
It's likely that kswapd balancing triggered by concurrent order-0 allocations will help THP somewhat, but because the global level needs contiguous pages, it will still likely enter direct reclaim and direct compaction. For example, on my 16G desktop, there are 12MB between the high and low watermark in the Normal zone; compaction needs double the allocation size to work, so kswapd can cache reclaim work for up to 3 THP in the best case (which those concurrent order-0 allocations that woke kswapd in the first place will likely eat into), and direct compaction will still have to run. So AFAICS the synchronous work required to fit a THP inside memcg is much less. And again, under pressure all this global work is already expensed at that point anyway.

> Moreover, THP allocations have self regulatory mechanisms to prevent from excessive stalls. This means that THP allocations are less probable under heavy memory pressure.

These mechanisms exist for migration/compaction, but direct reclaim is still fairly persistent - again, see should_continue_reclaim(). Am I missing something?

> On the other hand, memcg might be under serious memory pressure when THP charge comes. The only back off mechanism we use in memcg is GFP_NORETRY and that happens after one round of the reclaim. So we should make sure that the first round of the reclaim doesn't take terribly long.

The same applies globally. ANY allocation under serious memory pressure will have a high latency, but nobody forces you to use THP in an already underprovisioned environment.

> Another part that matters is the size. Memcgs might be really small and that changes the math. Large reclaim target will get to low prio reclaim and thus the excessive reclaim.

I already addressed page size vs. memcg size before. However, low-priority reclaim does not result in excessive reclaim. The reclaim goal is checked every time SWAP_CLUSTER_MAX pages have been scanned, and reclaim exits once the goal has been met. See shrink_lruvec(), shrink_zone() etc.

> The size also makes any potential problem much more probable because the limit would be hit much more often than extremely low memory conditions globally.
>
> Also the reclaim decisions are subtly different for memcg because of the missing per-memcg dirty throttling and flushing. So we can stall on pages under writeback or get stuck in the write out path which is not the case for direct reclaim during THP allocation. A large reclaim target is more probable to hit into dirty or writeback pages.

These things again boil down to potential internal fragmentation and higher memory consumption, as 16 128k reclaims are just as likely to hit both problems as one 2MB reclaim.

> > Preventing THP faults from swapping is a reasonable proposal, but again has nothing to do with memcg.
>
> If we can do this inside the direct reclaim path then I am all for it because this means less trickery in the memcg code.
>
> I am still not sure this is sufficient because memcg still might stall on IO so the safest approach would be ~GFP_IO reclaim for memcg reclaim path.
>
> I feel strong about the first one (.may_swap = 0) and would be OK with your patch if this is added (to the memcg or common path). GFP_IO is an extra safety step. Smaller groups would be more likely to fail to reclaim enough and so THP success rate will be lower but that doesn't sound terribly wrong to me. I am not insisting on it, though.

Would you like to propose the no-swapping patch for the generic reclaim code?
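For the fault path I imagine it wouldn't be much more than flipping may_swap in the scan_control that try_to_free_pages() sets up - an untested sketch only, and whether keying off __GFP_NO_KSWAPD is the right way to recognize a THP fault there is certainly debatable:

    struct scan_control sc = {
    	.nr_to_reclaim	= SWAP_CLUSTER_MAX,
    	.gfp_mask	= gfp_mask,
    	.order		= order,
    	.nodemask	= nodemask,
    	.priority	= DEF_PRIORITY,
    	.may_writepage	= !laptop_mode,
    	.may_unmap	= 1,
    	/*
    	 * Don't swap anything out just to satisfy a huge page
    	 * fault; if dropping clean cache isn't enough, fall
    	 * back to 4k pages instead.
    	 */
    	.may_swap	= !(gfp_mask & __GFP_NO_KSWAPD),
    };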
I'm certainly not against it, but I think the reason nobody has proposed this yet is that the VM is heavily tuned to prefer cache reclaim anyway, and it's rare that environments run out of cache and actually swap. That usually means memory is underprovisioned. So I wouldn't be opposed to it as a fail-safe, in case worst comes to worst, but I think it's a lot less important than you do.

> > However, in this particular case a regression is trivial to pinpoint (comparing vmstat, profiles), and trivial to rectify in the field by changing the memcg limits or disabling THP.
> >
> > What we DO know is that there are very good use cases for THP, but THP inside memcg is broken:
>
> All those usecases rely on amortizing THP initial costs by less faults (assuming the memory range is not used sparsely too much) and the TLB pressure reduction. Once we are hitting swap or excessive reclaim all the bets are off and THP is no longer beneficial.

Yes, we agree on this, we just disagree on the importance of that case. And both the problem and the solution would be unrelated to this patch.

> > THP does worse inside a memcg when compared to bare metal environments of the same size, both in terms of success rate, as well as in fault latency due to wasted page allocator work.
>
> Because memcg is not equivalent to the bare metal with the same amount of memory. If for nothing else then because the background reclaim is missing.

Which THP is explicitly not using globally.

> > Plus, the code is illogical, redundant, and full of magic numbers.
>
> I am not objecting to the removal of magic numbers and to getting rid of retry loops outside of direct reclaim path (aka mem_cgroup_reclaim). I would be willing to take a risk and get rid of them just to make the code saner. Because those were never justified properly and look more or less random. This would be a separate patch of course.
>
> > Based on this, this patch seems like a net improvement.
>
> Sigh, yes, if we ignore all the downsides everything will look like a net improvement :/

I don't think you honestly read my email.

> > > > This brings memcg's THP policy in line with the system policy: if the allocator painstakingly assembles a hugepage, memcg will at least make an honest effort to charge it. As a result, transparent hugepage allocation rates amid cache activity are drastically improved:
> > > >
> > > >                               vanilla                  patched
> > > > pgalloc            4717530.80 (  +0.00%)   4451376.40 (  -5.64%)
> > > > pgfault             491370.60 (  +0.00%)    225477.40 ( -54.11%)
> > > > pgmajfault               2.00 (  +0.00%)         1.80 (  -6.67%)
> > > > thp_fault_alloc          0.00 (  +0.00%)       531.60 (+100.00%)
> > > > thp_fault_fallback     749.00 (  +0.00%)       217.40 ( -70.88%)
> > >
> > > What is the load and configuration that you have measured?
> >
> > It's just a single linear disk writer and another thread that faults in an anonymous range in 4k steps.
>
> This is really vague description... Which portion of the limit is the anon consumer, what is the memcg limit size, IO size, etc...? I find it really interesting that _all_ THP charges failed so the memcg had to be almost fully populated by the page cache already when the thread tries so fault in the first huge page.
>
> Also 4k steps is basically the best case for THP because the full THP block is populated. The question is how the system behaves when THP ranges are populated sparsely (because this is often the case).
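(To make at least the anon side concrete, this is the kind of loop I mean by "faults in an anonymous range in 4k steps" - a minimal illustration only, not the exact harness or sizes behind the numbers above:

    #include <stdlib.h>

    int main(void)
    {
    	size_t size = 512UL << 20;	/* illustrative size; pick something that fits the memcg limit */
    	char *buf = malloc(size);

    	/*
    	 * Touch one byte per 4k page; with THP enabled, the first touch
    	 * of each unmapped 2MB-aligned region may fault in a hugepage.
    	 */
    	for (size_t off = 0; off < size; off += 4096)
    		buf[off] = 1;

    	return 0;
    }

The cache writer streams a file into the same memcg while this runs.)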
You are missing the point :( Sure, there are cases that don't benefit from THP; this test just shows that THP inside memcg can be trivially broken - which harms the cases that WOULD benefit.