Hi,

I would like to attend LSF/MM to discuss memory management topics, specifically the following issues:

Cache allocation on NUMA
------------------------

The default allocation policy on NUMA is to always try the local node first and then fall back to remote nodes. When all nodes are at their low watermark, the kswapd of each node is woken up and the allocation is retried.

Here is the problem: as soon as kswapd starts freeing local memory, subsequent local allocation attempts will succeed again. But at the same time, they prevent the local kswapd from restoring the high watermark and going back to sleep. If the paces of reclaim and allocation match up, kswapd keeps the local node allocatable and the stream of allocations keeps kswapd awake. If the workingset is bigger than the local node, we end up thrashing it while free remote memory is readily available, and for most secondary storage the IO cost of thrashing is obviously much higher than the cost of remote references.

For anonymous memory this phenomenon is not as severe, because people try to match anon size to node size and avoid anon reclaim (swapping) as much as possible. But cache often exceeds the local node size, and clean cache is quickly reclaimable, which makes this scenario very likely for cache and observable in practice.

How do we get this right?

One idea would be to have kswapd bail after it has reclaimed high-minus-low-watermark worth of pages. However, the behavior described above might not be entirely undesirable, for example when kswapd reclaims cache to allow anonymous memory to be placed locally, comparable to zone_reclaim_mode=1 but without all the direct reclaim latency. It would also reduce kswapd's effectiveness at reducing latency when it IS legitimate to keep running, i.e. when all nodes ARE in equal use.

Another idea would be to change the default allocation policy such that cache allocations are placed round-robin across nodes. This is a drawback for workloads whose workingset, including cache, does not exceed the local node, but for those we could offer a mempolicy while providing a sensible default for everybody else.

The question here would be how we approach this problem, whether the solution involves user interface changes, and what the default behavior should be.
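To make the first idea a bit more concrete, a budgeted kswapd loop might look roughly like the sketch below. This is not against any particular tree; shrink_node_once() and node_balanced() are made-up stand-ins for the existing per-priority reclaim loop and the watermark check, and the point is only the bail-out condition.

/*
 * Rough sketch only: give kswapd a reclaim budget of high-minus-low
 * watermark pages and let it go back to sleep once that budget is
 * spent, even if concurrent allocations keep the node hovering below
 * the high watermark.  shrink_node_once() and node_balanced() are
 * made-up stand-ins for the existing reclaim loop and watermark check.
 */
static void kswapd_reclaim_budgeted(pg_data_t *pgdat, int order)
{
	unsigned long budget = 0, reclaimed = 0;
	int i;

	for (i = 0; i < MAX_NR_ZONES; i++) {
		struct zone *zone = &pgdat->node_zones[i];

		if (!populated_zone(zone))
			continue;
		budget += high_wmark_pages(zone) - low_wmark_pages(zone);
	}

	while (reclaimed < budget && !node_balanced(pgdat, order))
		reclaimed += shrink_node_once(pgdat, order);

	/*
	 * Go back to sleep here even if allocations kept eating the
	 * freed pages, instead of looping until the high watermark is
	 * restored.
	 */
}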
File LRU balancing
------------------

The Linux VM has two LRU lists for file pages: the "inactive" list for recently faulted pages, and the "active" list to which pages get promoted when they are accessed multiple times.

Linux also reclaims lazily, which means that there is usually no free memory left over and the inactive and active lists share all available memory. As the active list grows, it reduces the space available to the inactive list. A smaller inactive list means faster eviction of the pages on it, so a page that would have been activated while the active list was small may well be evicted before its second access once the active list has grown. This in turn means that, as a workingset establishes itself on the active list, the current VM turns blind to workingset changes. The result is a complete breakdown of the VM's caching ability when a workingset change exceeds the size of the inactive list (which currently has a fixed minimum of 50% of memory, a restriction with downsides of its own).

I have sent patches that set out to fix this problem. They do so by remembering eviction information as inactive pages get reclaimed, and then using this information to reconstruct the access distance when a page refaults. If that distance is within the theoretical maximum size of the inactive list (inactive + active), the page gets activated directly. This makes the multiple-access detection immune to the physical inactive/active size balance.

People seemed interested in this at last year's LSF/MM, but the actual patch submissions have seen very little response from MM people on the lists, so I'd like to bring this up again.
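For reference, a minimal sketch of the refault distance idea, using made-up names rather than the actual patch interfaces: evictions tick a counter, an evicted page leaves a snapshot of that counter behind, and the gap between snapshot and current counter value at refault time tells us whether the page would have survived on an inactive list spanning all of file memory.

/*
 * Minimal sketch with made-up names; the real patches store the
 * snapshot in the page cache slot the evicted page used to occupy.
 */
static atomic_long_t lru_eviction_clock = ATOMIC_LONG_INIT(0);

/* Called when an inactive file page is reclaimed */
static void *remember_eviction(void)
{
	return (void *)atomic_long_inc_return(&lru_eviction_clock);
}

/* Called when a refaulting page finds the stored snapshot */
static bool refault_should_activate(void *shadow,
				    unsigned long nr_inactive,
				    unsigned long nr_active)
{
	unsigned long distance;

	distance = atomic_long_read(&lru_eviction_clock) -
		   (unsigned long)shadow;

	/*
	 * The page was evicted and re-accessed within a window that an
	 * inactive list the size of inactive + active could have held,
	 * so treat the refault like a second access and activate it.
	 */
	return distance <= nr_inactive + nr_active;
}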
Zeroconf memcgs
---------------

Memory cgroups, outside of the pay-per-container usecase, are awkward to configure because doing so requires precise knowledge of the workload. Task grouping is one thing, but finding a static upper memory limit for any given application is tough: a workingset size is not trivial to estimate, and it varies during execution. How much memory do I grant an rsync backup? A build job?

The idea behind zeroconf memcgs would be a respin of "local" reclaim policies on a per-memcg basis. No upper memory limit is defined, and a memcg consumes whatever physical memory is readily available. But as soon as memory is exhausted and a task has to initiate reclaim, it would not reclaim GLOBALLY from all memcgs in the system, as is the case right now. Instead, it would try to reclaim its own clean cache first, and fall back only if there is no clean cache or if that cache is thrashing. It's conceivable to use the refault information described in "File LRU balancing" to detect such thrashing. It would also fall back when the readahead window is being thrashed, i.e. when !PageReferenced pages are reclaimed.

On a populated system, an rsync backup workload would thus stay relatively contained by recycling its own used-once cache before stealing memory from other workloads, but it WOULD use global reclaim in order to expand to the size of its readahead window.

Such "local page replacement" policies on a per-task level have had limited success in the past: "task" often does not correspond to "workload", so this could easily end up doing the wrong thing. Memory cgroups, on the other hand, ARE tasks grouped by workload. In addition, we now have the means to restrict a reclaim scan to file cache, thanks to the split LRU lists, and the means to detect cache thrashing, thanks to the refault information. With this new infrastructure in place, would it be a good idea to give local reclaim another shot as a memcg feature? Maybe even consider the memory equivalent of CONFIG_SCHED_AUTOGROUP?

Memcg upper & lower limit
-------------------------

Another idea to make memory cgroups more approachable for casual users would be to rethink the default behavior of the upper limit, which is currently a hard limit. This idea came from Tejun.

Currently, when the hard limit is reached and reclaim cannot make progress, a per-memcg OOM killer is invoked. For most usecases, again outside of pay-per-container situations, this seems quite harsh. It's likely that most people would rather set the limit a little lower than higher, have reclaim try to enforce it, but ultimately let the allocation pass: a best-effort measure. The OOM killer was written for situations of hopeless overcommit where the kernel simply has no other choice. Obviously this behavior should still be available for cases where the applications are untrusted, but maybe the default should be less extreme.

As to the lower limit, Michal already proposed this for discussion. This is about guaranteeing groups a minimum amount of memory. Here I would think we want behavior that is symmetrical to the upper limit, whatever we end up deciding on. But it's also likely that the default should be a soft failure mode of bypassing the limit rather than a hard failure mode that invokes the OOM killer or even panics the system.

In light of the cgroup interface being revamped entirely, it might make sense to discuss a new interface for the memory limits as well, and treat the lower and the upper limit as two halves of the same thing, unlike the current semantic nightmare of soft and hard limits.
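For the upper limit, the best-effort behavior could look roughly like the charge path sketched below. The helpers (memcg_over_limit(), memcg_try_reclaim(), memcg_commit_charge()) are made up for illustration; the only point is that reclaim is attempted but the charge goes through either way, with the current OOM-killing semantics remaining available as an opt-in for untrusted applications.

/*
 * Sketch of a best-effort upper limit, with made-up helpers: push the
 * group back toward its limit, but let the charge succeed either way
 * rather than invoking the per-memcg OOM killer.
 */
static int mem_cgroup_charge_besteffort(struct mem_cgroup *memcg,
					unsigned int nr_pages)
{
	if (memcg_over_limit(memcg, nr_pages))
		memcg_try_reclaim(memcg, nr_pages);

	/*
	 * Charge unconditionally.  A group configured with the hard
	 * (OOM-killing) semantics would fail here instead once reclaim
	 * stops making progress.
	 */
	memcg_commit_charge(memcg, nr_pages);
	return 0;
}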