Re: How to handle TIF_MEMDIE stalls?

Johannes Weiner <hannes@xxxxxxxxxxx> · Mon, 2 Mar 2015 11:05:37 -0500

On Mon, Mar 02, 2015 at 04:18:32PM +0100, Michal Hocko wrote:
> On Mon 23-02-15 11:45:21, Dave Chinner wrote:
> [...]
> > A reserve memory pool is no different - every time a memory reserve
> > occurs, a watermark is lifted to accommodate it, and the transaction
> > is not allowed to proceed until the amount of free memory exceeds
> > that watermark. The memory allocation subsystem then only allows
> > *allocations* marked correctly to allocate pages from the
> > reserve that watermark protects. e.g. only allocations using
> > __GFP_RESERVE are allowed to dip into the reserve pool.
> 
> The idea is sound. But I am pretty sure we will find many corner
> cases. E.g. what if the mere reservation attempt causes the system
> to go OOM and trigger the OOM killer? Sure that wouldn't be too much
> different from the OOM triggered during the allocation but there is one
> major difference. Reservations need to be estimated and I expect the
> estimation would be on the more conservative side and so the OOM might
> not happen without them.

The whole idea is that filesystems request the reserves while they can
still sleep for progress or fail the macro-operation with -ENOMEM.

And the estimate wouldn't just be on the conservative side, it would
have to be the worst-case scenario.  If we run out of reserves in an
allocation that cannot fail, that would be a bug that can lock up the
machine.  We can then fall back to the OOM killer in a last-ditch
effort to make forward progress, but as the victim tasks can get stuck
behind state/locks held by the allocation side, the machine might lock
up after all.
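
A rough sketch of that ordering, as a toy userspace model rather than
kernel code.  Every name here (struct mem_pool, reserve_pages(),
alloc_reserved_page(), unreserve_pages()) is hypothetical and only
illustrates the idea: the worst-case reservation is taken up front,
where failing with -ENOMEM is still fine, and allocations inside the
transaction only draw on what was reserved.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical toy model; none of these names are kernel APIs. */
struct mem_pool {
	long free_pages;	/* pages currently free */
	long min_wmark;		/* baseline watermark */
	long reserved;		/* sum of outstanding worst-case reservations */
};

/*
 * Called at the start of a macro-operation, while the caller can still
 * sleep or fail.  The reservation effectively lifts the watermark; if
 * free memory does not cover it, fail the whole operation.  A real
 * implementation could sleep and wait for reclaim here instead.
 */
static int reserve_pages(struct mem_pool *p, long worst_case)
{
	if (p->free_pages < p->min_wmark + p->reserved + worst_case)
		return -ENOMEM;
	p->reserved += worst_case;
	return 0;
}

/* Allocation inside the transaction: takes one page from the reservation. */
static bool alloc_reserved_page(struct mem_pool *p, long *my_reserve)
{
	if (*my_reserve == 0)
		return false;	/* estimate was not worst-case: the bug case */
	assert(p->reserved > 0);
	(*my_reserve)--;
	p->reserved--;
	p->free_pages--;
	return true;
}

/* End of the transaction: hand back whatever was not used. */
static void unreserve_pages(struct mem_pool *p, long unused)
{
	assert(unused <= p->reserved);
	p->reserved -= unused;
}
```

The false return from alloc_reserved_page() is the case described
above: the reservation was smaller than the real worst case, and there
is no good way left to recover.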

> > By using watermarks, freeing of memory will automatically top
> > up the reserve pool which means that we guarantee that reclaimable
> > memory allocated for demand paging during transactions doesn't
> > deplete the reserve pool permanently.  As a result, when there is
> > plenty of free and/or reclaimable memory, the reserve pool
> > watermarks will have almost zero impact on performance and
> > behaviour.
> 
> Typical busy system won't be very far away from the high watermark
> so there would be a reclaim performed during increased watermarks
> (aka reservation) and that might lead to visible performance
> degradation. This might be acceptable but it also adds a certain level
> of unpredictability when performance characteristics might change
> suddenly.

There is usually a good deal of clean cache.  As Dave pointed out
before, clean cache can be considered re-allocatable from NOFS
contexts, and so we'd only have to maintain this invariant:

	min_wmark + private_reserves < free_pages + clean_cache
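
Spelled out as a check, assuming the four quantities are available as
plain page counts (struct zone_counts and both helpers are made-up
names for illustration, not existing kernel interfaces):

```c
#include <stdbool.h>

/* Hypothetical snapshot of the counters the invariant talks about. */
struct zone_counts {
	long min_wmark;		/* regular minimum watermark */
	long private_reserves;	/* sum of all outstanding reservations */
	long free_pages;	/* free pages */
	long clean_cache;	/* clean page cache, reclaimable from NOFS */
};

/* True while min_wmark + private_reserves < free_pages + clean_cache. */
static bool reserves_backed(const struct zone_counts *z)
{
	return z->min_wmark + z->private_reserves <
	       z->free_pages + z->clean_cache;
}

/* Would granting nr_pages more of reserve still keep the invariant? */
static bool may_grant_reserve(const struct zone_counts *z, long nr_pages)
{
	struct zone_counts after = *z;

	after.private_reserves += nr_pages;
	return reserves_backed(&after);
}
```

With this, a reservation only forces reclaim once free pages plus clean
cache can no longer cover the watermark and the reserves together.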

> > Further, because it's just accounting and behavioural thresholds,
> > this allows the mm subsystem to control how the reserve pool is
> > accounted internally. e.g. clean, reclaimable pages in the page
> > cache could serve as reserve pool pages as they can be immediately
> > reclaimed for allocation.
> 
> But they can also become hard or impossible to reclaim. Clean
> pages might get dirtied, and swap-backed pages can run out of
> backing storage. So I guess we cannot count on those pages without
> reclaiming them first and hiding them in the reserve. Which is
> probably what you suggest below, but I wasn't really sure...

Pages reserved for use by the page cleaning path can't be considered
dirtyable.  They have to be included in the dirty_balance_reserve.
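
In other words, something along these lines.  This is a simplified
sketch, not the actual dirty throttling code; dirtyable_pages() and its
parameters are invented for illustration, and the only assumption is
that the private reserves get folded into dirty_balance_reserve before
the dirtyable amount is computed:

```c
/*
 * Hypothetical helper: pages that may be dirtied, with the reserve
 * (including any private reserves folded into it) taken out first.
 */
static long dirtyable_pages(long free_pages, long file_pages,
			    long dirty_balance_reserve)
{
	long total = free_pages + file_pages;

	/* Never report a negative amount of dirtyable memory. */
	if (dirty_balance_reserve >= total)
		return 0;
	return total - dirty_balance_reserve;
}
```

That way the dirty limits are computed against memory that excludes the
reserves, so the pages the cleaning path depends on cannot themselves
end up dirty.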

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .