On Mon, Mar 02, 2015 at 05:58:23PM +0100, Michal Hocko wrote:
> On Mon 02-03-15 11:39:13, Theodore Ts'o wrote:
> > On Mon, Mar 02, 2015 at 04:18:32PM +0100, Michal Hocko wrote:
> > > The idea is sound. But I am pretty sure we will find many corner
> > > cases. E.g. what if the mere reservation attempt causes the system
> > > to go OOM and trigger the OOM killer?
> >
> > Doctor, doctor, it hurts when I do that....
> >
> > So don't trigger the OOM killer. We can let the caller decide whether
> > the reservation request should block or return ENOMEM, but the whole
> > point of the reservation request idea is that this happens *before*
> > we've taken any mutexes, so blocking won't prevent forward progress.
>
> Maybe I wasn't clear. I wasn't concerned about the context which
> is doing the reservation. I was more concerned about all the other
> allocation requests which might fail now (because they do not have
> access to the reserves). So you think that we should simply disable the
> OOM killer while there is any reservation active? Wouldn't that be even
> more fragile when something goes terribly wrong?

That's a silly strawman. Why wouldn't you simply block those allocations
until the reserve is released when the transaction completes and the
unused memory goes back to the free pool?

Let me try another tack. My qualifications are as a distributed control
system engineer, not a computer scientist. I see everything as a system
of interconnected feedback loops: an operating system is nothing but a
set of very complex, tightly interconnected control systems.

Precedent? IO-less dirty throttling - that came about after I'd been
advocating a control-theory-based algorithm for several years to solve
the breakdown problems of dirty page throttling. We look at the code
Fengguang Wu wrote as one of the major success stories of Linux - the
writeback code just works and nobody ever has to tune it anymore.

I see the problem of direct memory reclaim as being very similar to the
problems the old IO-based write throttling had: it has unbounded
concurrency, severe unfairness, and it breaks down badly when heavily
loaded. As a control system, it has the same terrible design as the
IO-based write throttling had.

There are many other similarities, too. Allocation can only take place
at the rate at which reclaim occurs, and we only have a limited budget
of allocatable pages. This is the same as dirty page throttling -
dirtying pages is limited to the rate at which we can clean pages, and
there is a limited budget of dirty pages in the system.

Reclaiming pages is also done most efficiently by a single thread per
zone that can keep lots of internal context (kswapd). This is similar
to the way optimal writeback of dirty pages requires a single thread
with internal context per block device.

Waiting for free pages to arrive can be done by an ordered queueing
system, and we can account for the number of pages each allocation
requires in that queueing system, and hence only need to wake the
number of waiters that will consume the memory just freed - just like
we do with the dirty page throttling queue.

As such, the same solutions could be applied. As allocation demand
exceeds the supply of free pages, we throttle allocation by sleeping on
an ordered queue and only waking waiters at the rate at which kswapd
reclaim can free pages. It's trivial to account accurately, and the
feedback loop is relatively simple, too. We can also easily maintain a
reserve of free pages this way, usable only by allocations marked with
special flags.
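To make the shape of that concrete, here's a minimal userspace sketch of
what queue-based allocation throttling with a flagged reserve might look
like. It is not kernel code and not a proposal for the actual interface -
the names (alloc_pages_queued(), reclaim_pages(), RESERVE_PAGES) and the
pthread locking are all invented for illustration, standing in for the
allocator fast path, the ordered wait queue and kswapd wakeups described
above:

/*
 * Sketch only: a userspace model of the ordered-queue allocation
 * throttling described above, not actual kernel code.  All names here
 * (alloc_pages_queued, reclaim_pages, RESERVE_PAGES, etc.) are made up
 * for illustration; pthreads stand in for the kernel's wait queues and
 * the reclaim_pages() caller stands in for kswapd.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

#define RESERVE_PAGES	64	/* pages only reserve-flagged allocations may use */

struct waiter {
	long		nr_pages;	/* pages this request needs */
	bool		use_reserve;	/* allowed to dip into the reserve */
	bool		granted;
	pthread_cond_t	cv;
	struct waiter	*next;
};

static pthread_mutex_t	pool_lock = PTHREAD_MUTEX_INITIALIZER;
static long		free_pages = 1024;	/* current free pool */
static struct waiter	*queue_head, *queue_tail;

/* pages available to this allocation, honouring the reserve watermark */
static long available(bool use_reserve)
{
	return use_reserve ? free_pages : free_pages - RESERVE_PAGES;
}

/* wake waiters in FIFO order, but only as many as freed pages can satisfy */
static void grant_waiters(void)
{
	while (queue_head &&
	       available(queue_head->use_reserve) >= queue_head->nr_pages) {
		struct waiter *w = queue_head;

		queue_head = w->next;
		if (!queue_head)
			queue_tail = NULL;
		free_pages -= w->nr_pages;
		w->granted = true;
		pthread_cond_signal(&w->cv);
	}
}

/* allocate nr_pages, sleeping on the ordered queue if the pool is short */
static void alloc_pages_queued(long nr_pages, bool use_reserve)
{
	pthread_mutex_lock(&pool_lock);
	if (!queue_head && available(use_reserve) >= nr_pages) {
		/* fast path: enough free pages, no throttling needed */
		free_pages -= nr_pages;
	} else {
		/* slow path: join the FIFO queue and wait for reclaim */
		struct waiter w = {
			.nr_pages = nr_pages,
			.use_reserve = use_reserve,
		};

		pthread_cond_init(&w.cv, NULL);
		if (queue_tail)
			queue_tail->next = &w;
		else
			queue_head = &w;
		queue_tail = &w;
		while (!w.granted)
			pthread_cond_wait(&w.cv, &pool_lock);
		pthread_cond_destroy(&w.cv);
	}
	pthread_mutex_unlock(&pool_lock);
}

/* stand-in for kswapd: add reclaimed pages and wake eligible waiters */
static void reclaim_pages(long nr_reclaimed)
{
	pthread_mutex_lock(&pool_lock);
	free_pages += nr_reclaimed;
	grant_waiters();
	pthread_mutex_unlock(&pool_lock);
}

int main(void)
{
	alloc_pages_queued(512, false);	/* fits above the reserve: no wait */
	reclaim_pages(256);		/* would wake queued waiters in order */
	printf("free pages left: %ld\n", free_pages);
	return 0;
}

The point of the ordered queue is fairness and accuracy: waiters are
woken in arrival order, and only once reclaim has actually produced the
pages they asked for, so allocation latency degrades predictably instead
of collapsing under load.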
The reserve threshold can be dynamic, and tasks that request it to
change can be blocked until the reserve has been built up to meet the
caller's requirements. Allocations that are allowed to dip into the
reserve may do so rather than being added to the queue that waits for
reclaim. Reclaim would always fill the reserve back up to its limit
first, and tasks that have reservations can release them gradually as
they mark them as consumed by the reservation context (e.g. when a
filesystem joins an object to a transaction and modifies it), thereby
reducing the reserve that task has available as it progresses.

So, there's yet another possible solution to the allocation reservation
problem, and one that solves several other problems that have been
described as making reservation pools difficult or even impossible to
implement.

Seriously, I'm not expecting this problem to be solved tomorrow; what I
want is reliable, deterministic memory allocation behaviour from the mm
subsystem. I want people to be thinking about how to achieve that rather
than limiting their solutions to what we have now and can hack into the
current code, because otherwise we'll never end up with a reliable
memory allocation reservation system....

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx