Re: How to handle TIF_MEMDIE stalls?

"Theodore Ts'o" <tytso@xxxxxxx> · Tue, 24 Feb 2015 10:20:33 -0500

On Tue, Feb 24, 2015 at 08:20:11PM +0900, Tetsuo Handa wrote:
> > In a timeout based solution, this would be detected and another thread 
> > would be chosen for oom kill.  There's currently no way for the oom killer 
> > to select a process that isn't waiting for that same mutex, however.  If 
> > it does, then the process has been killed needlessly since it cannot make 
> > forward progress itself without grabbing the mutex.
> 
> Right. The OOM killer cannot understand that there is such lock dependency....

> The memory reserves are something like a balloon. To guarantee forward
> progress, the balloon must not become empty. All memory managing techniques
> except the OOM killer are trying to control "deflator of the balloon" via
> various throttling heuristics. On the other hand, the OOM killer is the only
> memory managing technique which is trying to control "inflator of the balloon"
> via several throttling heuristics.....

The mm developers have asked in the past whether we could solve
these problems by preallocating memory in advance.  Sometimes this is
very hard to do, either because we don't know exactly how much memory
we will need (or whether we will need any at all), or because we would
have to completely restructure the code, since the memory allocation
happens deep in the call stack, potentially in some other subsystem.

So I wonder if we can solve the problem by having a subsystem
reserving memory in advance of taking the mutexes.  We do something
like this in ext3/ext4 --- when we allocate a (sub-)transaction
handle, we give a worst case estimate of how many blocks we might need
to dirty under that handle, and if there isn't enough space in the
journal, we block in the start_handle() call while the current
transaction is closed, and the transaction handle will be attached to
the next transaction.
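
To make the analogy concrete, here is a rough userspace sketch of that
reservation pattern.  It is not the real ext4/jbd2 code; the toy_*
names and the single-threaded "blocking" are made up for illustration,
and the only point is the ordering: estimate first, wait if needed,
then proceed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model of a journal; not the jbd2 data structures. */
struct toy_journal {
	size_t capacity;	/* blocks one transaction may dirty */
	size_t reserved;	/* worst-case blocks promised to open handles */
	unsigned int tid;	/* id of the running transaction */
};

struct toy_handle {
	unsigned int tid;	/* transaction this handle is attached to */
	size_t credits;		/* worst-case estimate given at start time */
};

/*
 * Start a handle with a worst-case estimate of blocks to dirty.
 * Returns false if the estimate could never fit in one transaction.
 * Where the running transaction lacks room, a real implementation
 * would sleep until it has been closed; this model just closes it
 * and attaches the handle to the next transaction.
 */
static bool toy_start_handle(struct toy_journal *j, struct toy_handle *h,
			     size_t nblocks)
{
	if (nblocks > j->capacity)
		return false;

	if (j->reserved + nblocks > j->capacity) {
		/* the caller would block here while the transaction closes */
		j->tid++;
		j->reserved = 0;
	}

	j->reserved += nblocks;
	assert(j->reserved <= j->capacity);

	h->tid = j->tid;
	h->credits = nblocks;
	return true;
}
```

The caller never discovers halfway through an operation that the
journal is full; the only place it can wait is at the very start,
before it holds anything that others might need.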

In the memory allocation scenario, it's a bit more complicated, since
the memory might be allocated in a slab that requires a higher-order
page allocation.  But would it be sufficient to do something rough,
where the foreground kernel thread "reserves" a few pages before it
starts doing something that requires mutexes?  The reservation would
be tracked on an accounting basis, and a kernel codepath which has
reserved pages would get priority over kernel threads running under a
task_struct which has not reserved pages.  If the system doesn't
have enough pages available, then the reservation request would block
the process until more memory is available.
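
As a minimal sketch of what "on an accounting basis" could mean, assume
a pool that only tracks counters.  Everything here is hypothetical:
none of the toy_* names exist in the kernel, higher-order allocations
are ignored, and returning false stands in for putting the caller to
sleep.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical accounting state; not an existing mm structure. */
struct toy_pool {
	size_t free_pages;	/* pages currently available */
	size_t reserved;	/* pages promised but not yet allocated */
};

struct toy_task {
	size_t reserved;	/* pages this task still has reserved */
};

/*
 * Called before taking any mutex.  Returns false where a real
 * implementation would block until more memory is available.
 */
static bool toy_reserve_pages(struct toy_pool *p, struct toy_task *t,
			      size_t n)
{
	if (p->free_pages - p->reserved < n)
		return false;
	p->reserved += n;
	t->reserved += n;
	return true;
}

/*
 * Allocate one page.  A task with a reservation draws from it and
 * always succeeds; a task without one may only take pages that
 * nobody has reserved.
 */
static bool toy_alloc_page(struct toy_pool *p, struct toy_task *t)
{
	assert(p->reserved <= p->free_pages);

	if (t->reserved > 0) {
		t->reserved--;
		p->reserved--;
		p->free_pages--;
		return true;
	}
	if (p->free_pages == p->reserved)
		return false;	/* unreserved caller loses under pressure */
	p->free_pages--;
	return true;
}
```

In this model a task that uses up its reservation simply falls back to
the unreserved path, so an underestimate costs it priority but is not
treated as an error.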

This wouldn't necessarily help in cases where the memory is required
for cleaning dirty pages (although in those cases you really *do* want
to let the memory allocation succeed, so maybe there should be a way
to hint to the mm subsystem that a memory allocation should be given
higher priority since it might help get the system out of the jam that
it is in).

However, for "normal" operations, it would be good if we could provide
a certain amount of admission control when memory pressure is
especially high, by blocking a process which is about to execute, say,
a read(2) or an open(2) system call early, *before* it takes some
mutex.

Would this be a viable strategy?

Even if this was a hint that wasn't perfect (i.e., in some cases a
kernel thread might end up requiring more pages than it had hinted,
which would not be considered fatal, although the excess requested
pages would be treated the same way as if no reservation was made at
all, meaning the memory allocation would be more likely to fail and a
GFP_NOFAIL allocation would loop for longer), I would think this could
only help us do a better job of "keeping the balloon from getting
completely deflated".

Cheers,

						- Ted
