David Rientjes wrote: > Perhaps we should consider an alternative: allow threads, such as TaskA, > that are deferring for a long amount of time to simply allocate with > ALLOC_NO_WATERMARKS itself in that scenario in the hope that the > allocation succeeding will eventually allow it to drop the mutex. Two > problems: (1) there's no guarantee that the simple allocation is all TaskA > needs before it will drop the lock and (2) another thread could > immediately grab the same mutex and allocate, in which the same series of > events repeats. We can see that effectively GFP_NOFAIL allocations with a lock held (e.g. filesystem transaction) exist, can't we? ---------------------------------------- TaskA TaskB TaskC TaskD TaskE call mutex_lock() call mutex_lock() call mutex_lock() call mutex_lock() call mutex_lock() do GFP_NOFAIL allocation oom kill TaskA waiting for TaskA to die will do something with allocated memory will call mutex_unlock() will do GFP_NOFAIL allocation will wait for TaskA to die will do something with allocated memory will call mutex_unlock() will do GFP_NOFAIL allocation will wait for TaskA to die will do something with allocated memory will call mutex_unlock() will do GFP_NOFAIL allocation will wait for TaskA to die will do something with allocated memory will call mutex_unlock() will do GFP_NOFAIL allocation ---------------------------------------- Allowing ALLOC_NO_WATERMARKS to TaskB helps nothing. We don't want to allow ALLOC_NO_WATERMARKS to TaskC, TaskD, TaskE and TaskA when they do the same sequence TaskB did, or we will deplete memory reserves. > In a timeout based solution, this would be detected and another thread > would be chosen for oom kill. There's currently no way for the oom killer > to select a process that isn't waiting for that same mutex, however. If > it does, then the process has been killed needlessly since it cannot make > forward progress itself without grabbing the mutex. Right. The OOM killer cannot understand that there is such lock dependency. And do you think there will be a way available for the OOM killer to select a process that isn't waiting for that same mutex in the neare future? (Remembering mutex's address currently waiting for using "struct task_struct" would do, but will not be accepted due to performance penalty. Simplified form would be to check "struct task_struct"->state , but will not be perfect.) > Certainly, it would be better to eventually kill something else in the > hope that it does not need the mutex and will free some memory which would > allow the thread that had originally been deferring forever, TaskA, in the > oom killer waiting for the original victim, TaskB, to exit. If that's the > solution, then TaskA had been killed unnecessarily itself. Complaining about unnecessarily killed processes is preventing us from making forward progress. The memory reserves are something like a balloon. To guarantee forward progress, the balloon must not become empty. All memory managing techniques except the OOM killer are trying to control "deflator of the balloon" via various throttling heuristics. On the other hand, the OOM killer is the only memory managing technique which is trying to control "inflator of the balloon" via several throttling heuristics. The OOM killer is invoked when all memory managing techniques except the OOM killer failed to make forward progress. Therefore, the OOM killer is responsible for making forward progress for "deflator of the balloon" and is granted the prerogative to send SIGKILL to any process. Given the fact that the OOM killer cannot understand lock dependency and there are effectively GFP_NOFAIL allocations, it is inevitable that the OOM killer fails to choose one correct process that will make forward progress. Currently the OOM killer is invoked as one shot mode. This mode helps us to reduce the possibility of depleting the memory reserves and killing processes unnecessarily. But this mode is bothering people with "silently stalling forever" problem when the bullet from the OOM killer missed the target. This mode is also bothering people with "complete system crash" problem when the bullet from SysRq-f missed the target, for they have to use SysRq-i or SysRq-c or SysRq-b which is far more unnecessary kill of processes in order to solve the OOM condition. My proposal is to allow the OOM killer be invoked as consecutive shots mode. Although consecutive shots mode may increase possibility of killing processes unnecessarily, trying to kill an unkillable process in one shot mode is after all unnecessary kill of processes. The root cause is the same (i.e. the OOM killer cannot understand the dependency). My patch can stop bothering people with "silently stalling forever" / "complete system crash" problems by retrying the oom kill attempt than wait forever. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>