(Renamed thread's title and invited Dave Chinner. A memory stressing program at http://marc.info/?l=linux-mm&m=141890469424353&w=2 can trigger stalls on a system with 4 CPUs/2048MB of RAM/no swap. I want to hear your opinion.) Michal Hocko wrote: > > My question is quite simple. How can we avoid memory allocation stalls when > > > > System has 2048MB of RAM and no swap. > > Memcg1 for task1 has quota 512MB and 400MB in use. > > Memcg2 for task2 has quota 512MB and 400MB in use. > > Memcg3 for task3 has quota 512MB and 400MB in use. > > Memcg4 for task4 has quota 512MB and 400MB in use. > > Memcg5 for task5 has quota 512MB and 1MB in use. > > > > and task5 launches below memory consumption program which would trigger > > the global OOM killer before triggering the memcg OOM killer? > > > [...] > > The global OOM killer will try to kill this program because this program > > will be using 400MB+ of RAM by the time the global OOM killer is triggered. > > But sometimes this program cannot be terminated by the global OOM killer > > due to XFS lock dependency. > > > > You can see what is happening from OOM traces after uptime > 320 seconds of > > http://I-love.SAKURA.ne.jp/tmp/serial-20141213.txt.xz though memcg is not > > configured on this program. > > This is clearly a separate issue. It is a lock dependency and that alone > _cannot_ be handled from OOM killer as it doesn't understand lock > dependencies. This should be addressed from the xfs point of view IMHO > but I am not familiar with this filesystem to tell you how or whether it > is possible. > Then, let's ask Dave Chinner whether he can address it. My opinion is that everybody is doing __GFP_WAIT memory allocation without understanding the entire dependencies. Everybody is only prepared for allocation failures because everybody is expecting that the OOM killer shall somehow solve the OOM condition (except that some are expecting that memory stress that will trigger the OOM killer must not be given). I am neither familiar with XFS, but I don't think this issue can be addressed from the XFS point of view. For example, https://lkml.org/lkml/2014/7/2/249 stalls at blk_rq_map_kern() which I'm suspecting it as one of causes of the stall due to happening inside disk I/O event of XFS partition. If XFS were responsible for avoiding stall at blk_rq_map_kern() (on the assumption that XFS triggered that disk I/O event), XFS (filesystem layer) somehow needs to drop __GFP_WAIT flag from scsi_execute() (SCSI layer). We will end up with passing gfp flags to every function which might do memory allocation. Is everybody happy with such code complication/bloat? ---------- int scsi_execute(struct scsi_device *sdev, const unsigned char *cmd, int data_direction, void *buffer, unsigned bufflen, unsigned char *sense, int timeout, int retries, u64 flags, int *resid) { struct request *req; int write = (data_direction == DMA_TO_DEVICE); int ret = DRIVER_ERROR << 24; req = blk_get_request(sdev->request_queue, write, __GFP_WAIT); if (IS_ERR(req)) return ret; blk_rq_set_block_pc(req); if (bufflen && blk_rq_map_kern(sdev->request_queue, req, buffer, bufflen, __GFP_WAIT)) goto out; req->cmd_len = COMMAND_SIZE(cmd[0]); memcpy(req->cmd, cmd, req->cmd_len); req->sense = sense; req->sense_len = 0; req->retries = retries; req->timeout = timeout; req->cmd_flags |= flags | REQ_QUIET | REQ_PREEMPT; /* * head injection *required* here otherwise quiesce won't work */ blk_execute_rq(req->q, NULL, req, 1); /* * Some devices (USB mass-storage in particular) may transfer * garbage data together with a residue indicating that the data * is invalid. Prevent the garbage from being misinterpreted * and prevent security leaks by zeroing out the excess data. */ if (unlikely(req->resid_len > 0 && req->resid_len <= bufflen)) memset(buffer + (bufflen - req->resid_len), 0, req->resid_len); if (resid) *resid = req->resid_len; ret = req->errors; out: blk_put_request(req); return ret; } ---------- By the way, if __GFP_WAIT requests had higher priority (lower or ignore the watermark?) than GFP_NOIO or GFP_NOFS or GFP_KERNEL requests, could blk_rq_map_kern() avoid the stall and allow XFS to proceed (and release XFS lock and terminate the OOM victim)? > > Somebody may set > > TIF_MEMDIE at oom_kill_process() even if we avoided setting TIF_MEMDIE at > > out_of_memory(). There will be more locations where TIF_MEMDIE is set; even > > out-of-tree modules might set TIF_MEMDIE. > > TIF_MEMDIE should be set only when we _know_ the task will free _some_ > memory and when we are killing the OOM victim. The only place I can see > that would break the first condition is out_of_memory for the current > which passed exit_mm(). That is the point why I've suggested you this > patch and it would be much more easier if we could simply finished that > one without pulling other things in. I agree that TIF_MEMDIE should be set only when we know the task will free some memory, but currently setting TIF_MEMDIE on the OOM victim is causing stalls which I want to analyze/debug via patchset posted at http://marc.info/?l=linux-mm&m=141671817211121&w=2 because we forever wait until the OOM victim terminates. In serial-20141213.txt.xz, TIF_MEMDIE was set on the OOM victim which is even unkillable by SysRq-f. > > Nonetheless, I don't think > > > > if (!task->mm && test_tsk_thread_flag(task, TIF_MEMDIE)) > > return true; > > > > check is perfect because we anyway need to prepare for both mm-less and > > with-mm cases. > > > > My concern is not "whether TIF_MEMDIE flag should be set or not". My concern > > is not "whether task->mm is NULL or not". My concern is "whether threads with > > TIF_MEMDIE flag retard other process' memory allocation or not". > > Above-mentioned program is an example of with-mm threads retarding > > other process' memory allocation. > > There is no way you can guarantee something like that. OOM is the _last_ > resort. Things are in a pretty bad state already when it hits. It is the > last attempt to reclaim some memory. System might be in an arbitrary > state at this time. > I really hate to repeat myself but you are trying to "fix" your problem > at a wrong level. I think that the OOM killer is responsible for killing the OOM condition or triggering kernel panic. I don't like that the OOM killer is failing to kill the OOM condition as it claims to be. > > > I know you don't like timeout approach, but adding > > > > if (sysctl_memdie_timeout_secs && test_tsk_thread_flag(task, TIF_MEMDIE) && > > time_after(jiffies, task->memdie_start + sysctl_memdie_timeout_secs * HZ)) > > return true; > > > > check to oom_unkillable_task() will take care of both mm-less and with-mm > > cases because everyone can safely skip the TIF_MEMDIE victim threads who > > cannot be terminated immediately for some reason. > > It will not take care of anything. It will start shooting to more > processes after some timeout, which is hard to get right, and there > wouldn't be any guaratee multiple victims will help because they might > end up blocking on the very same or other lock on the way out. If you don't like skip on timeout approach, I'm OK with triggering kernel panic on timeout approach. Analyzing vmcore will give us some hints about what was happening. > Jeez are > you even reading feedback you are getting? Of course, I'm reading your feedback. The "[RFC PATCH 0/5] mm: Patches for mitigating memory allocation stalls." will become unnecessary after all bugs are identified and fixed. I agree that bugs should be identified and fixed, but XFS stall is nothing but an example which I can reproduce on my desktop. My role is to analyze and respond to kernel troubles such as unexpected stalls, panics, reboots occurred on customer's servers which I don't have access. I will encounter various different troubles which I can't predict how to obtain information. Therefore, I want some unattended built-in assistance for understanding what was happening in chronological order and identifying/fixing the bugs. Existing built-in debugging hooks which requires administrator's operation might help after understanding what was happening. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>