On Fri 09-12-22 05:07:15, 程垲涛 Chengkaitao Cheng wrote:
> At 2022-12-08 22:23:56, "Michal Hocko" <mhocko@xxxxxxxx> wrote:
[...]
> >oom killer is a memory reclaim of the last resort. So yes, there is some
> >difference but fundamentally it is about releasing some memory. And long
> >term we have learned that the more clever it tries to be the more likely
> >corner cases can happen. It is simply impossible to know the best
> >candidate so this is a just a best effort. We try to aim for
> >predictability at least.
>
> Is the current oom_score strategy predictable? I don't think so. The score_adj
> has broken the predictability of oom_score (it is no longer simply killing the
> process that uses the most mems).

oom_score as reported to userspace already takes oom_score_adj into
account, which means that you can compare processes and get a reasonable
guess at what the current oom victim would be. There is a certain level
of fuzz because this is not atomic, and there is no clear candidate when
multiple processes have an equal score. So yes, it is not 100%
predictable. memory.reclaim as you propose doesn't change that though.

Is oom_score_adj a good interface? No, not really. If I could go back in
time I would nack it, but here we are. We have an interface that promises
quite a lot but essentially supports only two usecases
(OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX) reliably. Everything in between is
clumsy at best, because a real userspace oom policy would have to
re-evaluate the whole oom domain (be it global or memcg oom) as the
memory consumption evolves over time. I am really worried that your
memory.oom.protection is heading down a very similar trajectory, because
protection really needs to consider other memcgs to balance properly.

[...]

> > But I am really open
> >to be convinced otherwise and this is in fact what I have been asking
> >for since the beginning. I would love to see some examples on the
> >reasonable configuration for a practical usecase.
>
> Here is a simple example. In a docker container, users can divide all processes
> into two categories (important and normal), and put them in different cgroups.
> One cgroup's oom.protect is set to "max", the other is set to "0". In this way,
> important processes in the container can be protected.

That is effectively oom_score_adj = OOM_SCORE_ADJ_MIN - 1 for all
processes in the important group. I would argue you can achieve a very
similar result by having the process launcher set oom_score_adj and
letting all processes in that important container inherit it. You do not
need any memcg tunable for that.

I am really much more interested in examples where the protection needs
to be fine-tuned.
-- 
Michal Hocko
SUSE Labs
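
The launcher-based alternative described above can be sketched roughly
as follows. This is only an illustration under stated assumptions: the
helper names set_oom_score_adj() and launch_protected() are hypothetical,
and the sketch relies on the documented /proc/self/oom_score_adj file,
whose value is kept across exec and inherited by forked children.
Lowering the value requires CAP_SYS_RESOURCE.

```python
import os

OOM_SCORE_ADJ_MIN = -1000  # kernel constant: task is exempt from OOM kill
OOM_SCORE_ADJ_MAX = 1000   # kernel constant: task is the preferred victim


def set_oom_score_adj(value, path="/proc/self/oom_score_adj"):
    """Write an oom_score_adj value for the calling process.

    `path` is a parameter only so the helper can be exercised against a
    scratch file; real use keeps the default.
    """
    if not OOM_SCORE_ADJ_MIN <= value <= OOM_SCORE_ADJ_MAX:
        raise ValueError(f"oom_score_adj out of range: {value}")
    # Lowering the value below its current setting needs CAP_SYS_RESOURCE.
    with open(path, "w") as f:
        f.write(str(value))


def launch_protected(argv, value=OOM_SCORE_ADJ_MIN):
    """Hypothetical launcher: set the adjustment, then exec the workload.

    The setting survives exec and is copied to children on fork, so every
    process the workload starts afterwards carries the same value.
    """
    set_oom_score_adj(value)
    os.execvp(argv[0], argv)  # does not return on success
```

A container entry point could call launch_protected() with the command
line of the important workload, while normal workloads are started
without it. This covers only the all-or-nothing case; it says nothing
about fine-tuned protection.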