On Tue 13-06-23 01:36:51, Yosry Ahmed wrote:
> +David Rientjes
>
> On Tue, Jun 13, 2023 at 1:27 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
> >
> > On Sun 04-06-23 01:25:42, Yosry Ahmed wrote:
> > [...]
> > > There has been a parallel discussion in the cover letter thread of v4
> > > [1]. To summarize, at Google, we have been using OOM scores to
> > > describe different job priorities in a more explicit way -- regardless
> > > of memory usage. It is strictly priority-based OOM killing. Ties are
> > > broken based on memory usage.
> > >
> > > We understand that something like memory.oom.protect has an advantage
> > > in the sense that you can skip killing a process if you know that it
> > > won't free enough memory anyway, but for an environment where multiple
> > > jobs of different priorities are running, we find it crucial to be
> > > able to define a strict ordering. Some jobs are simply more important
> > > than others, regardless of their memory usage.
> >
> > I do remember that discussion. I am not a great fan of simple
> > priority-based interfaces TBH. It sounds like an easy interface but it
> > hits complications as soon as you try to define proper/sensible
> > hierarchical semantics. I can see how they might work on leaf memcgs
> > with statically assigned priorities, but that sounds like a very narrow
> > usecase IMHO.
>
> Do you mind elaborating on the problem with the hierarchical semantics?

Well, let me be more specific. If you have a simple hierarchical numeric
enforcement (assume a higher priority is more likely to be chosen and the
effective priority is max(self, max(parents))), then the semantics
themselves are straightforward. I am not really sure about the practical
manageability though. I have a hard time imagining priority assignment
on something like a shared workload with a more complex hierarchy. For
example:

               root
              /  |  \
        cont_A cont_B cont_C

with each container running its workload with its own hierarchy
structures that might be rather dynamic during their lifetime. In order
to have a predictable OOM behavior you need to watch and reassign
priorities all the time, no?

> The way it works with our internal implementation is (imo) sensible
> and straightforward from a hierarchy POV. Starting at the OOM memcg
> (which can be root), we recursively compare the OOM scores of the
> children memcgs and pick the one with the lowest score, until we
> arrive at a leaf memcg.

This approach puts a strong requirement on the memcg hierarchy
organization. Siblings have to be directly comparable because you cut
off many potential sub-trees this way (e.g. is it easy to tell whether
you want to rule out all system or user slices?).

I can imagine usecases where this could work reasonably well, e.g. a set
of workers of different priorities, all of them running under a shared
memcg parent. But more involved hierarchies seem harder because you
always have to keep in mind how the hierarchy is organized to get to
your desired victim.
-- 
Michal Hocko
SUSE Labs
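
[Editorial illustration] Below is a minimal user-space sketch of the recursive
selection walk Yosry describes above: starting at the memcg that hit its limit,
descend into the lowest-scored child at every level until a leaf is reached.
The struct layout, field names, and scores are invented purely for illustration;
they do not correspond to the kernel's memcg code or to the internal
implementation discussed in the thread.

	/*
	 * Illustrative sketch only (not kernel code): pick an OOM victim
	 * group by walking down from the OOM memcg, choosing the child
	 * with the lowest score at each level until a leaf is reached.
	 */
	#include <stdio.h>
	#include <stddef.h>

	struct cg {
		const char *name;
		int score;		/* hypothetical per-memcg OOM score */
		struct cg *children;	/* first child */
		struct cg *sibling;	/* next sibling */
	};

	static struct cg *pick_victim(struct cg *oom_root)
	{
		struct cg *pos = oom_root;

		while (pos->children) {
			struct cg *child, *lowest = pos->children;

			/* compare siblings and descend into the lowest score */
			for (child = pos->children; child; child = child->sibling)
				if (child->score < lowest->score)
					lowest = child;

			/* ties would be broken by memory usage per the thread */
			pos = lowest;
		}
		return pos;
	}

	int main(void)
	{
		/* root -> {cont_A, cont_B, cont_C}, cont_B -> {worker_1, worker_2} */
		struct cg w2   = { "worker_2", 30, NULL, NULL };
		struct cg w1   = { "worker_1", 10, NULL, &w2 };
		struct cg c    = { "cont_C",   50, NULL, NULL };
		struct cg b    = { "cont_B",   20, &w1,  &c };
		struct cg a    = { "cont_A",   40, NULL, &b };
		struct cg root = { "root",      0, &a,   NULL };

		printf("victim memcg: %s\n", pick_victim(&root)->name); /* worker_1 */
		return 0;
	}

The sketch also makes Michal's objection concrete: the walk only ever compares
siblings, so whether cont_A's subtree is considered at all depends entirely on
how cont_A scores against cont_B and cont_C at the top level.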