On Thu, May 25, 2023 at 1:19 AM 程垲涛 Chengkaitao Cheng
<chengkaitao@xxxxxxxxxxxxxx> wrote:
>
> At 2023-05-24 06:02:55, "Yosry Ahmed" <yosryahmed@xxxxxxxxxx> wrote:
> >On Sat, May 20, 2023 at 2:52 AM 程垲涛 Chengkaitao Cheng
> ><chengkaitao@xxxxxxxxxxxxxx> wrote:
> >>
> >> At 2023-05-20 06:04:26, "Yosry Ahmed" <yosryahmed@xxxxxxxxxx> wrote:
> >> >On Wed, May 17, 2023 at 10:12 PM 程垲涛 Chengkaitao Cheng
> >> ><chengkaitao@xxxxxxxxxxxxxx> wrote:
> >> >>
> >> >> At 2023-05-18 04:42:12, "Yosry Ahmed" <yosryahmed@xxxxxxxxxx> wrote:
> >> >> >On Wed, May 17, 2023 at 3:01 AM 程垲涛 Chengkaitao Cheng
> >> >> ><chengkaitao@xxxxxxxxxxxxxx> wrote:
> >> >> >>
> >> >> >> At 2023-05-17 16:09:50, "Yosry Ahmed" <yosryahmed@xxxxxxxxxx> wrote:
> >> >> >> >On Wed, May 17, 2023 at 1:01 AM 程垲涛 Chengkaitao Cheng
> >> >> >> ><chengkaitao@xxxxxxxxxxxxxx> wrote:
> >> >> >> >>
> >> >> >>
> >> >> >> Killing processes in order of memory usage cannot effectively
> >> >> >> protect important processes. Killing processes in a user-defined
> >> >> >> priority order will result in a large number of OOM events while
> >> >> >> still failing to release enough memory. I have been searching for
> >> >> >> a balance between the two methods, so that their shortcomings are
> >> >> >> not too obvious. The biggest advantage of memcg is its tree
> >> >> >> topology, and I also hope to make good use of it.
> >> >> >
> >> >> >For us, killing processes in a user-defined priority order works well.
> >> >> >
> >> >> >It seems like, to tune memory.oom.protect, you use oom_kill_inherit
> >> >> >to observe how many times a memcg has been killed due to a limit in
> >> >> >an ancestor. Wouldn't it be more straightforward to specify the
> >> >> >priority of protections among memcgs?
> >> >> >
> >> >> >For example, if you observe multiple memcgs being OOM killed due to
> >> >> >hitting an ancestor limit, you will need to decide which of them to
> >> >> >increase memory.oom.protect more for, based on their importance.
> >> >> >Otherwise, if you increase all of them, then there is no point if
> >> >> >all the memory is protected, right?
> >> >>
> >> >> If all memory in a memcg is protected, its meaning is similar to that
> >> >> of the highest-priority memcg in your approach: it is either killed
> >> >> last or never killed.
> >> >
> >> >Makes sense. I believe it gets a bit trickier when you want to
> >> >describe relative ordering between memcgs using memory.oom.protect.
> >>
> >> Actually, my original intention was not to use memory.oom.protect to
> >> achieve relative ordering between memcgs; that was just a feature that
> >> happened to be achievable. My initial idea was to protect a certain
> >> proportion of the memory in a memcg from being killed, so that through
> >> this method physical memory can be reasonably planned. Both the
> >> physical machine manager and the container manager can add some
> >> unimportant loads beyond the oom.protect limit, greatly improving the
> >> oversell ratio of memory. In the worst-case scenario, the physical
> >> machine can always provide all the memory limited by
> >> memory.oom.protect to the memcg.
> >>
> >> On the other hand, I also want to achieve relative ordering of the
> >> processes inside a memcg, not just a unified ordering of all memcgs
> >> on the physical machine.
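If I am reading the proportional protection idea above correctly, the
OOM killer would discount each memcg's protected bytes from its usage
when choosing a victim subtree. A toy userspace sketch of those
semantics as I understand them (illustrative Python with made-up
numbers and names, obviously not the kernel implementation):

class Memcg:
    def __init__(self, name, usage=0, protect=0, children=None):
        self.name = name
        self.usage = usage        # bytes charged to this subtree
        self.protect = protect    # memory.oom.protect, in bytes
        self.children = children or []

    def eligible(self):
        # Usage above the protected amount is what the OOM killer may take.
        return max(0, self.usage - self.protect)

def pick_victim_memcg(root):
    # Walk toward the child exposing the most unprotected memory.
    node = root
    while node.children:
        node = max(node.children, key=Memcg.eligible)
    return node

root = Memcg("root", usage=100, children=[
    Memcg("A", usage=60, protect=60),   # fully protected
    Memcg("B", usage=40, protect=10),   # 30 eligible
])
print(pick_victim_memcg(root).name)     # -> B, despite A's higher usage

Under that accounting, a fully protected memcg is never picked while a
sibling still exposes unprotected usage, which matches the oversell
story above.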
> >
> >For us, having a strict priority ordering-based selection is
> >essential. We have different tiers of jobs of different importance,
> >and a job of higher priority should not be killed before a lower
> >priority task if possible, no matter how much memory either of them
> >is using. Protecting memcgs solely based on their usage can be useful
> >in some scenarios, but not in a system where you have different tiers
> >of jobs running with strict priority ordering.
>
> If you want to run with strict priority ordering, it can also be
> achieved, but it may be quite troublesome. The directory structure
> shown below can achieve the goal.
>
>               root
>              /    \
>       cgroup A   cgroup B
>  (protect=max)   (protect=0)
>                  /       \
>           cgroup C     cgroup D
>      (protect=max)     (protect=0)
>                        /       \
>                 cgroup E     cgroup F
>            (protect=max)     (protect=0)
>
> OOM kill order: F > E > C > A

This requires restructuring the cgroup hierarchy, which comes with a
lot of other factors; I don't think that's practically an option.

> As mentioned earlier, "running with strict priority ordering" may
> raise some extreme issues that require the manager to make a choice.

We have been using strict priority ordering in our fleet for many years
now and we depend on it. Some jobs are simply more important than
others, regardless of their usage.

>
> >>
> >> >> >In this case, wouldn't it be easier to just tell the OOM killer the
> >> >> >relative priority among the memcgs?
> >> >> >
> >> >> >>
> >> >> >> >If this approach works for you (or any other audience), that's
> >> >> >> >great, I can share more details and perhaps we can reach
> >> >> >> >something that we can both use :)
> >> >> >>
> >> >> >> If you have a good idea, please share more details or show some
> >> >> >> code. I would greatly appreciate it.
> >> >> >
> >> >> >The code we have needs to be rebased onto a different version and
> >> >> >cleaned up before it can be shared, but essentially it is as
> >> >> >described:
> >> >> >
> >> >> >(a) All processes and memcgs start with a default score.
> >> >> >(b) Userspace can specify scores for memcgs and processes. A higher
> >> >> >score means higher priority (i.e., a lower score gets killed first).
> >> >> >(c) The OOM killer essentially looks for the memcg with the lowest
> >> >> >score to kill, then among this memcg, it looks for the process with
> >> >> >the lowest score. Ties are broken based on usage, so if all
> >> >> >processes/memcgs have the default score, we fall back to the
> >> >> >current OOM behavior.
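For concreteness, here is a rough userspace sketch of the selection I
described in (a)-(c). It is illustrative Python: the default score, the
names, and the exact tie-breaking details are made up for the example;
our actual code is kernel-side and, as I said, not yet in shareable
shape.

DEFAULT_SCORE = 100   # hypothetical default; everything starts here

class Proc:
    def __init__(self, pid, usage, score=DEFAULT_SCORE):
        self.pid, self.usage, self.score = pid, usage, score

class Memcg:
    def __init__(self, name, procs, score=DEFAULT_SCORE):
        self.name, self.procs, self.score = name, procs, score

    def usage(self):
        return sum(p.usage for p in self.procs)

def pick_victim(memcgs):
    # Lowest score dies first; ties break toward higher usage, so with
    # all-default scores this degenerates to the current OOM behavior.
    memcg = min(memcgs, key=lambda m: (m.score, -m.usage()))
    proc = min(memcg.procs, key=lambda p: (p.score, -p.usage))
    return memcg.name, proc.pid

tiers = [
    Memcg("batch",   [Proc(101, usage=50), Proc(102, usage=10)], score=10),
    Memcg("serving", [Proc(201, usage=500)],                     score=90),
]
print(pick_victim(tiers))   # -> ('batch', 101): priority wins over usage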
> >> >>
> >> >> If memory is severely oversold, all processes of the lowest-priority
> >> >> memcg may be killed before any process of another memcg is selected.
> >> >> If there are 1000 processes with almost zero memory usage in the
> >> >> lowest-priority memcg, 1000 pointless kill events may occur. To
> >> >> avoid this situation, even for the lowest-priority memcg, I would
> >> >> leave it a very small oom.protect quota.
> >> >
> >> >I checked internally, and this is indeed something that we see from
> >> >time to time. We try to avoid that with userspace OOM killing, but
> >> >it's not 100% effective.
> >> >
> >> >>
> >> >> If faced with two memcgs with the same total memory usage and
> >> >> priority, where memcg A has more processes but less memory usage
> >> >> per process, and memcg B has fewer processes but more memory usage
> >> >> per process, then when OOM occurs, the processes in memcg B may
> >> >> keep being killed until all processes in memcg B are dead, which
> >> >> is unfair to memcg B because memcg A also occupies a large amount
> >> >> of memory.
> >> >
> >> >I believe in this case we will kill one process in memcg B, then the
> >> >usage of memcg A will become higher, so we will pick a process from
> >> >memcg A next.
> >>
> >> Suppose there is only one process in memcg A and its memory usage is
> >> higher than that of any single process in memcg B, but the total
> >> memory usage of memcg A is lower than that of memcg B. In this case,
> >> if the OOM killer still chooses the process in memcg A, it may be
> >> unfair to memcg A.
> >>
> >> >> Does your approach have these issues? Killing processes in a
> >> >> user-defined priority order is indeed easier and can work well in
> >> >> most cases, but I have been trying to solve the cases that it
> >> >> cannot cover.
> >> >
> >> >The first issue can also happen with our approach. Let me dig up more
> >> >info from our internal teams and get back to you with more details.

>
> --
> Thanks for your comment!
> chengkaitao
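To make the back-and-forth we discussed above concrete, here is a toy
model of pure usage-based selection (again illustrative Python with
made-up numbers; real victim selection involves much more state). It
shows the rebalancing I described: after one kill in the bigger memcg,
selection swings back to the other one.

def oom_kill_once(memcgs):
    # Pick the memcg with the highest total usage, then its biggest process.
    memcg = max(memcgs, key=lambda m: sum(m["procs"]))
    victim = max(memcg["procs"])
    memcg["procs"].remove(victim)
    return memcg["name"], victim

memcgs = [
    {"name": "A", "procs": [30, 25, 25]},   # many mid-sized processes
    {"name": "B", "procs": [60, 30]},       # fewer, bigger processes
]
for _ in range(3):
    print(oom_kill_once(memcgs))
# -> ('B', 60), then ('A', 30) once A's 80 exceeds B's 30, then ('A', 25)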