On Thu 15-04-21 15:31:46, Tim Chen wrote:
>
>
> On 4/9/21 12:24 AM, Michal Hocko wrote:
> > On Thu 08-04-21 13:29:08, Shakeel Butt wrote:
> >> On Thu, Apr 8, 2021 at 11:01 AM Yang Shi <shy828301@xxxxxxxxx> wrote:
> > [...]
> >>> The low priority jobs should be able to be restricted by cpuset, for
> >>> example, just keep them on second tier memory nodes. Then all the
> >>> above problems are gone.
> >
> > Yes, if the aim is to isolate some users from certain numa node then
> > cpuset is a good fit but as Shakeel says this is very likely not what
> > this work is aiming for.
> >
> >> Yes that's an extreme way to overcome the issue but we can do less
> >> extreme by just (hard) limiting the top tier usage of low priority
> >> jobs.
> >
> > Per numa node high/hard limit would help with a more fine grained control.
> > The configuration would be tricky though. All low priority memcgs would
> > have to be carefully configured to leave enough for your important
> > processes. That includes also memory which is not accounted to any
> > memcg.
> > The behavior of those limits would be quite tricky for OOM situations
> > as well due to a lack of NUMA aware oom killer.
> >
>
> Another downside of putting limits on individual NUMA
> node is it would limit flexibility.

Let me just clarify one thing. I haven't been proposing per-NUMA-node
limits. As I've said above, they would be quite tricky to configure and
their behavior would be tricky as well. All I am saying is that we do
not want an interface that is tightly bound to a specific HW setup
(fast RAM as a top tier and PMEM as a fallback), which is what you have
proposed here. We want a generic NUMA-based abstraction. What that
abstraction is going to look like is an open question, and it really
depends on the use cases that we expect to see.
-- 
Michal Hocko
SUSE Labs