Hello, Shakeel.

On Thu, Dec 21, 2017 at 07:22:20AM -0800, Shakeel Butt wrote:
> I am claiming memory allocations under global pressure will be
> affected by the performance of the underlying swap device. However
> memory allocations under memcg memory pressure, with memsw, will not
> be affected by the performance of the underlying swap device. A job
> having 100 MiB limit running on a machine without global memory
> pressure will never see swap on hitting 100 MiB memsw limit.

But, without global memory pressure, the swap wouldn't be making any
difference to begin with.  Also, when multiple cgroups are hitting
memsw limits, they'd behave as if swappiness is zero, increasing load
on the filesystems, which then of course will affect everyone under
memory pressure whether memsw or not.

> > On top of that, what's the point?
> >
> > 1. As I wrote earlier, given the current OOM killer implementation,
> >    whether OOM kicks in or not is not even that relevant in
> >    determining the health of the workload. There are frequent failure
> >    modes where OOM killer fails to kick in while the workload isn't
> >    making any meaningful forward progress.
> >
>
> Deterministic oom-killer is not the point. The point is to
> "consistently limit the anon memory" allocated by the job which only
> memsw can provide. A job owner who has requested 100 MiB for a job
> sees some instances of the job suffer at 100 MiB and other instances
> suffer at 150 MiB, is an inconsistent behavior.

So, the first part, I get.  memsw happens to be able to limit the
amount of anon memory.  I really don't think that was the intention
but more of a byproduct that some people might find useful.

The example you listed tho doesn't make much sense to me.  Given two
systems with differing levels of memory pressure, two instances can
see wildly different performance regardless of memsw.

> > 2. On hitting memsw limit, the OOM decision is dependent on the
> >    performance of the file backing devices.
> >    Why is that necessarily
> >    better than being dependent on swap or both, which would increase
> >    the reclaim efficiency anyway? You can't avoid being affected by
> >    the underlying hardware one way or the other.
>
> This is a separate discussion but still the amount of file backed
> pages is known and controlled by the job owner and they have the
> option to use a storage service, providing a consistent performance
> across different data centers, instead of the physical disks of the
> system where the job is running and thus isolating the job's
> performance from the speed of the local disk. This is not possible
> with swap. The swap (and its performance) is and should be transparent
> to the job owners.

And, for your use case, there is a noticeable difference between file
backed and anonymous memory, and that's why you want to limit
anonymous memory independently from file backed memory.

It looks like what you actually want is limiting the amount of
anonymous memory independently from file-backed consumption because,
in your setup, while swap is always on local disk, the file storage is
over the network and more configurable / flexible.

Assuming I'm not misunderstanding you, here are my thoughts.

* I'm not sure that distinguishing anon and file backed memory like
  that is the direction we want to head in.  In fact, the more
  uniformly we can behave across them, the more efficient we'd be as
  we wouldn't have that artificial barrier.  It is true that we don't
  have the same level of control for swap tho.

* Even if we want an independent anon limit, memsw isn't the solution.
  It's too conflated.  If you want to have an anon limit, the right
  thing to do would be pushing for an independent anon limit, not
  memsw.

Thanks.

-- 
tejun