On Tue, Mar 01, 2022 at 02:50:06PM -0600, Eric W. Biederman wrote:
> Luis Chamberlain <mcgrof@xxxxxxxxxx> writes:
>
> > On Tue, Mar 01, 2022 at 10:09:17AM -0800, Shakeel Butt wrote:
> >> On Mon, Feb 28, 2022 at 06:36:58AM -0800, Luis Chamberlain wrote:
> >> > On Mon, Feb 28, 2022 at 10:17:16AM +0300, Vasily Averin wrote:
> >> > > The following one-liner, running inside a memcg-limited container,
> >> > > consumes a huge amount of host memory and can trigger a global OOM.
> >> > >
> >> > > for i in `seq 1 xxx` ; do ip l a v$i type veth peer name vp$i ; done
> >> > >
> >> > > The patch accounts most of these allocations and can protect the host.
> >> > > ---[cut]---
> >> > > It is not polished, and perhaps should be split.
> >> > > Obviously it affects other kinds of netdevices too.
> >> > > Unfortunately I'm not sure that I will have enough time to handle it
> >> > > properly, and decided to publish the current patch version as is.
> >> > > OpenVz works around it by using a per-container limit on the number
> >> > > of available netdevices, but upstream does not have any kind of
> >> > > per-container configuration.
> >> > > ------
> >>
> >> > Should this just be a new ucount limit in kernel/ucount.c and have veth
> >> > use something like inc_ucount(current_user_ns(), current_euid(),
> >> > UCOUNT_VETH)?
> >>
> >> > This might be abusing ucounts though, not sure, Eric?
> >>
> >> For admins of systems running multiple workloads, there is no easy way
> >> to set such limits for each workload.
> >
> > That's why defaults would exist. Today's ulimits IMHO are insane and
> > some are arbitrarily large.
>
> My perspective is that we have two basic kinds of limits.
>
> Limits to catch programs that go out of control, hopefully before they
> bring down the entire system. This is the purpose I see for rlimits and
> ucounts. Such limits should be set so large by default that no one has
> to care unless their program is broken.
>
> Limits to contain programs and keep them from having a negative impact
> on other programs. Generally this is the role I see cgroups playing.
> These limits must be much more tightly managed.
>
> The problem reported with veth was that the memory cgroup limits fail
> to contain veth's allocations, and veth manages to affect processes
> outside the memory cgroup where the veth ``lives''. The effect is an
> OOM, but the real problem is that it is affecting processes outside of
> the memory control group.

Given that no upper bound was used in the commit log, it seems to
present a use case where both types of limits might need to be
considered, though.

> Part of the reason for the recent ucount work is so that ordinary users
> can create user namespaces, and root in that user namespace won't be
> able to exceed the limits that were set when the user namespace was
> created by creating additional users.

Got it.

> Part of the reason for my ucount work is my frustration that cgroups
> wound up as something completely different from what was originally
> proposed and solve a rather different problem set. Originally the
> proposal was that cgroups would be the user interface for the
> bean-counter patches (roughly counts like the ucounts are now). Except
> for maybe the pid controller you mention below, cgroups look nothing
> like that today. So I went and solved the original problem because it
> was still not solved.

I see...

> The network stack should already have packet limits to prevent a global
> OOM, so I am a bit curious why those limits aren't preventing a global
> OOM for the veth device.
No packets are used in the demo / commit log; it just creates tons of
veths until the system OOMs.

> I am not saying that the patch is correct (although from 10,000 feet
> the patch sounds like it is solving the reported problem).
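
To make my earlier ucount suggestion a bit more concrete, here is a
rough, untested sketch of what I had in mind. UCOUNT_VETH is
hypothetical (it does not exist upstream and would need to be added to
enum ucount_type), and the unwind / device-removal paths are omitted:

  #include <linux/cred.h>
  #include <linux/user_namespace.h>

  /* Charge one veth pair against the creating user's UCOUNT_VETH limit.
   * inc_ucount() returns NULL once the per-user limit is exceeded.
   * UCOUNT_VETH is a hypothetical new enum ucount_type entry. */
  static int veth_charge_ucount(struct ucounts **ucounts)
  {
	*ucounts = inc_ucount(current_user_ns(), current_euid(), UCOUNT_VETH);
	if (!*ucounts)
		return -ENOSPC;
	return 0;
  }

  /* veth_newlink() would call this before allocating the peer device,
   * and the matching dec_ucount(ucounts, UCOUNT_VETH) would go in the
   * error path and in the device destructor. */

Per your first kind of limit, the default could be set large enough
that nothing but a runaway loop like the one in the commit log would
ever hit it.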