On Tue, Mar 01, 2022 at 02:50:06PM -0600, Eric W. Biederman wrote:
> Luis Chamberlain <mcgrof@xxxxxxxxxx> writes:
>
> > On Tue, Mar 01, 2022 at 10:09:17AM -0800, Shakeel Butt wrote:
> >> On Mon, Feb 28, 2022 at 06:36:58AM -0800, Luis Chamberlain wrote:
> >> > On Mon, Feb 28, 2022 at 10:17:16AM +0300, Vasily Averin wrote:
> >> > > The following one-liner, running inside a memcg-limited container,
> >> > > consumes a huge amount of host memory and can trigger a global OOM.
> >> > >
> >> > > for i in `seq 1 xxx` ; do ip l a v$i type veth peer name vp$i ; done
> >> > >
> >> > > The patch accounts most of these allocations and can protect the host.
> >> > > ---[cut]---
> >> > > It is not polished, and perhaps should be split.
> >> > > Obviously it affects other kinds of netdevices too.
> >> > > Unfortunately I'm not sure that I will have enough time to handle it
> >> > > properly, and decided to publish the current patch version as is.
> >> > > OpenVz works around it by using a per-container limit on the number
> >> > > of available netdevices, but upstream does not have any kind of
> >> > > per-container configuration.
> >> > > ------
> >>
> >> > Should this just be a new ucount limit in kernel/ucount.c and have veth
> >> > use something like inc_ucount(current_user_ns(), current_euid(),
> >> > UCOUNT_VETH)?
> >>
> >> > This might be abusing ucounts though, not sure, Eric?
> >>
> >> For admins of systems running multiple workloads, there is no easy way
> >> to set such limits for each workload.
> >
> > That's why defaults would exist. Today's ulimits IMHO are insane and
> > some are arbitrarily large.
>
> My perspective is that we have two basic kinds of limits.
>
> Limits to catch programs that go out of control, hopefully before they
> bring down the entire system. This is the purpose I see for rlimits and
> ucounts. Such limits should be set so large by default that no one has
> to care unless their program is broken.
>
> Limits to contain programs and keep them from having a negative impact
> on other programs. Generally this is the role I see cgroups playing.
> These limits must be much more tightly managed.
>
> The problem reported with veth was that the memory cgroup limits fail
> to contain veth's allocations, and veth manages to affect processes
> outside the memory cgroup where the veth ``lives''. The effect is an
> OOM, but the real problem is that it is affecting processes outside of
> the memory control group.

Given that no upper bound was used in the commit log, it seems to
present a use case where both types of limits might need to be
considered, though.

> Part of the reason for the recent ucount work is so that ordinary users
> can create user namespaces, and root in that user namespace won't be
> able to exceed the limits that were set when the user namespace was
> created by creating additional users.

Got it.

> Part of the reason for my ucount work is my frustration that cgroups
> wound up as something completely different from what was originally
> proposed and solve a rather different problem set. Originally the
> proposal was that cgroups would be the user interface for the
> bean-counter patches (roughly counts like the ucounts are now). Except
> for maybe the pid controller you mention below, cgroups look nothing
> like that today. So I went and solved the original problem because it
> was still not solved.

I see...

> The network stack should already have packet limits to prevent a global
> OOM, so I am a bit curious why those limits aren't preventing a global
> OOM for the veth device.
No packets are used in the demo / commit log; it just creates tons of
veths until the system OOMs.

> I am not saying that the patch is correct (although from 10,000 feet
> the patch sounds like it is solving the reported problem).
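
To make my earlier ucount suggestion a bit more concrete, here is a
rough, untested sketch of what I had in mind. UCOUNT_VETH is
hypothetical (it does not exist upstream and would need to be added to
enum ucount_type), and the unwind / device-removal paths are omitted:

  #include <linux/cred.h>
  #include <linux/user_namespace.h>

  /* Charge one veth pair against the creating user's UCOUNT_VETH limit.
   * inc_ucount() returns NULL once the per-user limit is exceeded.
   * UCOUNT_VETH is a hypothetical new enum ucount_type entry. */
  static int veth_charge_ucount(struct ucounts **ucounts)
  {
	*ucounts = inc_ucount(current_user_ns(), current_euid(), UCOUNT_VETH);
	if (!*ucounts)
		return -ENOSPC;
	return 0;
  }

  /* veth_newlink() would call this before allocating the peer device,
   * and the matching dec_ucount(ucounts, UCOUNT_VETH) would go in the
   * error path and in the device destructor. */

Per your first kind of limit, the default could be set large enough
that nothing but a runaway loop like the one in the commit log would
ever hit it.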