Re: Re: [RFC PATCH net-next] sock: Propose socket.urgent for sockmem isolation

Abel Wu <wuyun.abel@xxxxxxxxxxxxx> · Tue, 13 Jun 2023 14:46:55 +0800

On 6/10/23 1:53 AM, Shakeel Butt wrote:
On Fri, Jun 9, 2023 at 2:07 PM Eric Dumazet <edumazet@xxxxxxxxxx> wrote:

On Fri, Jun 9, 2023 at 10:28 AM Abel Wu <wuyun.abel@xxxxxxxxxxxxx> wrote:

This is just a PoC patch intended to resume the discussion about
tcpmem isolation opened by Google in LPC'22 [1].

We face the same problem: the globally shared threshold can cause
isolation issues. Low priority jobs can hog TCP memory and adversely
impact higher priority jobs. What's worse, these low priority jobs
usually have smaller cpu weights, which leaves them less able to
consume rx data.

To tackle this problem, a memory controller interface for non-root
cgroups named 'socket.urgent' is proposed. It determines whether the
sockets of this cgroup and its descendants can escape the global
constraints under global socket memory pressure.

The 'urgent' semantics will not take effect under memcg pressure, in
order to protect against worse memstalls, so behavior in that case is
the same as without this patch.

This proposal doesn't remove protocol's threshold as we found it
useful in restraining memory defragment. As mentioned above, the low
priority jobs can hog lots of memory, which is unreclaimable and
unmovable, for some time due to their small cpu weight.

So in practice we allow high priority jobs with net-memcg accounting
enabled to escape the global constraints if the net-memcg itself is
not under pressure. For lower priority jobs, the budget will be
tightened as the memory usage of 'urgent' jobs increases. In this
way we can finally achieve:

   - Important jobs won't suffer priority inversion from background
     jobs in terms of socket memory pressure/limit.

   - Global constraints are still effective, but only on non-urgent
     jobs, which is useful for admins making policy decisions on
     defrag.
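
The policy described above can be sketched roughly as a toy model in
Python. This is not the patch itself: the names Cgroup, urgent,
under_memcg_pressure, is_urgent() and throttled_by_global() are
hypothetical and exist only for illustration, and checking memcg
pressure on the cgroup alone is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cgroup:
    """Toy stand-in for a memory cgroup; every field is hypothetical."""
    name: str
    urgent: bool = False                # models the proposed 'socket.urgent' knob
    under_memcg_pressure: bool = False  # models pressure inside this memcg
    parent: Optional["Cgroup"] = None


def is_urgent(cg: Cgroup) -> bool:
    """The setting covers a cgroup and its descendants, so walk up the tree."""
    node: Optional[Cgroup] = cg
    while node is not None:
        if node.urgent:
            return True
        node = node.parent
    return False


def throttled_by_global(cg: Cgroup, global_pressure: bool) -> bool:
    """Return True if sockets in cg are subject to the global constraints."""
    if not global_pressure:
        return False  # no global pressure, nothing to enforce
    if is_urgent(cg) and not cg.under_memcg_pressure:
        return False  # urgent and healthy: escapes the global constraints
    return True       # non-urgent, or urgent but under memcg pressure


root = Cgroup("root")
hi = Cgroup("hi", urgent=True, parent=root)
hi_child = Cgroup("hi/child", parent=hi)   # inherits urgency from 'hi'
lo = Cgroup("lo", parent=root)
```

In this sketch hi_child escapes the global constraints while lo does
not, and hi_child falls back to the old behavior as soon as its own
memcg is under pressure.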

Comments and ideas are welcome, thanks!

This seems to go in the complete opposite direction from what memcg
promises.

Can we fix memcg, so that:

Each group can use the memory it was provisioned (this includes TCP buffers)

Global tcp_memory can disappear (set tcp_mem to infinity)

I agree with Eric and this is exactly how we at Google overcome the
isolation issue. We have set tcp_mem to unlimited and enabled memcg
accounting of network memory (by surgically incorporating v2 semantics
of network memory accounting in our v1 environment).

I do have one question though:

This proposal doesn't remove protocol's threshold as we found it
useful in restraining memory defragment.

Can you explain how you find the global tcp limit useful? What does
memory defragment mean?

We co-locate different kinds of jobs with different priorities in
cgroups. Some of them are background jobs that can have lots of net
data to process, e.g. training jobs. These background jobs usually
don't have enough cpu bandwidth to consume the rx data in time if more
important jobs are running simultaneously. The data can accumulate and
eat up some or all of the provisioned memory. This unreclaimable memory
can gradually fragment the whole of memory. We have already found many
such cases in our production environment.

Maybe 'defragment' is not the proper word, as what we do is try to
prevent fragmentation rather than defragment the way compaction does.
With global tcp_mem pressure/limit and socket.urgent, we are able to
achieve this goal, at least to some extent.

And it is not only the global tcp limit: the pressure threshold can
also cause something like priority inversion. We monitored the top 20
priority jobs and found their performance reduced by 2~9% under global
tcp memory pressure (and sometimes the majority of
sk_memory_allocated() is contributed by the low priority jobs). This
has nothing to do with 'memory defrag', though.