Re: [net] 4890b686f4: netperf.Throughput_Mbps -69.4% regression

Xin Long <lucien.xin@xxxxxxxxx> · Thu, 23 Jun 2022 18:50:07 -0400

On Wed, Jun 22, 2022 at 11:08 PM Xin Long <lucien.xin@xxxxxxxxx> wrote:
>
> Yes, I'm working on it. I couldn't see the regression in my env with
> the 'reproduce' script attached.
> I will try with lkp tomorrow.
>
> Thanks.
>
> On Wed, Jun 22, 2022 at 8:29 PM Jakub Kicinski <kuba@xxxxxxxxxx> wrote:
> >
> > Could someone working on SCTP double check this is a real regression?
> > Feels like the regression reports are flowing at such a rate it's hard
> > to keep up.
> >
> > >
> > > commit:
> > >   7c80b038d2 ("net: fix sk_wmem_schedule() and sk_rmem_schedule() errors")
> > >   4890b686f4 ("net: keep sk->sk_forward_alloc as small as possible")
> > >
> > > 7c80b038d23e1f4c 4890b686f4088c90432149bd6de
> > > ---------------- ---------------------------
> > >          %stddev     %change         %stddev
> > >              \          |                \
> > >      15855           -69.4%       4854        netperf.Throughput_Mbps
> > >     570788           -69.4%     174773        netperf.Throughput_total_Mbps
...
> > >       0.00            +5.1        5.10 ±  5%  perf-profile.calltrace.cycles-pp.__sk_mem_reduce_allocated.sctp_wfree.skb_release_head_state.consume_skb.sctp_chunk_put
> > >       0.17 ±141%      +5.3        5.42 ±  6%  perf-profile.calltrace.cycles-pp.skb_release_head_state.consume_skb.sctp_chunk_put.sctp_outq_sack.sctp_cmd_interpreter
> > >       0.00            +5.3        5.35 ±  6%  perf-profile.calltrace.cycles-pp.sctp_wfree.skb_release_head_state.consume_skb.sctp_chunk_put.sctp_outq_sack
> > >       0.00            +5.5        5.51 ±  6%  perf-profile.calltrace.cycles-pp.__sk_mem_reduce_allocated.skb_release_head_state.kfree_skb_reason.sctp_recvmsg.inet_recvmsg
> > >       0.00            +5.7        5.65 ±  6%  perf-profile.calltrace.cycles-pp.skb_release_head_state.kfree_skb_reason.sctp_recvmsg.inet_recvmsg.____sys_recvmsg
...
> > >       0.00            +4.0        4.04 ±  6%  perf-profile.children.cycles-pp.mem_cgroup_charge_skmem
> > >       2.92 ±  6%      +4.2        7.16 ±  6%  perf-profile.children.cycles-pp.sctp_outq_sack
> > >       0.00            +4.3        4.29 ±  6%  perf-profile.children.cycles-pp.__sk_mem_raise_allocated
> > >       0.00            +4.3        4.32 ±  6%  perf-profile.children.cycles-pp.__sk_mem_schedule
> > >       1.99 ±  6%      +4.4        6.40 ±  6%  perf-profile.children.cycles-pp.consume_skb
> > >       1.78 ±  6%      +4.6        6.42 ±  6%  perf-profile.children.cycles-pp.kfree_skb_reason
> > >       0.37 ±  8%      +5.0        5.40 ±  6%  perf-profile.children.cycles-pp.sctp_wfree
> > >       0.87 ±  9%     +10.3       11.20 ±  6%  perf-profile.children.cycles-pp.skb_release_head_state
> > >       0.00           +10.7       10.66 ±  6%  perf-profile.children.cycles-pp.__sk_mem_reduce_allocated
...
> > >       0.00            +1.2        1.19 ±  7%  perf-profile.self.cycles-pp.try_charge_memcg
> > >       0.00            +2.0        1.96 ±  6%  perf-profile.self.cycles-pp.page_counter_uncharge
> > >       0.00            +2.1        2.07 ±  5%  perf-profile.self.cycles-pp.page_counter_try_charge
> > >       1.09 ±  8%      +2.8        3.92 ±  6%  perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
> > >       0.29 ±  6%      +3.5        3.81 ±  6%  perf-profile.self.cycles-pp.sctp_eat_data
> > >       0.00            +7.8        7.76 ±  6%  perf-profile.self.cycles-pp.__sk_mem_reduce_allocated

From the perf data, we can see that __sk_mem_reduce_allocated() is where
CPU usage increased the most compared to before, and that the mem_cgroup
APIs are called from this function. That means mem cgroup must be enabled
in the test env, which may explain why I couldn't reproduce it.

Commit 4890b686f4 ("net: keep sk->sk_forward_alloc as small as
possible") uses sk_mem_reclaim() (which checks reclaimable >= PAGE_SIZE)
to reclaim the memory, so __sk_mem_reduce_allocated() is called *more
frequently* than before (when the check was reclaimable >=
SK_RECLAIM_THRESHOLD). That might be cheap when
mem_cgroup_sockets_enabled is false, but I'm not sure it's still
cheap when mem_cgroup_sockets_enabled is true.

I think SCTP netperf could trigger this because the CPU is the
bottleneck in SCTP netperf testing, which makes it more sensitive to
extra function calls than TCP.

Can we re-run this test without mem cgroup enabled?

Thanks.