On Sat, Nov 12, 2022 at 10:00:01AM +0100, Jiri Wiesner wrote:
> On Mon, Oct 31, 2022 at 04:56:43PM +0200, Julian Anastasov wrote:
> > Use the provided u64_stats_t type to avoid
> > load/store tearing.

I think only 32-bit machines have a problem storing a 64-bit counter
atomically. u64_stats_fetch_begin() and u64_stats_fetch_retry() should
take care of that.

> > 
> > Fixes: 316580b69d0a ("u64_stats: provide u64_stats_t type")
> > Signed-off-by: Julian Anastasov <ja@xxxxxx>
> > ---
> >  include/net/ip_vs.h             | 10 +++++-----
> >  net/netfilter/ipvs/ip_vs_core.c | 30 +++++++++++++++---------------
> >  net/netfilter/ipvs/ip_vs_ctl.c  | 10 +++++-----
> >  net/netfilter/ipvs/ip_vs_est.c  | 20 ++++++++++----------
> >  4 files changed, 35 insertions(+), 35 deletions(-)
> > 
> > diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
> > index e5582c01a4a3..a4d44138c2a8 100644
> > --- a/include/net/ip_vs.h
> > +++ b/include/net/ip_vs.h
> > @@ -351,11 +351,11 @@ struct ip_vs_seq {
> >  
> >  /* counters per cpu */
> >  struct ip_vs_counters {
> > -	__u64		conns;		/* connections scheduled */
> > -	__u64		inpkts;		/* incoming packets */
> > -	__u64		outpkts;	/* outgoing packets */
> > -	__u64		inbytes;	/* incoming bytes */
> > -	__u64		outbytes;	/* outgoing bytes */
> > +	u64_stats_t	conns;		/* connections scheduled */
> > +	u64_stats_t	inpkts;		/* incoming packets */
> > +	u64_stats_t	outpkts;	/* outgoing packets */
> > +	u64_stats_t	inbytes;	/* incoming bytes */
> > +	u64_stats_t	outbytes;	/* outgoing bytes */
> >  };
> >  /* Stats per cpu */
> >  struct ip_vs_cpu_stats {
> 
> Converting the per cpu stat to u64_stats_t means that the compiler
> cannot optimize the memory access and addition on x86_64. Previously,
> the summation of per cpu counters in ip_vs_chain_estimation() looked
> like:
>    15b65:  add    (%rsi),%r14
>    15b68:  add    0x8(%rsi),%r15
>    15b6c:  add    0x10(%rsi),%r13
>    15b70:  add    0x18(%rsi),%r12
>    15b74:  add    0x20(%rsi),%rbp
> The u64_stats_read() calls in ip_vs_chain_estimation() turned it into:
>    159d5:  mov    (%rcx),%r11
>    159d8:  mov    0x8(%rcx),%r10
>    159dc:  mov    0x10(%rcx),%r9
>    159e0:  mov    0x18(%rcx),%rdi
>    159e4:  mov    0x20(%rcx),%rdx
>    159e8:  add    %r11,%r14
>    159eb:  add    %r10,%r13
>    159ee:  add    %r9,%r12
>    159f1:  add    %rdi,%rbp
>    159f4:  add    %rdx,%rbx
> I guess that is not a big deal because the mov should be the
> instruction taking the most time on account of accessing per cpu
> regions of other CPUs. The add will be fast. Another concern is the
> number of registers needed for the summation. Previously, 5 registers
> were needed. Now, 10 registers are needed. This would matter mostly if
> the size of the stack frame of ip_vs_chain_estimation() and the number
> of callee-saved registers stored to the stack changed, but introducing
> u64_stats_read() in ip_vs_chain_estimation() did not change that.

-- 
Jiri Wiesner
SUSE Labs
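
[For context, a minimal sketch of a reader that sums one per-cpu counter
with the u64_stats API. This is not the actual ip_vs_chain_estimation()
code; the cpustats/cnt/syncp field names follow the struct
ip_vs_cpu_stats layout quoted above, and sum_conns() is only an
illustrative helper. On 64-bit the fetch_begin/fetch_retry pair compiles
away, leaving just the plain loads seen in the disassembly; on 32-bit it
forces a re-read if a writer raced with us.]

#include <linux/percpu.h>
#include <linux/u64_stats_sync.h>
#include <net/ip_vs.h>

/* Illustrative only: sum the "conns" counter over all possible CPUs. */
static u64 sum_conns(struct ip_vs_cpu_stats __percpu *cpustats)
{
	u64 sum = 0;
	int cpu;

	for_each_possible_cpu(cpu) {
		struct ip_vs_cpu_stats *s = per_cpu_ptr(cpustats, cpu);
		unsigned int start;
		u64 conns;

		do {
			/* No-op on 64-bit; on 32-bit the seqcount detects a
			 * concurrent writer and retries the read, so a torn
			 * 64-bit value is never observed.
			 */
			start = u64_stats_fetch_begin(&s->syncp);
			conns = u64_stats_read(&s->cnt.conns);
		} while (u64_stats_fetch_retry(&s->syncp, start));

		sum += conns;
	}
	return sum;
}

[The writer side would bracket u64_stats_inc()/u64_stats_add() with
u64_stats_update_begin()/u64_stats_update_end(), which is likewise
essentially free on 64-bit builds.]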