Hello,

On Tue, 15 Nov 2022, Jiri Wiesner wrote:

> On Sat, Nov 12, 2022 at 06:01:36PM +0200, Julian Anastasov wrote:
> >
> > 	Some compilers for 64-bit can use two 32-bit
> > loads (load tearing) while we require an atomic 64-bit read
> > in case the reader is interrupted by the updater. That is why READ_ONCE
> > is used now. I don't know which compilers can do this. That is
> > why I switched to u64_stats_t to correctly use the u64_stats
> > interface. And our data is 8-byte aligned. 32-bit builds are using
> > seqcount_t, so they are protected from this problem.
>
> All right. It is counter-intuitive but I guess those compilers have their
> reasons. I think it would not hurt to expand the description of the patch
> with the explanation above.

OK. I relied on the mentioned commit to explain the problem, as other
similar patches do.

> > 	If you think we are better off with such a change, we
> > can include it in patch 4. It will be better in case one day
> > we add more counters and there are more available registers.
> > You can notice the difference probably by comparing the
> > chain_max value in performance mode.
>
> I have tested the change to ip_vs_chain_estimation() and ran profiling
> with bus-cycles, which are proportional to the time spent executing code.
> The number of estimators per kthread was the same in both tests - 88800.
> It turns out the change made the code slower:
> > #         Util1           Util2              Diff             Command     Shared Object      Symbol  CPU
> >   1,124,932,796   1,143,867,951    18,935,155 (  1.7%)  ipvs-e:0:0  [ip_vs]            all     all

chain_max is probably 37 here and 1.7% is low enough (below 1/37) not
to be reflected in chain_max.

> >      16,480,274      16,853,752       373,478 (  2.3%)  ipvs-e:0:1  [ip_vs]            all     all
> >               0          21,987        21,987 (100.0%)  ipvs-e:0:0  [kvm]              all     all
> >       4,622,992       4,366,150      -256,842 (  5.6%)  ipvs-e:0:1  [kernel.kallsyms]  all     all
> >     180,226,857     164,665,271   -15,561,586 (  8.6%)  ipvs-e:0:0  [kernel.kallsyms]  all     all
>
> The most significant component was the time spent in ip_vs_chain_estimation():
> > #         Util1           Util2              Diff             Command     Shared Object      Symbol                    CPU
> >   1,124,414,110   1,143,352,233    18,938,123 (  1.7%)  ipvs-e:0:0  [ip_vs]            ip_vs_chain_estimation    all
> >      16,291,044      16,574,498       283,454 (  1.7%)  ipvs-e:0:1  [ip_vs]            ip_vs_chain_estimation    all
> >         189,230         279,254        90,024 ( 47.6%)  ipvs-e:0:1  [ip_vs]            ip_vs_estimation_kthread  all
> >         518,686         515,718        -2,968 (  0.6%)  ipvs-e:0:0  [ip_vs]            ip_vs_estimation_kthread  all
> >       1,562,512       1,377,609      -184,903 ( 11.8%)  ipvs-e:0:0  [kernel.kallsyms]  kthread_should_stop       all
> >       2,435,803       2,138,404      -297,399 ( 12.2%)  ipvs-e:0:1  [kernel.kallsyms]  _find_next_bit            all
> >      16,304,577      15,786,553      -518,024 (  3.2%)  ipvs-e:0:0  [kernel.kallsyms]  _raw_spin_lock            all
> >     151,110,969     137,172,160   -13,938,809 (  9.2%)  ipvs-e:0:0  [kernel.kallsyms]  _find_next_bit            all
>
> The disassembly shows it is the mov instructions (there is a skid of
> 1 instruction) that take the most time:
> > Percent |  Source code & Disassembly of kcore for bus-cycles (55354 samples, percent: local period)
> >         :  ffffffffc0bad980 <ip_vs_chain_estimation>:
> >      v6 patchset                              v6 patchset + ip_vs_chain_estimation() change
> > -------------------------------------------------------------------------------------------------------
> >  0.00 : add    -0x72ead560(,%rdx,8),%rcx      0.00 : add    -0x5ccad560(,%rdx,8),%rcx
> >  4.07 : mov    (%rcx),%r11                    3.84 : mov    (%rcx),%rdx
> > 34.28 : mov    0x8(%rcx),%r10                34.17 : add    %rdx,%r14
> > 10.35 : mov    0x10(%rcx),%r9                 2.62 : mov    0x8(%rcx),%rdx
> >  7.28 : mov    0x18(%rcx),%rdi                9.33 : add    %rdx,%r13
> >  7.70 : mov    0x20(%rcx),%rdx                1.82 : mov    0x10(%rcx),%rdx
> >  6.80 : add    %r11,%r14                      6.11 : add    %rdx,%r12
> >  0.27 : add    %r10,%r13                      1.26 : mov    0x18(%rcx),%rdx
> >  0.36 : add    %r9,%r12                       6.00 : add    %rdx,%rbp
> >  2.24 : add    %rdi,%rbp                      2.97 : mov    0x20(%rcx),%rdx

Bad, the same register RDX is used. I guess this is a
CONFIG_GENERIC_CPU=y kernel, which is suitable for different CPUs. My
example was tested with CONFIG_MCORE2=y. But the optimizations are the
compiler's job, and there can be a cost to free registers for our
operations. Who knows, the change can be faster on other platforms with
fewer registers available to hold the 5 counters and their sums.
Probably, we can force it to add mem8 into reg by using something like
this:

static inline void u64_stats_read_add(u64_stats_t *p, u64 *sum)
{
        local64_add_to(&p->v, (long *) sum);
}

and, at the arch level, something like:

static inline void local_add_to(local_t *l, long *sum)
{
        asm(_ASM_ADD "%1,%0"
            : "+r" (*sum)
            : "m" (l->a.counter));
}
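For illustration only, the same "add mem8 into reg" idea can be checked
with a small user-space sketch (the names add_from_mem and struct
est_sample below are made up for the example, nothing from the kernel
tree), e.g. by looking at what gcc -O2 -S generates on x86-64:

/* Stand-alone sketch, not kernel code: it only demonstrates forcing
 * "add m64, r64" via an inline asm memory operand, as in the
 * local_add_to() sketch above.
 */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

struct est_sample {             /* stand-in for the 5 estimator counters */
        uint64_t conns, inpkts, outpkts, inbytes, outbytes;
};

static inline void add_from_mem(const uint64_t *val, uint64_t *sum)
{
        /* "+r": the running sum stays in a register,
         * "m":  the add reads the counter straight from memory,
         * so each counter costs one add with a memory operand
         * instead of a separate mov + add.
         */
        __asm__("addq %1,%0" : "+r" (*sum) : "m" (*val));
}

int main(void)
{
        struct est_sample s[4] = {
                { 1, 100, 200, 10000, 20000 },
                { 2, 300, 400, 30000, 40000 },
                { 3, 500, 600, 50000, 60000 },
                { 4, 700, 800, 70000, 80000 },
        };
        uint64_t conns = 0, inpkts = 0, outpkts = 0;
        uint64_t inbytes = 0, outbytes = 0;
        int i;

        for (i = 0; i < 4; i++) {
                add_from_mem(&s[i].conns, &conns);
                add_from_mem(&s[i].inpkts, &inpkts);
                add_from_mem(&s[i].outpkts, &outpkts);
                add_from_mem(&s[i].inbytes, &inbytes);
                add_from_mem(&s[i].outbytes, &outbytes);
        }

        printf("conns=%" PRIu64 " inpkts=%" PRIu64 " outpkts=%" PRIu64
               " inbytes=%" PRIu64 " outbytes=%" PRIu64 "\n",
               conns, inpkts, outpkts, inbytes, outbytes);
        return 0;
}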
But it deserves more testing.

> The results indicate that the extra add instructions should not matter
> - memory accesses (the mov instructions) are the bottleneck.

Yep, then let's work without this change for now, no need to deviate
much from the general u64_stats usage if the difference is negligible.

Regards

--
Julian Anastasov <ja@xxxxxx>