Hello,

On Tue, 15 Nov 2022, Jiri Wiesner wrote:

> On Sat, Nov 12, 2022 at 06:01:36PM +0200, Julian Anastasov wrote:
> >
> > 	Some compilers for 64-bit can use two 32-bit
> > loads (load tearing) while we require an atomic 64-bit read
> > in case the reader is interrupted by the updater. That is why READ_ONCE
> > is used now. I don't know which compilers can do this. That is
> > why I switched to u64_stats_t to correctly use the u64_stats
> > interface. And our data is 8-byte aligned. 32-bit builds are using
> > seqcount_t, so they are protected from this problem.
>
> All right. It is counter-intuitive but I guess those compilers have their
> reasons. I think it would not hurt to expand the description of the patch
> with the explanation above.

OK. I relied on the mentioned commit to explain the problem, as other
similar patches do.

> > 	If you think we are better off with such a change, we
> > can include it in patch 4. It will be better in case one day
> > we add more counters and there are more available registers.
> > You can notice the difference probably by comparing the
> > chain_max value in performance mode.
>
> I have tested the change to ip_vs_chain_estimation() and ran profiling
> with bus-cycles, which are proportional to the time spent executing code.
> The number of estimators per kthread was the same in both tests - 88800.
> It turns out the change made the code slower:
> > #         Util1           Util2              Diff             Command     Shared Object      Symbol  CPU
> >   1,124,932,796   1,143,867,951    18,935,155 (  1.7%)  ipvs-e:0:0  [ip_vs]            all     all

chain_max is probably 37 here and 1.7% is low enough (below 1/37) not
to be reflected in chain_max.

> >      16,480,274      16,853,752       373,478 (  2.3%)  ipvs-e:0:1  [ip_vs]            all     all
> >               0          21,987        21,987 (100.0%)  ipvs-e:0:0  [kvm]              all     all
> >       4,622,992       4,366,150      -256,842 (  5.6%)  ipvs-e:0:1  [kernel.kallsyms]  all     all
> >     180,226,857     164,665,271   -15,561,586 (  8.6%)  ipvs-e:0:0  [kernel.kallsyms]  all     all
>
> The most significant component was the time spent in ip_vs_chain_estimation():
> > #         Util1           Util2              Diff             Command     Shared Object      Symbol                    CPU
> >   1,124,414,110   1,143,352,233    18,938,123 (  1.7%)  ipvs-e:0:0  [ip_vs]            ip_vs_chain_estimation    all
> >      16,291,044      16,574,498       283,454 (  1.7%)  ipvs-e:0:1  [ip_vs]            ip_vs_chain_estimation    all
> >         189,230         279,254        90,024 ( 47.6%)  ipvs-e:0:1  [ip_vs]            ip_vs_estimation_kthread  all
> >         518,686         515,718        -2,968 (  0.6%)  ipvs-e:0:0  [ip_vs]            ip_vs_estimation_kthread  all
> >       1,562,512       1,377,609      -184,903 ( 11.8%)  ipvs-e:0:0  [kernel.kallsyms]  kthread_should_stop       all
> >       2,435,803       2,138,404      -297,399 ( 12.2%)  ipvs-e:0:1  [kernel.kallsyms]  _find_next_bit            all
> >      16,304,577      15,786,553      -518,024 (  3.2%)  ipvs-e:0:0  [kernel.kallsyms]  _raw_spin_lock            all
> >     151,110,969     137,172,160   -13,938,809 (  9.2%)  ipvs-e:0:0  [kernel.kallsyms]  _find_next_bit            all
>
> The disassembly shows it is the mov instructions (there is a skid of
> 1 instruction) that take the most time:
> > Percent |  Source code & Disassembly of kcore for bus-cycles (55354 samples, percent: local period)
> >         :  ffffffffc0bad980 <ip_vs_chain_estimation>:
> >      v6 patchset                              v6 patchset + ip_vs_chain_estimation() change
> > -------------------------------------------------------------------------------------------------------
> >  0.00 : add    -0x72ead560(,%rdx,8),%rcx      0.00 : add    -0x5ccad560(,%rdx,8),%rcx
> >  4.07 : mov    (%rcx),%r11                    3.84 : mov    (%rcx),%rdx
> > 34.28 : mov    0x8(%rcx),%r10                34.17 : add    %rdx,%r14
> > 10.35 : mov    0x10(%rcx),%r9                 2.62 : mov    0x8(%rcx),%rdx
> >  7.28 : mov    0x18(%rcx),%rdi                9.33 : add    %rdx,%r13
> >  7.70 : mov    0x20(%rcx),%rdx                1.82 : mov    0x10(%rcx),%rdx
> >  6.80 : add    %r11,%r14                      6.11 : add    %rdx,%r12
> >  0.27 : add    %r10,%r13                      1.26 : mov    0x18(%rcx),%rdx
> >  0.36 : add    %r9,%r12                       6.00 : add    %rdx,%rbp
> >  2.24 : add    %rdi,%rbp                      2.97 : mov    0x20(%rcx),%rdx

Bad, the same register RDX is used. I guess this is a
CONFIG_GENERIC_CPU=y kernel, which is suitable for different CPUs. My
example was tested with CONFIG_MCORE2=y. But the optimizations are the
compiler's job, and there can be a cost to free registers for our
operations. Who knows, the change can be faster on other platforms with
fewer registers available to hold the 5 counters and their sums.
Probably, we can force it to add mem8 into reg by using something like
this:

static inline void u64_stats_read_add(u64_stats_t *p, u64 *sum)
{
        local64_add_to(&p->v, (long *) sum);
}

and, at the arch level, something like:

static inline void local_add_to(local_t *l, long *sum)
{
        asm(_ASM_ADD "%1,%0"
            : "+r" (*sum)
            : "m" (l->a.counter));
}
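For illustration only, the same "add mem8 into reg" idea can be checked
with a small user-space sketch (the names add_from_mem and struct
est_sample below are made up for the example, nothing from the kernel
tree), e.g. by looking at what gcc -O2 -S generates on x86-64:

/* Stand-alone sketch, not kernel code: it only demonstrates forcing
 * "add m64, r64" via an inline asm memory operand, as in the
 * local_add_to() sketch above.
 */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

struct est_sample {             /* stand-in for the 5 estimator counters */
        uint64_t conns, inpkts, outpkts, inbytes, outbytes;
};

static inline void add_from_mem(const uint64_t *val, uint64_t *sum)
{
        /* "+r": the running sum stays in a register,
         * "m":  the add reads the counter straight from memory,
         * so each counter costs one add with a memory operand
         * instead of a separate mov + add.
         */
        __asm__("addq %1,%0" : "+r" (*sum) : "m" (*val));
}

int main(void)
{
        struct est_sample s[4] = {
                { 1, 100, 200, 10000, 20000 },
                { 2, 300, 400, 30000, 40000 },
                { 3, 500, 600, 50000, 60000 },
                { 4, 700, 800, 70000, 80000 },
        };
        uint64_t conns = 0, inpkts = 0, outpkts = 0;
        uint64_t inbytes = 0, outbytes = 0;
        int i;

        for (i = 0; i < 4; i++) {
                add_from_mem(&s[i].conns, &conns);
                add_from_mem(&s[i].inpkts, &inpkts);
                add_from_mem(&s[i].outpkts, &outpkts);
                add_from_mem(&s[i].inbytes, &inbytes);
                add_from_mem(&s[i].outbytes, &outbytes);
        }

        printf("conns=%" PRIu64 " inpkts=%" PRIu64 " outpkts=%" PRIu64
               " inbytes=%" PRIu64 " outbytes=%" PRIu64 "\n",
               conns, inpkts, outpkts, inbytes, outbytes);
        return 0;
}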
But it deserves more testing.

> The results indicate that the extra add instructions should not matter
> - memory accesses (the mov instructions) are the bottleneck.

Yep, then let's work without this change for now, no need to deviate
much from the general u64_stats usage if the difference is negligible.

Regards

--
Julian Anastasov <ja@xxxxxx>