On Sat, Nov 12, 2022 at 06:01:36PM +0200, Julian Anastasov wrote:
> 	Hello,
> 
> On Sat, 12 Nov 2022, Jiri Wiesner wrote:
> 
> > On Sat, Nov 12, 2022 at 10:00:01AM +0100, Jiri Wiesner wrote:
> > > On Mon, Oct 31, 2022 at 04:56:43PM +0200, Julian Anastasov wrote:
> > > > Use the provided u64_stats_t type to avoid
> > > > load/store tearing.
> > > 
> > > I think only 32-bit machines have a problem storing a 64-bit counter
> > > atomically. u64_stats_fetch_begin() and u64_stats_fetch_retry() should
> > > take care of that.
> 
> 	Some compilers for 64-bit can use two 32-bit
> loads (load tearing) while we require atomic 64-bit read
> in case reader is interrupted by updater. That is why READ_ONCE
> is used now. I don't know which compilers can do this. That is
> why I switched to u64_stats_t to correctly use the u64_stats
> interface. And our data is 8-byte aligned. 32-bit are using
> seqcount_t, so they are protected for this problem.

All right. It is counter-intuitive, but I guess those compilers have their reasons. I think it would not hurt to expand the description of the patch with the explanation above.

> > Converting the per cpu stats to u64_stats_t means that the compiler
> > cannot optimize the memory access and addition on x86_64. Previously,
> > the summation of per cpu counters in ip_vs_chain_estimation() looked like:
> >    15b65: add    (%rsi),%r14
> >    15b68: add    0x8(%rsi),%r15
> >    15b6c: add    0x10(%rsi),%r13
> >    15b70: add    0x18(%rsi),%r12
> >    15b74: add    0x20(%rsi),%rbp
> > The u64_stats_read() calls in ip_vs_chain_estimation() turned it into:
> >    159d5: mov    (%rcx),%r11
> >    159d8: mov    0x8(%rcx),%r10
> >    159dc: mov    0x10(%rcx),%r9
> >    159e0: mov    0x18(%rcx),%rdi
> >    159e4: mov    0x20(%rcx),%rdx
> >    159e8: add    %r11,%r14
> >    159eb: add    %r10,%r13
> >    159ee: add    %r9,%r12
> >    159f1: add    %rdi,%rbp
> >    159f4: add    %rdx,%rbx
> > I guess that is not a big deal because the mov should be the instruction
> > taking the most time on account of accessing the per cpu regions of
> > other CPUs. The add will be fast.
> > 
> > Another concern is the number of registers needed for the summation.
> > Previously, 5 registers were needed. Now, 10 registers are needed. This
> > would matter mostly if the size of the stack frame of
> > ip_vs_chain_estimation() and the number of callee-saved registers
> > stored to the stack changed, but introducing u64_stats_read() in
> > ip_vs_chain_estimation() did not change that.
> 
> 	It happens in this way probably because the compiler should not
> reorder two or more successive instances of READ_ONCE (rwonce.h),
> something that hurts u64_stats_read(). As a result, it multiplies
> the needed temp storage (registers or stack) by up to 5 times.
> 
> 	Another option would be to use something like this below, try it
> on top of v6. It is rude to add ifdefs here, but we should avoid
> unneeded code for 64-bit. On 32-bit, the code is longer but the
> operations are the same. On 64-bit it should order read+add one by one,
> which can run in parallel. In my compile tests it uses movq (READ_ONCE)
> + addq (to sum) with the same temp register for 3 of the counters,
> which defeats fully parallel operation, and for the last 2 counters it
> uses movq, movq, addq, addq with different registers, which can run in
> parallel. It seems there are no free registers to do it for all 5
> counters.
> 
> 	With this patch the benefit is that the compiler should not hold
> all 5 values before adding them, but on x86 this patch shows some
> problem: it does not allocate 5 new temp registers to read and add the
> values in parallel. Without this patch, as you show it, it somehow
> allocates 5+5 registers, which allows parallel operations.
> 
> 	If you think we are better off with such a change, we can include
> it in patch 4. It will be better in case one day we add more counters
> and there are more available registers. You can notice the difference
> probably by comparing the chain_max value in performance mode.
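Just to spell out what I benchmarked below: as I read your description, the difference is only in the order of the reads and the additions for the five counters inside the per-CPU loop (c is the per-CPU pointer, the k* variables are the running sums). A minimal sketch of the two variants - this is my illustration with approximate field names, not your actual patch:

        /* Variant A - v6 as posted: five u64_stats_read() calls in a row.
         * Each one is a READ_ONCE(), and successive READ_ONCE()s are not
         * reordered, so the compiler keeps five temporaries and only then
         * does the five additions.
         */
        conns    = u64_stats_read(&c->cnt.conns);
        inpkts   = u64_stats_read(&c->cnt.inpkts);
        outpkts  = u64_stats_read(&c->cnt.outpkts);
        inbytes  = u64_stats_read(&c->cnt.inbytes);
        outbytes = u64_stats_read(&c->cnt.outbytes);
        kconns    += conns;
        kinpkts   += inpkts;
        koutpkts  += outpkts;
        kinbytes  += inbytes;
        koutbytes += outbytes;

        /* Variant B - interleaved read+add, as I understand your proposal:
         * a single temporary suffices, but each load is immediately
         * followed by the add that depends on it.
         */
        kconns    += u64_stats_read(&c->cnt.conns);
        kinpkts   += u64_stats_read(&c->cnt.inpkts);
        koutpkts  += u64_stats_read(&c->cnt.outpkts);
        kinbytes  += u64_stats_read(&c->cnt.inbytes);
        koutbytes += u64_stats_read(&c->cnt.outbytes);

In the annotation further down, the left column matches variant A and the right column shows the interleaved pattern of variant B.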
I have tested the change to ip_vs_chain_estimation() and profiled it with bus-cycles, which are proportional to the time spent executing code. The number of estimators per kthread was the same in both tests - 88800. It turns out the change made the code slower (Util1 is the v6 patchset, Util2 is v6 plus the change to ip_vs_chain_estimation()):
> #        Util1          Util2                    Diff  Command     Shared Object      Symbol  CPU
> 1,124,932,796  1,143,867,951   18,935,155 (  1.7%)     ipvs-e:0:0  [ip_vs]            all     all
>    16,480,274     16,853,752      373,478 (  2.3%)     ipvs-e:0:1  [ip_vs]            all     all
>             0         21,987       21,987 (100.0%)     ipvs-e:0:0  [kvm]              all     all
>     4,622,992      4,366,150     -256,842 (  5.6%)     ipvs-e:0:1  [kernel.kallsyms]  all     all
>   180,226,857    164,665,271  -15,561,586 (  8.6%)     ipvs-e:0:0  [kernel.kallsyms]  all     all

The most significant component was the time spent in ip_vs_chain_estimation():
> #        Util1          Util2                    Diff  Command     Shared Object      Symbol                    CPU
> 1,124,414,110  1,143,352,233   18,938,123 (  1.7%)     ipvs-e:0:0  [ip_vs]            ip_vs_chain_estimation    all
>    16,291,044     16,574,498      283,454 (  1.7%)     ipvs-e:0:1  [ip_vs]            ip_vs_chain_estimation    all
>       189,230        279,254       90,024 ( 47.6%)     ipvs-e:0:1  [ip_vs]            ip_vs_estimation_kthread  all
>       518,686        515,718       -2,968 (  0.6%)     ipvs-e:0:0  [ip_vs]            ip_vs_estimation_kthread  all
>     1,562,512      1,377,609     -184,903 ( 11.8%)     ipvs-e:0:0  [kernel.kallsyms]  kthread_should_stop       all
>     2,435,803      2,138,404     -297,399 ( 12.2%)     ipvs-e:0:1  [kernel.kallsyms]  _find_next_bit            all
>    16,304,577     15,786,553     -518,024 (  3.2%)     ipvs-e:0:0  [kernel.kallsyms]  _raw_spin_lock            all
>   151,110,969    137,172,160  -13,938,809 (  9.2%)     ipvs-e:0:0  [kernel.kallsyms]  _find_next_bit            all
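To make the annotation below easier to map back to the source, the hot loop corresponds roughly to the following C (a simplified sketch with approximate structure and field names, not a verbatim copy of the v6 code). On 64-bit, u64_stats_read() boils down to a READ_ONCE() of the 8-byte counter and the syncp begin/retry pair compiles away, so each counter read is one of the mov instructions and each accumulation is one of the adds:

        /* Rough sketch of the per-CPU summation in ip_vs_chain_estimation();
         * s points to the estimator's ip_vs_stats, names are approximate.
         */
        u64 kconns = 0, kinpkts = 0, koutpkts = 0;
        u64 kinbytes = 0, koutbytes = 0;
        int i;

        for_each_possible_cpu(i) {
                struct ip_vs_cpu_stats *c = per_cpu_ptr(s->cpustats, i);
                u64 conns, inpkts, outpkts, inbytes, outbytes;
                unsigned int start;

                do {
                        start = u64_stats_fetch_begin(&c->syncp);  /* compiles away on 64-bit */
                        conns    = u64_stats_read(&c->cnt.conns);  /* the five movs */
                        inpkts   = u64_stats_read(&c->cnt.inpkts);
                        outpkts  = u64_stats_read(&c->cnt.outpkts);
                        inbytes  = u64_stats_read(&c->cnt.inbytes);
                        outbytes = u64_stats_read(&c->cnt.outbytes);
                } while (u64_stats_fetch_retry(&c->syncp, start)); /* compiles away on 64-bit */

                kconns    += conns;                                /* the five adds */
                kinpkts   += inpkts;
                koutpkts  += outpkts;
                kinbytes  += inbytes;
                koutbytes += outbytes;
        }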
The disassembly shows that it is the mov instructions (there is a skid of 1 instruction) that take the most time:
> Percent | Source code & Disassembly of kcore for bus-cycles (55354 samples, percent: local period)
>         : ffffffffc0bad980 <ip_vs_chain_estimation>:
>          v6 patchset                             v6 patchset + ip_vs_chain_estimation() change
> ---------------------------------------------------------------------------------------------------
>  0.00 : nopl 0x0(%rax,%rax,1)                     0.00 : nopl 0x0(%rax,%rax,1)
>  0.00 : push %r15                                 0.00 : push %r15
>  0.00 : push %r14                                 0.00 : push %r14
>  0.00 : push %r13                                 0.00 : push %r13
>  0.00 : push %r12                                 0.00 : push %r12
>  0.00 : push %rbp                                 0.00 : push %rbp
>  0.00 : push %rbx                                 0.00 : push %rbx
>  0.00 : sub $0x8,%rsp                             0.00 : sub $0x8,%rsp
>  0.00 : mov (%rdi),%r15                           0.00 : mov (%rdi),%r15
>  0.00 : test %r15,%r15                            0.00 : test %r15,%r15
>  0.00 : je 0xffffffffc0badaf1                     0.00 : je 0xffffffffc0b7faf1
>  0.00 : call 0xffffffff8bcd33a0                   0.00 : call 0xffffffffa1ed33a0
>  0.00 : test %al,%al                              0.00 : test %al,%al
>  0.00 : jne 0xffffffffc0badaf1                    0.00 : jne 0xffffffffc0b7faf1
>  0.00 : mov -0x334ccb22(%rip),%esi                0.00 : mov -0x1d29eb22(%rip),%esi
>  0.00 : xor %eax,%eax                             0.00 : xor %eax,%eax
>  0.00 : xor %ebx,%ebx                             0.00 : xor %ebx,%ebx
>  0.00 : xor %ebp,%ebp                             0.00 : xor %ebp,%ebp
>  0.00 : xor %r12d,%r12d                           0.00 : xor %r12d,%r12d
>  0.00 : xor %r13d,%r13d                           0.00 : xor %r13d,%r13d
>  0.05 : xor %r14d,%r14d                           0.08 : xor %r14d,%r14d
>  0.00 : jmp 0xffffffffc0bad9f7                    0.00 : jmp 0xffffffffc0b7f9f7
>  0.14 : movslq %eax,%rdx                          0.08 : movslq %eax,%rdx
>  0.00 : mov 0x68(%r15),%rcx                       0.00 : mov 0x68(%r15),%rcx
> 11.92 : add $0x1,%eax                            12.25 : add $0x1,%eax
>  0.00 : add -0x72ead560(,%rdx,8),%rcx             0.00 : add -0x5ccad560(,%rdx,8),%rcx
>  4.07 : mov (%rcx),%r11                           3.84 : mov (%rcx),%rdx
> 34.28 : mov 0x8(%rcx),%r10                       34.17 : add %rdx,%r14
> 10.35 : mov 0x10(%rcx),%r9                        2.62 : mov 0x8(%rcx),%rdx
>  7.28 : mov 0x18(%rcx),%rdi                       9.33 : add %rdx,%r13
>  7.70 : mov 0x20(%rcx),%rdx                       1.82 : mov 0x10(%rcx),%rdx
>  6.80 : add %r11,%r14                             6.11 : add %rdx,%r12
>  0.27 : add %r10,%r13                             1.26 : mov 0x18(%rcx),%rdx
>  0.36 : add %r9,%r12                              6.00 : add %rdx,%rbp
>  2.24 : add %rdi,%rbp                             2.97 : mov 0x20(%rcx),%rdx
>  1.21 : add %rdx,%rbx                             5.78 : add %rdx,%rbx
>  0.07 : movslq %eax,%rdx                          1.08 : movslq %eax,%rdx
>  0.35 : mov $0xffffffff8d6e0860,%rdi              0.00 : mov $0xffffffffa38e0860,%rdi
>  2.26 : call 0xffffffff8c16e270                   2.62 : call 0xffffffffa236e270
>  0.17 : mov -0x334ccb7c(%rip),%esi                0.11 : mov -0x1d29eb7c(%rip),%esi
>  0.00 : cmp %eax,%esi                             0.00 : cmp %eax,%esi
>  0.00 : ja 0xffffffffc0bad9c3                     0.00 : ja 0xffffffffc0b7f9c3
>  0.00 : lea 0x70(%r15),%rdx                       0.00 : lea 0x70(%r15),%rdx
>  0.00 : mov %rdx,%rdi                             0.00 : mov %rdx,%rdi
>  0.04 : mov %rdx,(%rsp)                           0.05 : mov %rdx,(%rsp)
>  0.00 : call 0xffffffff8c74d800                   0.00 : call 0xffffffffa294d800
>  0.06 : mov %r14,%rax                             0.07 : mov %r14,%rax
>  0.00 : sub 0x20(%r15),%rax                       0.00 : sub 0x20(%r15),%rax
>  9.55 : mov 0x38(%r15),%rcx                       8.97 : mov 0x38(%r15),%rcx
>  0.01 : mov (%rsp),%rdx                           0.03 : mov (%rsp),%rdx

The results indicate that the extra add instructions should not matter - the memory accesses (the mov instructions) are the bottleneck.
-- 
Jiri Wiesner
SUSE Labs