On Tue, 21 Apr 2020 19:49:24 +0300 Alexey Dobriyan <adobriyan@xxxxxxxxx> wrote: > > Exactly. The code in _print_integer_u32() doesn't look as fast as the > > code in vsprintf() that happens to use lookup tables and converts > > without any loops. > > > > Hint, loops are bad, they cause the CPU to slow down. > > Oh, come on! Loops make code fit into icache and μop decode cache. Depends on the architecture. > > > Anyway, this patch series would require a pretty good improvement, as > > the code replacing the sprintf() usages is pretty ugly compared to a > > simple sprintf() call. > > No! Fast code must look ugly. Or in other words if you try to optimise > integer printing to death you'll probably end with something like > _print_integer(). As I stated, it will require a "pretty good improvement". There's always a trade off. If the improvement is noticeable for real life cases, then ugly code is worth it. If we are making ugly code for a benefit of something that never shows outside of noise, then the net cost (less maintainable code), is not worth it. > > When the very first patch changed /proc/stat to seq_put_decimal_ull() > the speed up was 66% (or 33%). That's how slow printing was back then. > It can be made slightly faster even now. I'd like to see the tests that your ran (to reproduce them myself). The first patch was making a 64bit number into 32bit number, thus shortening the work by half. > > > Randomly picking patch 6: > > > > static int loadavg_proc_show(struct seq_file *m, void *v) > > { > > unsigned long avnrun[3]; > > > > get_avenrun(avnrun, FIXED_1/200, 0); > > > > seq_printf(m, "%lu.%02lu %lu.%02lu %lu.%02lu %u/%d %d\n", > > LOAD_INT(avnrun[0]), LOAD_FRAC(avnrun[0]), > > LOAD_INT(avnrun[1]), LOAD_FRAC(avnrun[1]), > > LOAD_INT(avnrun[2]), LOAD_FRAC(avnrun[2]), > > nr_running(), nr_threads, > > idr_get_cursor(&task_active_pid_ns(current)->idr) - 1); > > return 0; > > } > > > > *vs* > > > > static int loadavg_proc_show(struct seq_file *m, void *v) > > { > > unsigned long avnrun[3]; > > char buf[3 * (LEN_UL + 1 + 2 + 1) + 10 + 1 + 10 + 1 + 10 + 1]; > > char *p = buf + sizeof(buf); > > int i; > > > > *--p = '\n'; > > p = _print_integer_u32(p, idr_get_cursor(&task_active_pid_ns(current)->idr) - 1); > > *--p = ' '; > > p = _print_integer_u32(p, nr_threads); > > *--p = '/'; > > p = _print_integer_u32(p, nr_running()); > > > > get_avenrun(avnrun, FIXED_1/200, 0); > > for (i = 2; i >= 0; i--) { > > *--p = ' '; > > --p; /* overwritten */ > > *--p = '0'; /* conditionally overwritten */ > > (void)_print_integer_u32(p + 2, LOAD_FRAC(avnrun[i])); > > *--p = '.'; > > p = _print_integer_ul(p, LOAD_INT(avnrun[i])); > > } > > > > seq_write(m, p, buf + sizeof(buf) - p); > > return 0; > > } > > > > > > I much rather keep the first version. > > I did the benchmarks (without stack protector though), everything except > /proc/cpuinfo and /proc/meminfo became faster. This requires investigation > and I can drop vsprintf() changes until then. > > Now given that /proc/uptime format cast in stone, code may look a bit ugly > and unusual but it won't require maintainance Please share what you did as your benchmarks. If this is as good of a performance as you claim, then these changes would be worth looking into. So I applied your entire series, added the following patch: diff --git a/include/linux/spinlock_api_smp.h b/include/linux/spinlock_api_smp.h index 19a9be9d97ee..17c582d77ab7 100644 --- a/include/linux/spinlock_api_smp.h +++ b/include/linux/spinlock_api_smp.h @@ -152,9 +152,16 @@ static inline void __raw_spin_unlock(raw_spinlock_t *lock) preempt_enable(); } +extern u64 sched_clock(void); static inline void __raw_spin_unlock_irqrestore(raw_spinlock_t *lock, unsigned long flags) { + char buf[32]; + u64 start, stop; + start = sched_clock(); + sprintf(buf,"%lld", (unsigned long)lock); + stop = sched_clock(); + trace_printk("time: %lld '%s'\n", stop - start, buf); spin_release(&lock->dep_map, _RET_IP_); do_raw_spin_unlock(lock); local_irq_restore(flags); Then after boot up, I did the following: # trace-cmd stop # trace-cmd extract Which captured the traces: <idle>-0 [003] 5.405208: bprint: _raw_spin_unlock_irqrestore: time: 799 '-110308271193088' <idle>-0 [003] 5.405210: bprint: _raw_spin_unlock_irqrestore: time: 273 '-110308271193088' <idle>-0 [003] 5.412235: bprint: _raw_spin_unlock_irqrestore: time: 1138 '-110308271193088' <idle>-0 [003] 5.412236: bprint: _raw_spin_unlock_irqrestore: time: 213 '-110308271193088' <idle>-0 [003] 5.414241: bprint: _raw_spin_unlock_irqrestore: time: 1094 '-110308271193088' <idle>-0 [003] 5.414243: bprint: _raw_spin_unlock_irqrestore: time: 182 '-110308271193088' <idle>-0 [003] 5.418241: bprint: _raw_spin_unlock_irqrestore: time: 1113 '-110308271193088' Where "time: X", X is the delta in nanoseconds around the sprintf() call. The I ran the attached perl program on the output, and got the following results: Before your patches: # trace-cmd report | ./report.pl full_total = 52844823 average: 255.902176229032 std: 439.269729814847 And with your patches: # trace-cmd report | ./report.pl full_total = 84725476 average: 407.873274762306 std: 555.755670463724 As the standard deviation is bigger than the average, it appears to be all in the noise. Then I decided to see if it affects "ps -eux" Original: # perf stat -r 100 ps -eux > /dev/null Performance counter stats for 'ps -eux' (100 runs): 8.92 msec task-clock # 0.937 CPUs utilized ( +- 0.90% ) 5 context-switches # 0.545 K/sec ( +- 1.24% ) 0 cpu-migrations # 0.000 K/sec 259 page-faults # 0.029 M/sec ( +- 0.07% ) 32,973,751 cycles # 3.698 GHz ( +- 0.09% ) 17,254,307 stalled-cycles-frontend # 52.33% frontend cycles idle ( +- 0.17% ) 38,707,960 instructions # 1.17 insn per cycle # 0.45 stalled cycles per insn ( +- 0.01% ) 8,153,197 branches # 914.274 M/sec ( +- 0.01% ) 114,992 branch-misses # 1.41% of all branches ( +- 0.12% ) 0.0095170 +- 0.0000829 seconds time elapsed ( +- 0.87% ) With your patches: # perf stat -r 100 ps -eux > /dev/null Performance counter stats for 'ps -eux' (100 runs): 8.86 msec task-clock # 0.918 CPUs utilized ( +- 1.06% ) 5 context-switches # 0.527 K/sec ( +- 1.22% ) 0 cpu-migrations # 0.001 K/sec ( +-100.00% ) 259 page-faults # 0.029 M/sec ( +- 0.08% ) 32,699,168 cycles # 3.692 GHz ( +- 0.12% ) 16,995,861 stalled-cycles-frontend # 51.98% frontend cycles idle ( +- 0.21% ) 38,114,396 instructions # 1.17 insn per cycle # 0.45 stalled cycles per insn ( +- 0.03% ) 7,985,526 branches # 901.625 M/sec ( +- 0.03% ) 112,852 branch-misses # 1.41% of all branches ( +- 0.17% ) 0.009652 +- 0.000276 seconds time elapsed ( +- 2.86% ) Not really much difference. Then what about just catting /proc/$$/stat, and do a 1000 runs! Original: # perf stat -r 1000 cat /proc/$$/stat > /dev/null Performance counter stats for 'cat /proc/1622/stat' (1000 runs): 0.34 msec task-clock # 0.680 CPUs utilized ( +- 0.21% ) 0 context-switches # 0.071 K/sec ( +- 20.18% ) 0 cpu-migrations # 0.000 K/sec 65 page-faults # 0.192 M/sec ( +- 0.07% ) 993,486 cycles # 2.934 GHz ( +- 0.07% ) 577,903 stalled-cycles-frontend # 58.17% frontend cycles idle ( +- 0.09% ) 936,489 instructions # 0.94 insn per cycle # 0.62 stalled cycles per insn ( +- 0.07% ) 202,912 branches # 599.213 M/sec ( +- 0.07% ) 6,976 branch-misses # 3.44% of all branches ( +- 0.11% ) 0.00049797 +- 0.00000111 seconds time elapsed ( +- 0.22% ) With your patches: # perf stat -r 1000 cat /proc/$$/stat > /dev/null Performance counter stats for 'cat /proc/1624/stat' (1000 runs): 0.34 msec task-clock # 0.664 CPUs utilized ( +- 0.23% ) 0 context-switches # 0.018 K/sec ( +- 40.72% ) 0 cpu-migrations # 0.000 K/sec 65 page-faults # 0.190 M/sec ( +- 0.07% ) 988,430 cycles # 2.892 GHz ( +- 0.07% ) 574,841 stalled-cycles-frontend # 58.16% frontend cycles idle ( +- 0.09% ) 933,852 instructions # 0.94 insn per cycle # 0.62 stalled cycles per insn ( +- 0.07% ) 202,096 branches # 591.297 M/sec ( +- 0.07% ) 6,836 branch-misses # 3.38% of all branches ( +- 0.09% ) 0.00051476 +- 0.00000557 seconds time elapsed ( +- 1.08% ) They are pretty much identical. What's the purpose of all these changes again? There was no cover letter. -- Steve
Attachment:
report.pl
Description: Perl program