On Mon, Apr 20, 2020 at 09:54:17PM -0400, Steven Rostedt wrote:
> On Tue, 21 Apr 2020 00:27:23 +0300
> Andy Shevchenko <andriy.shevchenko@xxxxxxxxxxxxxxx> wrote:
> 
> > > > TODO
> > > > 	benchmark with mainline because nouveau is broken for me -(
> > > > 	vsnprintf() changes make the code slower
> > > 
> > > Exactly main point of this exercise. I don't believe that algos in vsprintf.c
> > > are too dumb to use division per digit (yes, division by constant which is not
> > > power of two is a heavy operation).
> > 
> > And second point here, why not to use existing algos from vsprintf.c?
> 
> Exactly. The code in _print_integer_u32() doesn't look as fast as the
> code in vsprintf() that happens to use lookup tables and converts
> without any loops.
> 
> Hint, loops are bad, they cause the CPU to slow down.

Oh, come on! Loops make code fit into the icache and the μop decode cache.

> Anyway, this patch series would require a pretty good improvement, as
> the code replacing the sprintf() usages is pretty ugly compared to a
> simple sprintf() call.

No! Fast code must look ugly. Or, in other words, if you try to optimise
integer printing to death, you'll probably end up with something like
_print_integer().

When the very first patch changed /proc/stat to use seq_put_decimal_ull(),
the speedup was 66% (or 33%). That's how slow printing was back then.
It can be made slightly faster even now.

> Randomly picking patch 6:
> 
> 	static int loadavg_proc_show(struct seq_file *m, void *v)
> 	{
> 		unsigned long avnrun[3];
> 
> 		get_avenrun(avnrun, FIXED_1/200, 0);
> 
> 		seq_printf(m, "%lu.%02lu %lu.%02lu %lu.%02lu %u/%d %d\n",
> 			LOAD_INT(avnrun[0]), LOAD_FRAC(avnrun[0]),
> 			LOAD_INT(avnrun[1]), LOAD_FRAC(avnrun[1]),
> 			LOAD_INT(avnrun[2]), LOAD_FRAC(avnrun[2]),
> 			nr_running(), nr_threads,
> 			idr_get_cursor(&task_active_pid_ns(current)->idr) - 1);
> 		return 0;
> 	}
> 
> *vs*
> 
> 	static int loadavg_proc_show(struct seq_file *m, void *v)
> 	{
> 		unsigned long avnrun[3];
> 		char buf[3 * (LEN_UL + 1 + 2 + 1) + 10 + 1 + 10 + 1 + 10 + 1];
> 		char *p = buf + sizeof(buf);
> 		int i;
> 
> 		*--p = '\n';
> 		p = _print_integer_u32(p, idr_get_cursor(&task_active_pid_ns(current)->idr) - 1);
> 		*--p = ' ';
> 		p = _print_integer_u32(p, nr_threads);
> 		*--p = '/';
> 		p = _print_integer_u32(p, nr_running());
> 
> 		get_avenrun(avnrun, FIXED_1/200, 0);
> 		for (i = 2; i >= 0; i--) {
> 			*--p = ' ';
> 			--p;		/* overwritten */
> 			*--p = '0';	/* conditionally overwritten */
> 			(void)_print_integer_u32(p + 2, LOAD_FRAC(avnrun[i]));
> 			*--p = '.';
> 			p = _print_integer_ul(p, LOAD_INT(avnrun[i]));
> 		}
> 
> 		seq_write(m, p, buf + sizeof(buf) - p);
> 		return 0;
> 	}
> 
> I much rather keep the first version.

I did the benchmarks (without the stack protector, though): everything
except /proc/cpuinfo and /proc/meminfo became faster. This requires
investigation, and I can drop the vsprintf() changes until then.

Now, given that the /proc/uptime format is cast in stone, the code may
look a bit ugly and unusual, but it won't require maintenance.
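
P.S. To make the discussion concrete: the whole _print_integer_u32() trick
is emitting decimal digits backwards, least significant first, and
returning a pointer to the most significant digit. Roughly like this (a
sketch only; the actual helper in the series may differ in details such as
zero handling, with u32 as in <linux/types.h>):

	/*
	 * Print "x" in decimal, going backwards from "p".
	 * The caller reserves enough room (up to 10 digits for a u32)
	 * and gets back a pointer to the first digit. The do/while
	 * emits at least one digit, so 0 prints as "0".
	 */
	static char *_print_integer_u32(char *p, u32 x)
	{
		do {
			*--p = '0' + x % 10;
			x /= 10;
		} while (x);
		return p;
	}

Call sites then stitch separators in with "*--p = c" and hand the final
span to seq_write() in one go, which is exactly what the loadavg code
above does.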
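
P.P.S. On "division by constant which is not power of two is a heavy
operation": any recent compiler strength-reduces the unsigned x % 10 and
x / 10 pair above into a multiplication by a precomputed reciprocal plus a
shift, so no hardware divide runs per digit. A hand-written equivalent for
32-bit unsigned division by 10 (the standard magic-number form, see
Hacker's Delight; shown for illustration only, not code from the series):

	/*
	 * x / 10 for any u32 x, without a divide instruction:
	 * multiply by ceil(2^35 / 10) = 0xCCCCCCCD, shift right by 35.
	 * The 33-bit constant times a 32-bit x still fits in a u64.
	 */
	static inline u32 div10(u32 x)
	{
		return ((u64)x * 0xCCCCCCCDull) >> 35;
	}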