On Mon, Apr 20, 2020 at 09:54:17PM -0400, Steven Rostedt wrote:
> On Tue, 21 Apr 2020 00:27:23 +0300
> Andy Shevchenko <andriy.shevchenko@xxxxxxxxxxxxxxx> wrote:
> 
> > > > TODO
> > > > 	benchmark with mainline because nouveau is broken for me -(
> > > > 	vsnprintf() changes make the code slower
> > > 
> > > Exactly main point of this exercise. I don't believe that algos in vsprintf.c
> > > are too dumb to use division per digit (yes, division by constant which is not
> > > power of two is a heavy operation).
> > 
> > And second point here, why not to use existing algos from vsprintf.c?
> 
> Exactly. The code in _print_integer_u32() doesn't look as fast as the
> code in vsprintf() that happens to use lookup tables and converts
> without any loops.
> 
> Hint, loops are bad, they cause the CPU to slow down.

Oh, come on! Loops make code fit into the icache and the μop decode cache.

> Anyway, this patch series would require a pretty good improvement, as
> the code replacing the sprintf() usages is pretty ugly compared to a
> simple sprintf() call.

No! Fast code must look ugly. Or, in other words, if you try to optimise
integer printing to death, you'll probably end up with something like
_print_integer().

When the very first patch changed /proc/stat to use seq_put_decimal_ull(),
the speedup was 66% (or 33%). That's how slow printing was back then.
It can be made slightly faster even now.

> Randomly picking patch 6:
> 
> 	static int loadavg_proc_show(struct seq_file *m, void *v)
> 	{
> 		unsigned long avnrun[3];
> 
> 		get_avenrun(avnrun, FIXED_1/200, 0);
> 
> 		seq_printf(m, "%lu.%02lu %lu.%02lu %lu.%02lu %u/%d %d\n",
> 			LOAD_INT(avnrun[0]), LOAD_FRAC(avnrun[0]),
> 			LOAD_INT(avnrun[1]), LOAD_FRAC(avnrun[1]),
> 			LOAD_INT(avnrun[2]), LOAD_FRAC(avnrun[2]),
> 			nr_running(), nr_threads,
> 			idr_get_cursor(&task_active_pid_ns(current)->idr) - 1);
> 		return 0;
> 	}
> 
> *vs*
> 
> 	static int loadavg_proc_show(struct seq_file *m, void *v)
> 	{
> 		unsigned long avnrun[3];
> 		char buf[3 * (LEN_UL + 1 + 2 + 1) + 10 + 1 + 10 + 1 + 10 + 1];
> 		char *p = buf + sizeof(buf);
> 		int i;
> 
> 		*--p = '\n';
> 		p = _print_integer_u32(p, idr_get_cursor(&task_active_pid_ns(current)->idr) - 1);
> 		*--p = ' ';
> 		p = _print_integer_u32(p, nr_threads);
> 		*--p = '/';
> 		p = _print_integer_u32(p, nr_running());
> 
> 		get_avenrun(avnrun, FIXED_1/200, 0);
> 		for (i = 2; i >= 0; i--) {
> 			*--p = ' ';
> 			--p;		/* overwritten */
> 			*--p = '0';	/* conditionally overwritten */
> 			(void)_print_integer_u32(p + 2, LOAD_FRAC(avnrun[i]));
> 			*--p = '.';
> 			p = _print_integer_ul(p, LOAD_INT(avnrun[i]));
> 		}
> 
> 		seq_write(m, p, buf + sizeof(buf) - p);
> 		return 0;
> 	}
> 
> I much rather keep the first version.

I did the benchmarks (without the stack protector, though): everything
except /proc/cpuinfo and /proc/meminfo became faster. This requires
investigation, and I can drop the vsprintf() changes until then.

Now, given that the /proc/uptime format is cast in stone, the code may
look a bit ugly and unusual, but it won't require maintenance.
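
P.S. To make the discussion concrete: the whole _print_integer_u32() trick
is emitting decimal digits backwards, least significant first, and
returning a pointer to the most significant digit. Roughly like this (a
sketch only; the actual helper in the series may differ in details such as
zero handling, with u32 as in <linux/types.h>):

	/*
	 * Print "x" in decimal, going backwards from "p".
	 * The caller reserves enough room (up to 10 digits for a u32)
	 * and gets back a pointer to the first digit. The do/while
	 * emits at least one digit, so 0 prints as "0".
	 */
	static char *_print_integer_u32(char *p, u32 x)
	{
		do {
			*--p = '0' + x % 10;
			x /= 10;
		} while (x);
		return p;
	}

Call sites then stitch separators in with "*--p = c" and hand the final
span to seq_write() in one go, which is exactly what the loadavg code
above does.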
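
P.P.S. On "division by constant which is not power of two is a heavy
operation": any recent compiler strength-reduces the unsigned x % 10 and
x / 10 pair above into a multiplication by a precomputed reciprocal plus a
shift, so no hardware divide runs per digit. A hand-written equivalent for
32-bit unsigned division by 10 (the standard magic-number form, see
Hacker's Delight; shown for illustration only, not code from the series):

	/*
	 * x / 10 for any u32 x, without a divide instruction:
	 * multiply by ceil(2^35 / 10) = 0xCCCCCCCD, shift right by 35.
	 * The 33-bit constant times a 32-bit x still fits in a u64.
	 */
	static inline u32 div10(u32 x)
	{
		return ((u64)x * 0xCCCCCCCDull) >> 35;
	}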