On Tue, Apr 21, 2020 at 12:19:11AM +0300, Andy Shevchenko wrote:
> On Mon, Apr 20, 2020 at 11:57:31PM +0300, Alexey Dobriyan wrote:
> > Time honored way to print integers via vsnprintf() or equivalent has
> > unavoidable slowdown of parsing format string. This can't be fixed in C,
> > without introducing external preprocessor.
> >
> > seq_put_decimal_ull() partially saves the day, but there are a lot of
> > branches inside and overcopying still.
> >
> > _print_integer_*() family of functions is meant to make printing
> > integers as fast as possible by deleting format string parsing and doing
> > as little work as possible.
> >
> > It is based on the following observations:
> >
> > 1) memcpy is done in forward direction
> > 	it can be done backwards but nobody does that,
> >
> > 2) digits can be extracted in a very simple loop which costs only
> > 	1 multiplication and shift (division by constant is not division)
> >
> > All the above asks for the following signature, semantics and pattern of
> > printing out beloved /proc files:
> >
> > 	/* seq_printf(seq, "%u %llu\n", A, B); */
> >
> > 	char buf[10 + 1 + 20 + 1];
> > 	char *p = buf + sizeof(buf);
> >
> > 	*--p = '\n';
> > 	p = _print_integer_u64(p, B);
> > 	*--p = ' ';
> > 	p = _print_integer_u32(p, A);
> >
> > 	seq_write(seq, p, buf + sizeof(buf) - p);
> >
> > 1) stack buffer capable of holding the biggest string is allocated.
> >
> > 2) "p" is pointer to start of the string. Initially it points past
> > 	the end of the buffer WHICH IS NOT NUL-TERMINATED!
> >
> > 3) _print_integer_*() actually prints an integer from right to left
> > 	and returns new start of the string.
> >
> >                      <--------|
> >                            123
> >                            ^
> >                            |
> >                            +-- p
> >
> > 4) 1 character is printed with
> >
> > 	*--p = 'x';
> >
> > 	It generates very efficient code as multiple writes can be
> > 	merged.
> >
> > 5) fixed string is printed with
> >
> > 	p = memcpy(p - 3, "foo", 3);
> >
> > 	Compilers know what memcpy() does and write-combine it.
> > 	4/8-byte writes become 1 instruction and are very efficient.
> >
> > 6) Once everything is printed, the result is written to seq_file buffer.
> > 	It does only one overflow check and 1 copy.
> >
> > This generates very efficient code (and small!).
> >
> > In regular seq_printf() calls, first argument and format string are
> > constantly reloaded. Format string will most likely be loaded with
> > [rip+...] which is quite verbose.
> >
> > seq_put_decimal_ull() will do branches (and even more branches
> > with "width" argument)
> >
>
> > TODO
> > 	benchmark with mainline because nouveau is broken for me -(
> > 	vsnprintf() changes make the code slower
>
> Exactly main point of this exercise. I don't believe that algos in vsprintf.c
> are too dumb to use division per digit (yes, division by constant which is not
> power of two is a heavy operation).

It is not about division.
It is about fucktons of branches in vsprintf().

> > +noinline
> > +char *_print_integer_u32(char *p, u32 x)
> > +{
> > +	do {
> > +		*--p = '0' + (x % 10);
> > +	} while (x /= 10);
> > +	return p;
> > +}
>
> > +noinline
> > +char *_print_integer_u64(char *p, u64 x)
> > +{
> > +	while (x >= 100 * 1000 * 1000) {
> > +		u32 r;
> > +
> > +		x = div_u64_rem(x, 100 * 1000 * 1000, &r);
> > +		p = memset(p - 8, '0', 8);
> > +		(void)_print_integer_u32(p + 8, r);
> > +	}
> > +	return _print_integer_u32(p, x);
> > +}
>
> --
> With Best Regards,
> Andy Shevchenko
>
>
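
For anyone who wants to try the right-to-left pattern outside the kernel,
below is a minimal userspace sketch of the code quoted above. It is not the
patch itself and rests on a few assumptions: the names print_u32_backwards(),
print_u64_backwards() and format_pair() are made up for illustration, plain
/ and % stand in for div_u64_rem(), and instead of calling seq_write() the
helper just returns the start of the string and its length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative stand-in for the quoted _print_integer_u32(): writes the
 * decimal digits of x right to left so that they end just before p, and
 * returns the new start of the string. No NUL terminator is written. */
static char *print_u32_backwards(char *p, uint32_t x)
{
	do {
		*--p = '0' + (x % 10);
	} while (x /= 10);
	return p;
}

/* Illustrative stand-in for the quoted _print_integer_u64(): peels off
 * 8 decimal digits at a time so the inner loop only sees 32-bit values. */
static char *print_u64_backwards(char *p, uint64_t x)
{
	const uint64_t chunk = 100 * 1000 * 1000;	/* 10^8 */

	while (x >= chunk) {
		uint32_t r = (uint32_t)(x % chunk);	/* kernel: div_u64_rem() */

		x /= chunk;
		p -= 8;
		memset(p, '0', 8);			/* zero-pad this 8-digit group */
		(void)print_u32_backwards(p + 8, r);	/* overwrite the low digits */
	}
	return print_u32_backwards(p, (uint32_t)x);	/* x < 10^8 here */
}

/* Hypothetical helper: builds "<a> <b>\n" at the END of buf[0..cap) and
 * returns a pointer to its first character; *len receives the length.
 * In the kernel the last step would be seq_write(seq, p, end - p). */
static const char *format_pair(char *buf, size_t cap,
			       uint32_t a, uint64_t b, size_t *len)
{
	char *end = buf + cap;
	char *p = end;

	/* 10 digits (u32) + ' ' + 20 digits (u64) + '\n' */
	assert(cap >= 10 + 1 + 20 + 1);

	*--p = '\n';
	p = print_u64_backwards(p, b);
	*--p = ' ';
	p = print_u32_backwards(p, a);

	*len = (size_t)(end - p);
	return p;
}
```

The caller owns a stack buffer of at least 32 bytes and must use the returned
pointer together with the length, because the result is not NUL-terminated
and usually does not start at buf[0].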