Re: [PATCH] libuuid: improve uuid_unparse() performance

Peter Cordes <peter@xxxxxxxxx> · Wed, 25 Mar 2020 11:16:18 -0300

Nice optimization, and yes converting to hex is something CPUs can do
really efficiently.  Not surprising that scanf and other overhead was
a huge part of the total time.  I have some suggestions to make it
even more efficient:

* Use static const for your lookup tables (avoiding GOT overhead vs.
globals), and make them only 16 bytes not 36
* Make the helper function take its first 2 args in the same order as
the external API so it doesn't need to shuffle registers around before
tailcalling, just put that 3rd arg in a register and jmp.  (The
function is big enough that GCC chooses not to inline it into all both
different callers at -O2, but will at -O3.)
* The variant with the fixed default could be defined as an alias for
the existing one, I think, so it can have the same address as the
lower or upper function.  This saves code size by not having another
lea / jmp, or worse a 3rd copy of the whole function if it inlines.
* Avoid extra loads of the source data by reading into a tmp var,
instead of re-accessing the same uuid[i] twice, the 2nd time after a
store to *p which might overlap; the compiler can't prove otherwise.

If you really are bottlenecking on UUID throughput, see my SIMD answer
on https://stackoverflow.com/questions/53823756/how-to-convert-a-binary-integer-number-to-a-hex-string
with x86 SSE2 (baseline for x86-64), SSSE3, AVX2 variable-shift, and
AVX512VBMI integer -> hex manual vectorization that can do 8 input
bytes -> 16 hex digits at once.  Or with YMM vectors, 32 hex digits.
The asm should be straightforward to translate to intrinsics.  (Remove
the part that reverses the byte-order from x86 little-endian to
printing order, since uuid bytes are apparently already in the right
order).  You'd need some shuffling to store to the right places around
the '-' formatting but that's doable.  Using an 8-byte store that
overlaps where you want a '-', *then* storing the '-', then a 4-byte
store of the bottom of the register, could work well to avoid one
shuffle with SSE2.  Or with SSSE3 or higher, use pshufb to shuffle in
the '-' bytes

On Tue, Mar 24, 2020 at 6:35 PM Aurelien LAJOIE <orel@xxxxxxxxx> wrote:

>
>  libuuid/src/unparse.c | 35 +++++++++++++++++------------------
>  1 file changed, 17 insertions(+), 18 deletions(-)
>
> diff --git a/libuuid/src/unparse.c b/libuuid/src/unparse.c
> index a95bbb042..62bb3ef26 100644
> --- a/libuuid/src/unparse.c
> +++ b/libuuid/src/unparse.c
> @@ -36,41 +36,40 @@
>
>  #include "uuidP.h"
>
> -static const char *fmt_lower =
> -       "%08x-%04x-%04x-%02x%02x-%02x%02x%02x%02x%02x%02x";
> -
> -static const char *fmt_upper =
> -       "%08X-%04X-%04X-%02X%02X-%02X%02X%02X%02X%02X%02X";
> +char const __str_digits_lower[36] = "0123456789abcdefghijklmnopqrstuvwxyz";
> +char const __str_digits_upper[36] = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";

Shouldn't these be static, not global, like the old format strings
were?  Unless/until we want them somewhere else as well, these should
probably be private, and only 16 bytes each because the only current
use is for mapping a nibble -> hex ASCII.

Also, leading double underscore identifiers are reserved for use by
the C implementation.  If we want to use a global name, it should
probably be uuid__str_digits_lower or something like that.  And if
not, the name can be hexdigits_lower / upper.

>
> -static void uuid_unparse_x(const uuid_t uu, char *out, const char *fmt)
> +static void uuid_fmt(char *buf, const uuid_t uuid, char const fmt[36])
>  {
> +       char *p = buf;
> +       for (int i = 0; i < 16; i++) {
> +               if (i == 4 || i == 6 || i == 8 || i == 10) {
> +                       *p++ = '-';
> +               }
> +               *p++ = fmt[uuid[i] >> 4];
> +               *p++ = fmt[uuid[i] & 15];

It's slightly more efficient to load into unsigned tmp; the compiler
can't prove that buf and uuid don't overlap so it actually reloads for
the 2nd statement.  This is bad because we're pretty much going to
bottleneck or close to it on throughput of load/store instructions, on
a typical modern x86 for example.

char *restrict out would solve the same problem, and presumably no
caller would ever pass overlapping buffers.  And if they did, would
rather have this function read the original bytes instead of reloading
hex-digit bytes as binary UUID bytes.  Although at that point we're
into UB territory.

>
> +       }
> +       *p = '\0';
>  }
>
>  void uuid_unparse_lower(const uuid_t uu, char *out)
>  {
> -       uuid_unparse_x(uu, out, fmt_lower);
> +       uuid_fmt(out, uu, __str_digits_lower);
>  }
>

Best to have uuid_fmt take args in the same order as these wrappers,
so if a compiler decides not to inline it, the wrapper functions can
just put the digit-table pointer into another register and tailcall.
For example, your version compiles like this for x86-64: (on the
Godbolt compiler explorer, clang -O2 -fPIC)

uuid_unparse_lower_orig:
    mov rax, rdi
    lea rdx, [rip + hexdigits_lower]    # add a 3rd arg
    mov rdi, rsi
    mov rsi, rax             # swap the first 2 args
    jmp uuid_fmt_orig # TAILCALL

vs.

uuid_unparse_lower:
    lea rdx, hexdigits_lower[rip]          # add a 3rd arg
    jmp uuid_fmt                          # and tailcall

uuid_t is a typedef for an array of unsigned char[16] so as a function
arg it's just a pointer.

https://godbolt.org/z/fcZFhi is what I've been playing around with, in
case I don't get back to this and actually make a patch myself.