On Wed, Mar 25, 2020 at 3:16 PM Peter Cordes <peter@xxxxxxxxx> wrote: > > Nice optimization, and yes converting to hex is something CPUs can do > really efficiently. Not surprising that scanf and other overhead was > a huge part of the total time. I have some suggestions to make it > even more efficient: Thanks, I will submit a new version. > > * Use static const for your lookup tables (avoiding GOT overhead vs. > globals), and make them only 16 bytes not 36 done > * Make the helper function take its first 2 args in the same order as > the external API so it doesn't need to shuffle registers around before > tailcalling, just put that 3rd arg in a register and jmp. (The > function is big enough that GCC chooses not to inline it into all both > different callers at -O2, but will at -O3.) done > * The variant with the fixed default could be defined as an alias for > the existing one, I think, so it can have the same address as the > lower or upper function. This saves code size by not having another > lea / jmp, or worse a 3rd copy of the whole function if it inlines. I have done it using __attribute__ alias like: void uuid_unparse(const uuid_t uu, char *out) __attribute__((alias("uuid_unparse_lower"))); > * Avoid extra loads of the source data by reading into a tmp var, > instead of re-accessing the same uuid[i] twice, the 2nd time after a > store to *p which might overlap; the compiler can't prove otherwise. Done > > If you really are bottlenecking on UUID throughput, see my SIMD answer > on https://stackoverflow.com/questions/53823756/how-to-convert-a-binary-integer-number-to-a-hex-string > with x86 SSE2 (baseline for x86-64), SSSE3, AVX2 variable-shift, and > AVX512VBMI integer -> hex manual vectorization that can do 8 input > bytes -> 16 hex digits at once. Or with YMM vectors, 32 hex digits. > The asm should be straightforward to translate to intrinsics. (Remove > the part that reverses the byte-order from x86 little-endian to > printing order, since uuid bytes are apparently already in the right > order). You'd need some shuffling to store to the right places around > the '-' formatting but that's doable. Using an 8-byte store that > overlaps where you want a '-', *then* storing the '-', then a 4-byte > store of the bottom of the register, could work well to avoid one > shuffle with SSE2. Or with SSSE3 or higher, use pshufb to shuffle in > the '-' bytes I will take a look at it, but in a second time, I get your idea. I am not familiar with this, nice way to jumb on SIMD operations. > > On Tue, Mar 24, 2020 at 6:35 PM Aurelien LAJOIE <orel@xxxxxxxxx> wrote: > > > > > libuuid/src/unparse.c | 35 +++++++++++++++++------------------ > > 1 file changed, 17 insertions(+), 18 deletions(-) > > > > diff --git a/libuuid/src/unparse.c b/libuuid/src/unparse.c > > index a95bbb042..62bb3ef26 100644 > > --- a/libuuid/src/unparse.c > > +++ b/libuuid/src/unparse.c > > @@ -36,41 +36,40 @@ > > > > #include "uuidP.h" > > > > -static const char *fmt_lower = > > - "%08x-%04x-%04x-%02x%02x-%02x%02x%02x%02x%02x%02x"; > > - > > -static const char *fmt_upper = > > - "%08X-%04X-%04X-%02X%02X-%02X%02X%02X%02X%02X%02X"; > > +char const __str_digits_lower[36] = "0123456789abcdefghijklmnopqrstuvwxyz"; > > +char const __str_digits_upper[36] = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"; > > > Shouldn't these be static, not global, like the old format strings > were? Unless/until we want them somewhere else as well, these should > probably be private, and only 16 bytes each because the only current > use is for mapping a nibble -> hex ASCII. > > Also, leading double underscore identifiers are reserved for use by > the C implementation. If we want to use a global name, it should > probably be uuid__str_digits_lower or something like that. And if > not, the name can be hexdigits_lower / upper. > > > > > -static void uuid_unparse_x(const uuid_t uu, char *out, const char *fmt) > > +static void uuid_fmt(char *buf, const uuid_t uuid, char const fmt[36]) > > { > > + char *p = buf; > > + for (int i = 0; i < 16; i++) { > > + if (i == 4 || i == 6 || i == 8 || i == 10) { > > + *p++ = '-'; > > + } > > + *p++ = fmt[uuid[i] >> 4]; > > + *p++ = fmt[uuid[i] & 15]; > > > It's slightly more efficient to load into unsigned tmp; the compiler > can't prove that buf and uuid don't overlap so it actually reloads for > the 2nd statement. This is bad because we're pretty much going to > bottleneck or close to it on throughput of load/store instructions, on > a typical modern x86 for example. > > char *restrict out would solve the same problem, and presumably no > caller would ever pass overlapping buffers. And if they did, would > rather have this function read the original bytes instead of reloading > hex-digit bytes as binary UUID bytes. Although at that point we're > into UB territory. > > > > > + } > > + *p = '\0'; > > } > > > > void uuid_unparse_lower(const uuid_t uu, char *out) > > { > > - uuid_unparse_x(uu, out, fmt_lower); > > + uuid_fmt(out, uu, __str_digits_lower); > > } > > > > Best to have uuid_fmt take args in the same order as these wrappers, > so if a compiler decides not to inline it, the wrapper functions can > just put the digit-table pointer into another register and tailcall. > For example, your version compiles like this for x86-64: (on the > Godbolt compiler explorer, clang -O2 -fPIC) > > uuid_unparse_lower_orig: > mov rax, rdi > lea rdx, [rip + hexdigits_lower] # add a 3rd arg > mov rdi, rsi > mov rsi, rax # swap the first 2 args > jmp uuid_fmt_orig # TAILCALL > > vs. > > uuid_unparse_lower: > lea rdx, hexdigits_lower[rip] # add a 3rd arg > jmp uuid_fmt # and tailcall > > uuid_t is a typedef for an array of unsigned char[16] so as a function > arg it's just a pointer. > > https://godbolt.org/z/fcZFhi is what I've been playing around with, in > case I don't get back to this and actually make a patch myself.