std::chrono is much slower than native requests...?

Paul Smith <paul@xxxxxxxxxxxxxxxxx> · Sat, 06 Jul 2019 10:12:51 -0400

I'm using GCC 8.1 (built myself) on GNU/Linux for this testing.  The
code is compiled with -O2 -g.

I replaced the arch-specific implementation of a function we use a LOT
which returns a value in microseconds that we compare to get elapsed
time (but never convert into a timestamp) with std::chrono like this:

  using std::chrono;
  return time_point_cast<microseconds>(steady_clock::now())
             .time_since_epoch().count();

Unfortunately after I did this I got pinged by the perf folks that I
had introduced a 24% slowdown in some benchmarks.  These benchmarks are
testing general application operations, and they compute elapsed time
using a Python framework so that computation is not impacted by the
above change.  The slowdown is widespread across many tests.  I can
reproduce this slowdown on my local system.

I've tried it with both steady_clock and system_clock and both show the
same slowdown.

I looked at the code in libstdc++-v3 and as far as I can tell I'm using
CLOCK_MONOTONIC for my steady_clock; this is in libstdc++-v3/config.h:

  #define _GLIBCXX_USE_CLOCK_GETTIME_SYSCALL 1
  #define _GLIBCXX_USE_CLOCK_MONOTONIC 1
  #define _GLIBCXX_USE_CLOCK_REALTIME 1

But if I replace the above code with a direct call to clock_gettime():

  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return ts.tv_sec * 1000000ULL + ts.tv_nsec / 1000;

then I get back all my performance!

Any idea why the std::chrono implementation is so much slower?  I
looked at the implementation and I didn't see anything obvious.

Here's a gdb disassembly of the "fast" clock_gettime() version:

   <+0>:   push   %rbp
   <+1>:   mov    $0x1,%edi
   <+6>:   mov    %rsp,%rbp
   <+9>:   lea    -0x10(%rsp),%rsp
   <+14>:  lea    -0x10(%rbp),%rsi
   <+18>:  callq  0x40c730 <clock_gettime@plt>
   <+23>:  mov    -0x8(%rbp),%rcx
   <+27>:  movabs $0x20c49ba5e353f7cf,%rdx
   <+37>:  imul   $0xf4240,-0x10(%rbp),%rsi
   <+45>:  mov    %rcx,%rax
   <+48>:  imul   %rdx
   <+51>:  sar    $0x3f,%rcx
   <+55>:  leaveq
   <+56>:  sar    $0x7,%rdx
   <+60>:  sub    %rcx,%rdx
   <+63>:  lea    (%rsi,%rdx,1),%rax
   <+67>:  retq
     xchg   %ax,%ax
     nopw   %cs:0x0(%rax,%rax,1)

And here's a disassembly of the slow chrono version:

   <+0>:   push   %rbp
   <+1>:   mov    %rsp,%rbp
   <+4>:   callq  0xe62720 <_ZNSt6chrono3_V212steady_clock3nowEv>
   <+9>:   movabs $0x20c49ba5e353f7cf,%rdx
   <+19>:  mov    %rax,%rcx
   <+22>:  imul   %rdx
   <+25>:  sar    $0x3f,%rcx
   <+29>:  pop    %rbp
   <+30>:  sar    $0x7,%rdx
   <+34>:  mov    %rdx,%rax
   <+37>:  sub    %rcx,%rax
   <+40>:  retq
     nop
     nopw   0x0(%rax,%rax,1)

And of the std::chrono::steady_clock::now() function:

Dump of assembler code for function _ZNSt6chrono3_V212steady_clock3nowEv:
   <+0>:     sub    $0x18,%rsp
   <+4>:     mov    $0x1,%esi
   <+9>:     mov    $0xe4,%edi
   <+14>:    xor    %eax,%eax
   <+16>:    mov    %rsp,%rdx
   <+19>:    callq  0x40b950 <syscall@plt>
   <+24>:    imul   $0x3b9aca00,(%rsp),%rax
   <+32>:    add    0x8(%rsp),%rax
   <+37>:    add    $0x18,%rsp
   <+41>:    retq

My assembly is pretty lame, so maybe I'm missing something, but it
seems pretty close to the same thing to me, and does appear to be using
CLOCK_MONOTONIC (which is code 1) in the syscall.