Follow up to system tap email

Hi all,

Thanks to Howard's advice on mutrace, and the addition of systemtap probes, I have some more data and a few things I can start to action to (hopefully) improve server performance.

The next step was twofold: measure the timing of "do_search" excluding the connection code, and identify lock contention with mutrace. Both yielded good data.


Due to an issue with systemtap, it appears that hooking on process().function is not always reliable - I believe this could be related to inlining of some functions. This meant I needed to add probe points so the tools can hook into the code reliably at known locations. It also allows probes to sit in the middle of a function, so we can break functions down and see how each part behaves.

For example, in do_search I was able to add probes for the function entry, query preparation, op_shared_search, the finalisation, and the function return.
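To show the idea, here is a minimal sketch of embedding static probe points in a do_search-style function. This is illustrative only, not the actual 389-ds patch: it assumes systemtap-sdt-devel's <sys/sdt.h> (DTRACE_PROBE compiles to a no-op instruction plus an ELF note that stap can attach to, even when the function is inlined), and falls back to empty macros so the sketch compiles anywhere.

```c
/* Hypothetical sketch of SDT probe points; names are illustrative.
 * With <sys/sdt.h> available, each DTRACE_PROBE emits a stable mark
 * that `probe process("ns-slapd").mark("...")` can hook reliably. */
#ifdef HAVE_SYS_SDT_H
#include <sys/sdt.h>
#else
#define DTRACE_PROBE(provider, name) do { } while (0)
#endif

static int
do_search(void)
{
    DTRACE_PROBE(ns_slapd, do_search__entry);    /* function entry */
    /* ... decode the request, build the operation ... */
    DTRACE_PROBE(ns_slapd, do_search__prepared); /* query preparation done */
    /* ... op_shared_search() does the real work ... */
    DTRACE_PROBE(ns_slapd, do_search__complete); /* backend execution done */
    /* ... access logging and frees ... */
    DTRACE_PROBE(ns_slapd, do_search__finalise); /* finalisation done */
    DTRACE_PROBE(ns_slapd, do_search__return);   /* function return */
    return 0;
}
```

A stap script can then take timestamps at consecutive marks and feed the deltas into a log2 histogram, which is how the distributions below were gathered.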

This is the histogram of the "full" do_search:

Distribution of do_search_full latencies (in nanoseconds) for 776771 samples
max/avg/min: 1576111187091993/2029055283/46
           value |-------------------------------------------------- count
               8 |                                                        0
              16 |                                                        0
              32 |                                                      106
              64 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  755031
             128 |@                                                   18609
             256 |                                                     1886
             512 |                                                      394
            1024 |                                                      135
            2048 |                                                       56
            4096 |                                                       67
            8192 |                                                      199
           16384 |                                                        0
           32768 |                                                        0
           65536 |                                                      243
          131072 |                                                       23
          262144 |                                                        1
          524288 |                                                       20
         1048576 |                                                        0
         2097152 |                                                        0
                 ~
 281474976710656 |                                                        0
 562949953421312 |                                                        0
1125899906842624 |                                                        1
2251799813685248 |                                                        0
4503599627370496 |                                                        0


These are the three stages (preparation, op_shared execution, and finalisation).

Distribution of do_search_prepared latencies (in nanoseconds) for 776772 samples
max/avg/min: 1576111187091369/2029052531/3
           value |-------------------------------------------------- count
               0 |                                                        0
               1 |                                                        0
               2 |                                                        7
               4 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  612338
               8 |@@@@@@@@@@@                                        144976
              16 |@                                                   18222
              32 |                                                      873
              64 |                                                      189
             128 |                                                      127
             256 |                                                       30
             512 |                                                        5
            1024 |                                                        0
            2048 |                                                        1
            4096 |                                                        1
            8192 |                                                        2
           16384 |                                                        0
           32768 |                                                        0
                 ~
 281474976710656 |                                                        0
 562949953421312 |                                                        0
1125899906842624 |                                                        1
2251799813685248 |                                                        0
4503599627370496 |                                                        0

Distribution of do_search_complete latencies (in nanoseconds) for 776772 samples
max/avg/min: 662727/137/38
  value |-------------------------------------------------- count
      8 |                                                        0
     16 |                                                        0
     32 |                                                    14543
     64 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  748736
    128 |                                                    10848
    256 |                                                     1613
    512 |                                                      302
   1024 |                                                      127
   2048 |                                                       54
   4096 |                                                       66
   8192 |                                                      196
  16384 |                                                        0
  32768 |                                                        0
  65536 |                                                      243
 131072 |                                                       23
 262144 |                                                        1
 524288 |                                                       20
1048576 |                                                        0
2097152 |                                                        0

Distribution of do_search_finalise latencies (in nanoseconds) for 776771 samples
max/avg/min: 8940/1/1
value |-------------------------------------------------- count
    0 |                                                        0
    1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           338484
    2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  414648
    4 |                                                     3582
    8 |@@                                                  18382
   16 |                                                     1497
   32 |                                                       81
   64 |                                                       44
  128 |                                                       38
  256 |                                                        8
  512 |                                                        5
 1024 |                                                        1
 2048 |                                                        0
 4096 |                                                        0
 8192 |                                                        1
16384 |                                                        0
32768 |                                                        0



So once again, in do_search_full we can see spikes throughout the whole execution of do_search. Observing do_search_prepared, we can see one major spike (which I have no idea where it came from ...), but otherwise most of the values stay near the 8 - 256ns mark. Not bad indeed! Optimising this probably won't yield much improvement.

In do_search_finalise, this really is just "log access" and some frees. Again, this mostly stays low, but a few spikes go up to 512+ns. We'll come back to this.

And of course, do_search_complete measures op_shared_search. This is where we expect most of the work, and it is where we see the latency spikes occurring.


So next I dug into op_shared_search:

Distribution of op_shared_search_full latencies (in nanoseconds) for 744467 samples
max/avg/min: 1576124962139717/19054067382/41
           value |-------------------------------------------------- count
               8 |                                                        0
              16 |                                                        0
              32 |                                                      144
              64 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  728581
             128 |                                                    13318
             256 |                                                     1724
             512 |                                                      240
            1024 |                                                       48
            2048 |                                                       42
            4096 |                                                       39
            8192 |                                                      114
           16384 |                                                        0
           32768 |                                                        2
           65536 |                                                      154
          131072 |                                                        9
          262144 |                                                       33
          524288 |                                                       10
         1048576 |                                                        0
         2097152 |                                                        0
                 ~
 281474976710656 |                                                        0
 562949953421312 |                                                        0
1125899906842624 |                                                        9
2251799813685248 |                                                        0
4503599627370496 |                                                        0


Distribution of op_shared_search_prepared latencies (in nanoseconds) for 744462 samples
max/avg/min: 1576124962139653/6351398480/7
           value |-------------------------------------------------- count
               1 |                                                        0
               2 |                                                        0
               4 |                                                        1
               8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  578754
              16 |@@@@@@@@@@@@@                                      160359
              32 |                                                     3822
              64 |                                                      358
             128 |                                                      700
             256 |                                                      224
             512 |                                                       51
            1024 |                                                       17
            2048 |                                                       23
            4096 |                                                       22
            8192 |                                                       79
           16384 |                                                        0
           32768 |                                                        2
           65536 |                                                        0
          131072 |                                                        6
          262144 |                                                       31
          524288 |                                                       10
         1048576 |                                                        0
         2097152 |                                                        0
                 ~
 281474976710656 |                                                        0
 562949953421312 |                                                        0
1125899906842624 |                                                        3
2251799813685248 |                                                        0
4503599627370496 |                                                        0



Distribution of op_shared_search_complete latencies (in nanoseconds) for 744467 samples
max/avg/min: 1576124962139598/12702711719/19
           value |-------------------------------------------------- count
               4 |                                                        0
               8 |                                                        0
              16 |                                                      132
              32 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  544392
              64 |@@@@@@@@@@@@@@@@@                                  195765
             128 |                                                     3206
             256 |                                                      647
             512 |                                                      105
            1024 |                                                       14
            2048 |                                                       17
            4096 |                                                       13
            8192 |                                                       21
           16384 |                                                        0
           32768 |                                                        0
           65536 |                                                      147
          131072 |                                                        2
          262144 |                                                        0
          524288 |                                                        0
                 ~
 281474976710656 |                                                        0
 562949953421312 |                                                        0
1125899906842624 |                                                        6
2251799813685248 |                                                        0
4503599627370496 |                                                        0

Distribution of op_shared_search_finalise latencies (in nanoseconds) for 744467 samples
max/avg/min: 1576124909455854/6351355714/2
           value |-------------------------------------------------- count
               0 |                                                        0
               1 |                                                        0
               2 |                                                        1
               4 |                                                        1
               8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  644654
              16 |@@@@@@@                                             95535
              32 |                                                     3005
              64 |                                                      364
             128 |                                                      688
             256 |                                                      166
             512 |                                                       15
            1024 |                                                        5
            2048 |                                                        2
            4096 |                                                        4
            8192 |                                                       13
           16384 |                                                        0
           32768 |                                                        0
           65536 |                                                        8
          131072 |                                                        3
          262144 |                                                        0
          524288 |                                                        0
                 ~
 281474976710656 |                                                        0
 562949953421312 |                                                        0
1125899906842624 |                                                        3
2251799813685248 |                                                        0
4503599627370496 |                                                        0



Again we have probes throughout, measuring the full time plus the preparation, backend execution, and finalisation stages of the query. In the full histogram we see most operations around the 64ns mark, but there are some spikes up to 524288ns, which is a bit concerning - we shouldn't be seeing such drastic spikes.

With this in mind, we can quickly identify that this occurs in the query preparation, as it's the only histogram that replicates the spikes up to 524288ns.

We can also see a few spikes in the backend through the jumps to 65536ns. So we probably need to look at the preparation and backend parts of search next, to work out which deeper parts are causing this.

I also followed up to check the access log times, based on the finalise stage of do_search. Again, this is broken into the time to prepare a log entry and the time to buffer it.

Distribution of log_access_full latencies (in nanoseconds) for 943213 samples
max/avg/min: 573628/19/4
  value |-------------------------------------------------- count
      1 |                                                        0
      2 |                                                        0
      4 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  813261
      8 |@@@@@                                               85422
     16 |@@                                                  41677
     32 |                                                     1567
     64 |                                                      308
    128 |                                                      575
    256 |                                                      139
    512 |                                                       49
   1024 |                                                       14
   2048 |                                                       23
   4096 |                                                        8
   8192 |                                                      130
  16384 |                                                        0
  32768 |                                                       10
  65536 |                                                        0
 131072 |                                                       10
 262144 |                                                       10
 524288 |                                                       10
1048576 |                                                        0
2097152 |                                                        0

Distribution of log_access_prepared latencies (in nanoseconds) for 943213 samples
max/avg/min: 2281/4/2
value |-------------------------------------------------- count
    0 |                                                        0
    1 |                                                        0
    2 |@@@@@@@@@@                                         149924
    4 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  734833
    8 |@@                                                  35399
   16 |@                                                   21736
   32 |                                                      721
   64 |                                                      219
  128 |                                                      290
  256 |                                                       73
  512 |                                                       14
 1024 |                                                        3
 2048 |                                                        1
 4096 |                                                        0
 8192 |                                                        0


Distribution of log_access_complete latencies (in nanoseconds) for 943213 samples
max/avg/min: 573624/14/0
  value |-------------------------------------------------- count
      0 |                                                     1035
      1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  822635
      2 |@@@@@                                               93607
      4 |                                                     5363
      8 |@                                                   17309
     16 |                                                     2494
     32 |                                                      130
     64 |                                                       71
    128 |                                                      267
    256 |                                                       58
    512 |                                                       34
   1024 |                                                       11
   2048 |                                                       21
   4096 |                                                        8
   8192 |                                                      130
  16384 |                                                        0
  32768 |                                                       10
  65536 |                                                        0
 131072 |                                                       10
 262144 |                                                       10
 524288 |                                                       10
1048576 |                                                        0
2097152 |                                                        0


From this we can see that the time to prepare the entry is of almost no consequence. It's the log_append_buffer2 function that can take up to 524288ns to execute, when in the "normal" case it takes around 8ns. That's a 65536x degradation in performance!
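As a hypothetical illustration of why a buffered append shows this bimodal pattern (this is a sketch, not the actual log_append_buffer2 implementation): the common case is a cheap memcpy into an in-memory buffer, but the unlucky caller whose entry fills the buffer also pays the full cost of flushing it, while holding the log lock.

```c
#include <string.h>

#define LOG_BUF_SIZE 64   /* illustrative size only */

/* Hypothetical log buffer state; names are illustrative. */
static char   log_buf[LOG_BUF_SIZE];
static size_t log_used = 0;
static int    flush_count = 0;

/* Stand-in for the expensive path: in a real server this would be a
 * write(2) of the whole buffer to the access log file. */
static void log_flush(void) {
    flush_count++;      /* pretend we wrote log_used bytes to disk */
    log_used = 0;
}

/* Fast path: a memcpy, taking nanoseconds. Slow path: the append that
 * overflows the buffer triggers the flush inline, which is where
 * occasional 8ns -> 524288ns spikes would come from. */
static void log_append(const char *msg, size_t len) {
    if (log_used + len > LOG_BUF_SIZE)
        log_flush();
    memcpy(log_buf + log_used, msg, len);
    log_used += len;
}
```

One obvious mitigation for this shape of problem is to hand the full buffer to a dedicated flushing thread, so no search operation ever pays the write cost inline.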


Next was to look at the most contended mutexes. This generated a lot of data, so I'll only include the top results.

mutrace: Showing 23867 mutexes in order of (write) contention count:

 Mutex #   Locked  Changed    Cont. cont.Time[ms] tot.Time[ms] avg.Time[ms] Flags
   21209  8348033   931370   202356      3728.122      960.397        0.000 M-.--.
    4593  5616358  1434100    85812       195.660      428.424        0.000 M-.a-.
    6556  2089049  1841307    44116      1441.666     2430.681        0.001 M-.a-.
    6983  2368664   307564    36171     14461.861    17827.474        0.008 M-.r-.
    6976  2344301   303284    33996     14139.214    18479.060        0.008 M-.r-.
    6974  2421485   311208    33944     11542.366    15021.549        0.006 M-.r-.
    6975  2366735   304595    33769     12534.205    17024.613        0.007 M-.r-.
    6977  2298666   298851    33745     15032.955    19988.776        0.009 M-.r-.
    6982  2354910   304511    33690     14949.869    18772.182        0.008 M-.r-.
   21187  5565777   691964    33006       551.482      547.347        0.000 M-.--.
    6978  2288445   298816    32834     16983.236    20672.063        0.009 M-.r-.
    6980  2330792   304620    31609     15700.067    19536.323        0.008 M-.r-.
    6981  2301734   298641    31374     16685.940    20538.976        0.009 M-.r-.
    6979  2257093   295274    31100     17128.837    21545.489        0.010 M-.r-.
      38  1395342  1088064    20207       508.864     2613.544        0.002 M-.a-.
    6710  1392690  1257073     8006        84.416      176.202        0.000 M-.a-.
     967  1391484   673342     6016        39.003      138.278        0.000 M-.a-.
    6709  1394704  1086851     4235        94.286      174.136        0.000 M-.a-.
   21111  1511494   655694     2991        62.206      304.168        0.000 M-.--.
   21174   120109    10777     2316        64.759       15.292        0.000 M-.--.
      29  1391466  1081661     2222     35772.114     3425.260        0.002 M-.a-.
      25   695996   647807     1174         9.887       83.750        0.000 M-.a-.
    6194  1394704  1088158      558        16.913      203.315        0.000 M-.a-.
    4595    36131     7253      193         2.617        3.034        0.000 M-.a-.
   21106   143568    34113      143         8.706       44.730        0.000 M-.--.
   21167    24045     6548      112         1.354        3.898        0.000 M-.--.


The column of interest is Cont., which is how many times the mutex/rwlock was contended. cont.Time is how long it was contended for, and tot.Time tells us how long the lock was held in total.

This shows the top two most contended locks are:

Mutex #21209 (0x0x7fd8a5b27868) first referenced by:
        /usr/local/lib64/libmutrace.so(pthread_mutex_init+0xea) [0x7fd9c7d788fa]
        /usr/lib64/libdb-4.8.so(__db_pthread_mutex_init+0x17a) [0x7fd8a687fc6a]

Mutex #4593 (0x0x2138358) first referenced by:
        /usr/local/lib64/libmutrace.so(pthread_mutex_init+0xea) [0x7fd9c7d788fa]
        /usr/lib64/libnspr4.so(PR_NewMonitor+0x47) [0x7fd9c7373be7]
        /opt/dirsrv/lib64/dirsrv/plugins/libback-ldbm.so(ldbm_instance_create+0x84) [0x7fd8a665438e]
        /opt/dirsrv/lib64/dirsrv/plugins/libback-ldbm.so(+0x590d7) [0x7fd8a66730d7]
        /opt/dirsrv/lib64/dirsrv/plugins/libback-ldbm.so(ldbm_instance_add_instance_entry_callback+0x105) [0x7fd8a6673833]
        /opt/dirsrv/lib64/dirsrv/plugins/libback-ldbm.so(ldbm_config_read_instance_entries+0x133) [0x7fd8a6661847]
        /opt/dirsrv/lib64/dirsrv/plugins/libback-ldbm.so(ldbm_config_load_dse_info+0x270) [0x7fd8a6661b4f]
        /opt/dirsrv/lib64/dirsrv/plugins/libback-ldbm.so(dblayer_setup+0xea) [0x7fd8a663fa98]
        /opt/dirsrv/lib64/dirsrv/plugins/libback-ldbm.so(ldbm_back_start+0x69) [0x7fd8a6685931]
        /opt/dirsrv/lib64/dirsrv/libslapd.so.0(+0xc5bea) [0x7fd9c7cc6bea]
        /opt/dirsrv/lib64/dirsrv/libslapd.so.0(+0xc5a80) [0x7fd9c7cc6a80]
        /opt/dirsrv/lib64/dirsrv/libslapd.so.0(+0xc52e1) [0x7fd9c7cc62e1]
        /opt/dirsrv/lib64/dirsrv/libslapd.so.0(plugin_startall+0x39) [0x7fd9c7cc6a0f]
        /opt/dirsrv/sbin/ns-slapd(main+0xb81) [0x4271ca]
        /lib64/libc.so.6(__libc_start_main+0xeb) [0x7fd9c7772e0b]


Everything needs access to the db. However, the contended time was reasonably low, so I think these aren't "too" concerning.

It's the 4th through 10th locks that concerned me, as they had very large contended times.

Mutex #6983 (0x0x27132e8) first referenced by:
        /usr/local/lib64/libmutrace.so(pthread_mutex_init+0xea) [0x7fd9c7d788fa]
        /opt/dirsrv/sbin/ns-slapd(connection_table_get_connection+0x1d2) [0x41c940]
        /opt/dirsrv/sbin/ns-slapd() [0x420c69]
        /opt/dirsrv/sbin/ns-slapd() [0x41e481]
        /opt/dirsrv/sbin/ns-slapd(slapd_daemon+0x643) [0x41f11c]
        /opt/dirsrv/sbin/ns-slapd(main+0xc5f) [0x4272a8]
        /lib64/libc.so.6(__libc_start_main+0xeb) [0x7fd9c7772e0b]

They all have the same backtrace - these will be the connection locks, and the contention will come from the attempts to try-lock them to find a free slot. Given that I already made one change to the conntable this year that yielded a 30% performance improvement, I think this is the obvious first place to look, and once it's improved we'll be able to see the "next bottleneck" more easily.
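For context, the try-lock scan pattern looks roughly like this (an illustrative sketch, not the actual connection_table_get_connection code): every caller starts at the front of the table and trylocks each slot in turn, so the first few slot locks get hammered on every acquisition attempt - exactly the profile mutrace reports for locks #6974-#6983.

```c
#include <pthread.h>

#define CT_SIZE 4   /* illustrative; the real table is much larger */

/* Hypothetical connection table slot; names are illustrative. */
typedef struct {
    pthread_mutex_t lock;
    int in_use;
} ct_slot_t;

static ct_slot_t table[CT_SIZE];

static void ct_init(void) {
    for (int i = 0; i < CT_SIZE; i++) {
        pthread_mutex_init(&table[i].lock, NULL);
        table[i].in_use = 0;
    }
}

/* Linear try-lock scan: every caller begins at slot 0, so the locks at
 * the front of the table are contended on every single acquisition.
 * Returns the claimed slot index, or -1 if the table is full. */
static int ct_get_slot(void) {
    for (int i = 0; i < CT_SIZE; i++) {
        if (pthread_mutex_trylock(&table[i].lock) == 0) {
            if (!table[i].in_use) {
                table[i].in_use = 1;
                pthread_mutex_unlock(&table[i].lock);
                return i;
            }
            pthread_mutex_unlock(&table[i].lock);
        }
    }
    return -1;
}
```

Common fixes for this shape of contention include a freelist of open slots or starting each scan at a rotating offset, so callers stop converging on the same locks.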


Actionable Items:

I've now submitted the stap tools and the changes needed in a PR. 

The obvious code changes are:

* improving the access log buffer (https://pagure.io/389-ds-base/issue/50785)
* improving connection table slot acquisition (https://pagure.io/389-ds-base/issue/50786). I plan to start work on this ASAP.

Further investigations:

* the "preparation" steps in op_shared_search
* profiling the backend query times

Hope that helps,


--
Sincerely,

William
_______________________________________________
389-devel mailing list -- 389-devel@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to 389-devel-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/389-devel@xxxxxxxxxxxxxxxxxxxxxxx



