Hi all,

Thanks to Howard's advice on mutrace and the addition of systemtap probes, I have some more data and a few things I can start to action to (hopefully) improve the server's performance.

The next step was twofold: measure the timing of "do_search" excluding the connection code, and identify lock contention with mutrace. Both yielded good data.

Due to an issue with systemtap, it appears that hooking on process().function is not always reliable - I believe this is related to inlining of some functions. This meant I needed to add probe points so the tools can hook into the code reliably at known locations. It also allows probes to sit in the middle of a function, so we can break a function down and see how each part behaves. For example, in do_search I was able to add probes at the function entry, after query preparation, after op_shared_search, after finalisation, and at the function return (a sketch of what these markers look like is included below).

This is the histogram of the "full" do_search:

Distribution of do_search_full latencies (in nanoseconds) for 776771 samples
max/avg/min: 1576111187091993/2029055283/46
  value |-------------------------------------------------- count
      8 | 0
     16 | 0
     32 | 106
     64 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 755031
    128 |@ 18609
    256 | 1886
    512 | 394
   1024 | 135
   2048 | 56
   4096 | 67
   8192 | 199
  16384 | 0
  32768 | 0
  65536 | 243
 131072 | 23
 262144 | 1
 524288 | 20
1048576 | 0
2097152 | 0
      ~
281474976710656 | 0
562949953421312 | 0
1125899906842624 | 1
2251799813685248 | 0
4503599627370496 | 0

These are the three stages (preparation, op_shared execution, and finalisation):

Distribution of do_search_prepared latencies (in nanoseconds) for 776772 samples
max/avg/min: 1576111187091369/2029052531/3
  value |-------------------------------------------------- count
      0 | 0
      1 | 0
      2 | 7
      4 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 612338
      8 |@@@@@@@@@@@ 144976
     16 |@ 18222
     32 | 873
     64 | 189
    128 | 127
    256 | 30
    512 | 5
   1024 | 0
   2048 | 1
   4096 | 1
   8192 | 2
  16384 | 0
  32768 | 0
      ~
281474976710656 | 0
562949953421312 | 0
1125899906842624 | 1
2251799813685248 | 0
4503599627370496 | 0

Distribution of do_search_complete latencies (in nanoseconds) for 776772 samples
max/avg/min: 662727/137/38
  value |-------------------------------------------------- count
      8 | 0
     16 | 0
     32 | 14543
     64 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 748736
    128 | 10848
    256 | 1613
    512 | 302
   1024 | 127
   2048 | 54
   4096 | 66
   8192 | 196
  16384 | 0
  32768 | 0
  65536 | 243
 131072 | 23
 262144 | 1
 524288 | 20
1048576 | 0
2097152 | 0

Distribution of do_search_finalise latencies (in nanoseconds) for 776771 samples
max/avg/min: 8940/1/1
  value |-------------------------------------------------- count
      0 | 0
      1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 338484
      2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 414648
      4 | 3582
      8 |@@ 18382
     16 | 1497
     32 | 81
     64 | 44
    128 | 38
    256 | 8
    512 | 5
   1024 | 1
   2048 | 0
   4096 | 0
   8192 | 1
  16384 | 0
  32768 | 0

So once again, in do_search_full we can see spikes throughout the whole execution of do_search.

Observing do_search_prepared, we can see one major spike (which I have no idea where it comes from ...), but otherwise most of the values stay near the 8 - 256 ns mark. Not bad indeed! Optimising this probably won't yield much improvement.

do_search_finalise really is just "log access" and some frees. Again, this mostly stays low, but a few spikes go up to 512+ ns. We'll come back to this.

And of course, do_search_complete is measuring op_shared_search.
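As an aside, for anyone wanting to add similar instrumentation: the probe points are just static userspace (SDT) markers, which give stap a stable place to attach regardless of how the compiler inlines functions. A minimal sketch of the pattern, assuming sys/sdt.h from systemtap-sdt-devel - the marker names here mirror the histogram names above, but the real names and locations are in the PR:

    /* Sketch only - not the actual 389-ds source; marker names are illustrative. */
    #include <sys/sdt.h>

    static void
    do_search_sketch(void)
    {
        DTRACE_PROBE(ns_slapd, do_search_entry);      /* function entry */

        /* ... decode the request / prepare the query ... */
        DTRACE_PROBE(ns_slapd, do_search_prepared);   /* preparation finished */

        /* ... op_shared_search() does the real work ... */
        DTRACE_PROBE(ns_slapd, do_search_complete);   /* backend work finished */

        /* ... access logging and frees ... */
        DTRACE_PROBE(ns_slapd, do_search_finalise);   /* finalisation finished */
        DTRACE_PROBE(ns_slapd, do_search_return);     /* function return */
    }

    int
    main(void)
    {
        for (int i = 0; i < 1000; i++) {
            do_search_sketch();
        }
        return 0;
    }

A stap script can then hook each marker with process("...").mark("do_search_entry"), take per-thread gettimeofday_ns() timestamps, and feed the deltas into @hist_log() to produce histograms like the ones above.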
op_shared_search is where we expect most of the work to happen, and it is where we see the latency spikes occurring. So next I dug into op_shared_search:

Distribution of op_shared_search_full latencies (in nanoseconds) for 744467 samples
max/avg/min: 1576124962139717/19054067382/41
  value |-------------------------------------------------- count
      8 | 0
     16 | 0
     32 | 144
     64 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 728581
    128 | 13318
    256 | 1724
    512 | 240
   1024 | 48
   2048 | 42
   4096 | 39
   8192 | 114
  16384 | 0
  32768 | 2
  65536 | 154
 131072 | 9
 262144 | 33
 524288 | 10
1048576 | 0
2097152 | 0
      ~
281474976710656 | 0
562949953421312 | 0
1125899906842624 | 9
2251799813685248 | 0
4503599627370496 | 0

Distribution of op_shared_search_prepared latencies (in nanoseconds) for 744462 samples
max/avg/min: 1576124962139653/6351398480/7
  value |-------------------------------------------------- count
      1 | 0
      2 | 0
      4 | 1
      8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 578754
     16 |@@@@@@@@@@@@@ 160359
     32 | 3822
     64 | 358
    128 | 700
    256 | 224
    512 | 51
   1024 | 17
   2048 | 23
   4096 | 22
   8192 | 79
  16384 | 0
  32768 | 2
  65536 | 0
 131072 | 6
 262144 | 31
 524288 | 10
1048576 | 0
2097152 | 0
      ~
281474976710656 | 0
562949953421312 | 0
1125899906842624 | 3
2251799813685248 | 0
4503599627370496 | 0

Distribution of op_shared_search_complete latencies (in nanoseconds) for 744467 samples
max/avg/min: 1576124962139598/12702711719/19
  value |-------------------------------------------------- count
      4 | 0
      8 | 0
     16 | 132
     32 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 544392
     64 |@@@@@@@@@@@@@@@@@ 195765
    128 | 3206
    256 | 647
    512 | 105
   1024 | 14
   2048 | 17
   4096 | 13
   8192 | 21
  16384 | 0
  32768 | 0
  65536 | 147
 131072 | 2
 262144 | 0
 524288 | 0
      ~
281474976710656 | 0
562949953421312 | 0
1125899906842624 | 6
2251799813685248 | 0
4503599627370496 | 0

Distribution of op_shared_search_finalise latencies (in nanoseconds) for 744467 samples
max/avg/min: 1576124909455854/6351355714/2
  value |-------------------------------------------------- count
      0 | 0
      1 | 0
      2 | 1
      4 | 1
      8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 644654
     16 |@@@@@@@ 95535
     32 | 3005
     64 | 364
    128 | 688
    256 | 166
    512 | 15
   1024 | 5
   2048 | 2
   4096 | 4
   8192 | 13
  16384 | 0
  32768 | 0
  65536 | 8
 131072 | 3
 262144 | 0
 524288 | 0
      ~
281474976710656 | 0
562949953421312 | 0
1125899906842624 | 3
2251799813685248 | 0
4503599627370496 | 0

Again we have probes throughout, measuring the full time plus the preparation, backend execution, and finalisation stages of the query.

In the full histogram we see most operations around the 64 ns mark, but there are some spikes up to 524288 ns, which is a bit concerning - we shouldn't be seeing such drastic spikes. With this in mind, we can quickly identify that it is in the query preparation that this occurs, as it is the only histogram that replicates the spikes up to 524288 ns. We can also see a few spikes in the backend through the jumps to 65536 ns. So this means we probably need to look at the preparation and backend parts of search next, to work out which deeper parts are causing this.
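One thought for that follow-up (not something in the current PR, just an illustration of the technique): SDT markers can also carry arguments, so a slow preparation or backend phase could be tied back to the operation that triggered it. A sketch, with purely illustrative names:

    /* Sketch only - illustrative names, not the actual 389-ds probes. */
    #include <sys/sdt.h>

    static void
    op_shared_search_sketch(const char *base, const char *filter)
    {
        /* Passing the base and filter lets the stap side report them for any
         * sample that lands in a high latency bucket. */
        DTRACE_PROBE2(ns_slapd, op_shared_search_entry, base, filter);

        /* ... preparation, backend execution, finalisation ... */

        DTRACE_PROBE2(ns_slapd, op_shared_search_return, base, filter);
    }

    int
    main(void)
    {
        op_shared_search_sketch("dc=example,dc=com", "(uid=demo)");
        return 0;
    }

On the stap side the marker arguments arrive as $arg1/$arg2 (read with user_string()), so the slow samples could be logged with their filter rather than just counted.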
I also followed up to check the log access times, based on the finalise stage of do_search. Again we break this into the time to prepare a log entry, and then the time to buffer it.

Distribution of log_access_full latencies (in nanoseconds) for 943213 samples
max/avg/min: 573628/19/4
  value |-------------------------------------------------- count
      1 | 0
      2 | 0
      4 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 813261
      8 |@@@@@ 85422
     16 |@@ 41677
     32 | 1567
     64 | 308
    128 | 575
    256 | 139
    512 | 49
   1024 | 14
   2048 | 23
   4096 | 8
   8192 | 130
  16384 | 0
  32768 | 10
  65536 | 0
 131072 | 10
 262144 | 10
 524288 | 10
1048576 | 0
2097152 | 0

Distribution of log_access_prepared latencies (in nanoseconds) for 943213 samples
max/avg/min: 2281/4/2
  value |-------------------------------------------------- count
      0 | 0
      1 | 0
      2 |@@@@@@@@@@ 149924
      4 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 734833
      8 |@@ 35399
     16 |@ 21736
     32 | 721
     64 | 219
    128 | 290
    256 | 73
    512 | 14
   1024 | 3
   2048 | 1
   4096 | 0
   8192 | 0

Distribution of log_access_complete latencies (in nanoseconds) for 943213 samples
max/avg/min: 573624/14/0
  value |-------------------------------------------------- count
      0 | 1035
      1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 822635
      2 |@@@@@ 93607
      4 | 5363
      8 |@ 17309
     16 | 2494
     32 | 130
     64 | 71
    128 | 267
    256 | 58
    512 | 34
   1024 | 11
   2048 | 21
   4096 | 8
   8192 | 130
  16384 | 0
  32768 | 10
  65536 | 0
 131072 | 10
 262144 | 10
 524288 | 10
1048576 | 0
2097152 | 0

From this we can see that the time to prepare the entry is of almost no consequence. It's the log_append_buffer2 function that can take up to 524288 ns to execute, when in the "normal" case it takes about 8 ns. That's a 65536x degradation in performance!
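That shape - fast in the common case, occasionally tens of thousands of times slower - is what you would expect from a shared in-memory log buffer behind a single lock: most appends are just a short copy into the buffer, but when the buffer fills, one unlucky caller pays for the flush while holding the lock, and everyone else queues behind it. A rough sketch purely to illustrate that behaviour (this is NOT the 389-ds implementation, just my working assumption about the pattern):

    /* Illustrative sketch of a locked, flush-on-full log buffer. */
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    #define LOG_BUF_SIZE (512 * 1024)

    static pthread_mutex_t log_lock = PTHREAD_MUTEX_INITIALIZER;
    static char log_buf[LOG_BUF_SIZE];
    static size_t log_used = 0;

    static void
    log_append_sketch(FILE *fp, const char *line, size_t len)
    {
        pthread_mutex_lock(&log_lock);
        if (log_used + len > LOG_BUF_SIZE) {
            /* The unlucky caller writes the whole buffer out while every other
             * thread blocks on log_lock - this is the latency spike. */
            fwrite(log_buf, 1, log_used, fp);
            fflush(fp);
            log_used = 0;
        }
        /* The common case: a short memcpy - a handful of nanoseconds. */
        memcpy(log_buf + log_used, line, len);
        log_used += len;
        pthread_mutex_unlock(&log_lock);
    }

    int
    main(void)
    {
        FILE *fp = fopen("/tmp/access-sketch.log", "a");
        if (fp == NULL) {
            return 1;
        }
        const char *line = "conn=1 op=1 SRCH base=\"dc=example,dc=com\"\n";
        for (int i = 0; i < 100000; i++) {
            log_append_sketch(fp, line, strlen(line));
        }
        fclose(fp);
        return 0;
    }

If that is what is happening, moving the flush out from under the lock (or buffering per thread) would flatten those spikes - hence the actionable item below about improving the access log buffer.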
Next was to look at the most contended mutexes. This generated a lot of data, so I'll only include the top results.

mutrace: Showing 23867 mutexes in order of (write) contention count:

Mutex #    Locked   Changed    Cont.  cont.Time[ms]  tot.Time[ms]  avg.Time[ms]  Flags
  21209   8348033    931370   202356       3728.122       960.397         0.000  M-.--.
   4593   5616358   1434100    85812        195.660       428.424         0.000  M-.a-.
   6556   2089049   1841307    44116       1441.666      2430.681         0.001  M-.a-.
   6983   2368664    307564    36171      14461.861     17827.474         0.008  M-.r-.
   6976   2344301    303284    33996      14139.214     18479.060         0.008  M-.r-.
   6974   2421485    311208    33944      11542.366     15021.549         0.006  M-.r-.
   6975   2366735    304595    33769      12534.205     17024.613         0.007  M-.r-.
   6977   2298666    298851    33745      15032.955     19988.776         0.009  M-.r-.
   6982   2354910    304511    33690      14949.869     18772.182         0.008  M-.r-.
  21187   5565777    691964    33006        551.482       547.347         0.000  M-.--.
   6978   2288445    298816    32834      16983.236     20672.063         0.009  M-.r-.
   6980   2330792    304620    31609      15700.067     19536.323         0.008  M-.r-.
   6981   2301734    298641    31374      16685.940     20538.976         0.009  M-.r-.
   6979   2257093    295274    31100      17128.837     21545.489         0.010  M-.r-.
     38   1395342   1088064    20207        508.864      2613.544         0.002  M-.a-.
   6710   1392690   1257073     8006         84.416       176.202         0.000  M-.a-.
    967   1391484    673342     6016         39.003       138.278         0.000  M-.a-.
   6709   1394704   1086851     4235         94.286       174.136         0.000  M-.a-.
  21111   1511494    655694     2991         62.206       304.168         0.000  M-.--.
  21174    120109     10777     2316         64.759        15.292         0.000  M-.--.
     29   1391466   1081661     2222      35772.114      3425.260         0.002  M-.a-.
     25    695996    647807     1174          9.887        83.750         0.000  M-.a-.
   6194   1394704   1088158      558         16.913       203.315         0.000  M-.a-.
   4595     36131      7253      193          2.617         3.034         0.000  M-.a-.
  21106    143568     34113      143          8.706        44.730         0.000  M-.--.
  21167     24045      6548      112          1.354         3.898         0.000  M-.--.

The columns of interest are Cont. ("how many times the mutex/rwlock was contended"), cont.Time ("how long it was contended for"), and tot.Time ("how long the lock was locked for").

This shows that the two most contended locks are:

Mutex #21209 (0x0x7fd8a5b27868) first referenced by:
    /usr/local/lib64/libmutrace.so(pthread_mutex_init+0xea) [0x7fd9c7d788fa]
    /usr/lib64/libdb-4.8.so(__db_pthread_mutex_init+0x17a) [0x7fd8a687fc6a]

Mutex #4593 (0x0x2138358) first referenced by:
    /usr/local/lib64/libmutrace.so(pthread_mutex_init+0xea) [0x7fd9c7d788fa]
    /usr/lib64/libnspr4.so(PR_NewMonitor+0x47) [0x7fd9c7373be7]
    /opt/dirsrv/lib64/dirsrv/plugins/libback-ldbm.so(ldbm_instance_create+0x84) [0x7fd8a665438e]
    /opt/dirsrv/lib64/dirsrv/plugins/libback-ldbm.so(+0x590d7) [0x7fd8a66730d7]
    /opt/dirsrv/lib64/dirsrv/plugins/libback-ldbm.so(ldbm_instance_add_instance_entry_callback+0x105) [0x7fd8a6673833]
    /opt/dirsrv/lib64/dirsrv/plugins/libback-ldbm.so(ldbm_config_read_instance_entries+0x133) [0x7fd8a6661847]
    /opt/dirsrv/lib64/dirsrv/plugins/libback-ldbm.so(ldbm_config_load_dse_info+0x270) [0x7fd8a6661b4f]
    /opt/dirsrv/lib64/dirsrv/plugins/libback-ldbm.so(dblayer_setup+0xea) [0x7fd8a663fa98]
    /opt/dirsrv/lib64/dirsrv/plugins/libback-ldbm.so(ldbm_back_start+0x69) [0x7fd8a6685931]
    /opt/dirsrv/lib64/dirsrv/libslapd.so.0(+0xc5bea) [0x7fd9c7cc6bea]
    /opt/dirsrv/lib64/dirsrv/libslapd.so.0(+0xc5a80) [0x7fd9c7cc6a80]
    /opt/dirsrv/lib64/dirsrv/libslapd.so.0(+0xc52e1) [0x7fd9c7cc62e1]
    /opt/dirsrv/lib64/dirsrv/libslapd.so.0(plugin_startall+0x39) [0x7fd9c7cc6a0f]
    /opt/dirsrv/sbin/ns-slapd(main+0xb81) [0x4271ca]
    /lib64/libc.so.6(__libc_start_main+0xeb) [0x7fd9c7772e0b]

Everything needs access to the db. However, the contended "time" was reasonably low, so I think these aren't "too" concerning. It's the 4th through 10th locks that concerned me, as they had very large "contended" times.

Mutex #6983 (0x0x27132e8) first referenced by:
    /usr/local/lib64/libmutrace.so(pthread_mutex_init+0xea) [0x7fd9c7d788fa]
    /opt/dirsrv/sbin/ns-slapd(connection_table_get_connection+0x1d2) [0x41c940]
    /opt/dirsrv/sbin/ns-slapd() [0x420c69]
    /opt/dirsrv/sbin/ns-slapd() [0x41e481]
    /opt/dirsrv/sbin/ns-slapd(slapd_daemon+0x643) [0x41f11c]
    /opt/dirsrv/sbin/ns-slapd(main+0xc5f) [0x4272a8]
    /lib64/libc.so.6(__libc_start_main+0xeb) [0x7fd9c7772e0b]

They all have the same backtrace - these will be the connection locks, and the contention will be from the attempts to try-lock them to find a free slot. Given that a change I made to the conntable earlier this year already yielded a 30% performance improvement, I think this is the obvious first place to look, and once it's improved we'll be able to see the "next bottleneck" more easily.

Actionable items:

I've now submitted the stap tools and the changes needed in a PR. The obvious code changes are:

* improving the access log buffer (https://pagure.io/389-ds-base/issue/50785)
* improving connection table slot acquisition (https://pagure.io/389-ds-base/issue/50786 - I plan to start work on this asap)

Further investigations:

* the "preparation" steps in op_shared_search
* the backend query times, and profiling them

Hope that helps,

--
Sincerely,

William