Hi all, thanks for answering, it took some time until I got back to this problem.

On 09.08.2018 at 10:44, Ludwig Krispenz wrote:
>
> On 08/09/2018 01:53 AM, William Brown wrote:
>> Sadly this doesn't tell us much :(
> we could get a pstack along with iotop to see which threads do the IO,
> regular mods or the BDB regulars like trickle, checkpointing ....

Unfortunately pstack doesn't give any output, even with the debugging symbols installed:

ii  389-ds-base           1.3.3.5-4+deb8u1  amd64  389 Directory Server suite - server
ii  389-ds-base-dbg       1.3.3.5-4+deb8u1  amd64  389 Directory Server suite - server debugging symbols
ii  389-ds-base-libs      1.3.3.5-4+deb8u1  amd64  389 Directory Server suite - libraries
ii  389-ds-base-libs-dbg  1.3.3.5-4+deb8u1  amd64  389 Directory Server suite - library debugging symbols

pstack 718
718: /usr/sbin/ns-slapd -D /etc/dirsrv/slapd-ldap1 -i /var/run/dirsrv/slapd-ldap1.pid -w /var/run/dirsrv/slapd-ldap1.startpid
(No symbols found)
0x7f69b42b37bc: ???? (7f69a0018270, 0, 7f69a00824a0, 0, 7f69b5c1e616, 564c254998e0) + ffffd6e2892ce4a0
0x3000001b3: ???? (0, 200000, 0, 31, 3, 564c252df680) + ffffa9b3da9bbfe3
crawl: Input/output error
Error tracing through process 718
0x564c242aaa00: ????

(A gdb/iotop based alternative for grabbing the thread stacks is sketched further below.)

>>
>> It would be good to see the ldbm config here I think. It would also be
>> good to know what kind of work load you have. In some configurations of
>> the network, things like load balancers could be sending more
>> read/write traffic to one server, rather than over all of them.

The workload is generally ok. With two cores the load is around 0.80; I/O wait is low, but CPU usage sometimes hits 100 % or more (see here: https://nextcloud.datenkollektiv.net/s/XEHmCyscEt4j9cn).

The ldbm config:

dn: cn=ldbm database,cn=plugins,cn=config
objectClass: top
objectClass: nsSlapdPlugin
objectClass: extensibleObject
cn: ldbm database
nsslapd-pluginPath: libback-ldbm
nsslapd-pluginInitfunc: ldbm_back_init
nsslapd-pluginType: database
nsslapd-pluginEnabled: on
nsslapd-plugin-depends-on-type: Syntax
nsslapd-plugin-depends-on-type: matchingRule
nsslapd-pluginId: ldbm-backend
nsslapd-pluginVersion: 1.3.3.5
nsslapd-pluginVendor: 389 Project
nsslapd-pluginDescription: high-performance LDAP backend database plugin
numSubordinates: 58

Example for a single ldbm database config:

dn: cn=example_net,cn=ldbm database,cn=plugins,cn=config
objectClass: top
objectClass: extensibleobject
objectClass: nsbackendinstance
cn: notraces_net
creatorsName: cn=directory manager
modifiersName: cn=directory manager
createTimestamp: 20171214123209Z
modifyTimestamp: 20171214123209Z
numSubordinates: 4
nsslapd-suffix: dc=notraces,dc=net
nsslapd-cachesize: -1
nsslapd-cachememsize: 1346371
nsslapd-readonly: off
nsslapd-require-index: off
nsslapd-directory: /var/lib/dirsrv/slapd-ldap0/db/notraces_net
nsslapd-dncachememsize: 512000

The issue: even if I set nsslapd-dncachememsize to some higher value (while dirsrv is stopped), the values are changed back by dirsrv after it starts (see the ldapmodify sketch further below for an online attempt).

>> Another spot to check is your log buffer size, as it may be small
>> (causing the threads to all be writing small amounts). The logging
>> subsystem is slated for a rework in the future to make it more
>> scalable.

The access log is off. Only the audit log is enabled, and there is not much in it. Probably the special setup with a lot of single databases (one per email domain, more than 50 DBs) is responsible. Maybe the high write rates are not really the cause of the poor LDAP performance (see my thread on this list with the subject "ldapsearch performance problem" from June). It's still the same.
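Since pstack fails here, the following is only a rough sketch of an alternative (assuming gdb and iotop are installed, and reusing PID 718 from above): dump all thread stacks with gdb and record per-thread I/O with iotop at roughly the same moment, so that the thread ids can be matched afterwards.

# dump all thread stacks of ns-slapd; the LWP ids show up in the thread headers
gdb -p 718 -batch -ex 'set pagination off' -ex 'thread apply all bt' > /tmp/slapd-stacks.txt 2>&1

# batch mode, only threads actually doing I/O, with a timestamp column, 10 samples
iotop -b -o -t -n 10 > /tmp/slapd-iotop.txt

The TIDs in the iotop output should correspond to the LWP numbers in the gdb output, which would show whether the I/O comes from the BDB housekeeping threads (trickle, checkpointing) or from worker threads handling regular mods.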
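Regarding the cache values being rewritten at startup: instead of editing dse.ldif while the instance is stopped, it might be worth trying the change online, so the server itself writes the new value out. This is only a minimal sketch; the backend DN is the one from the example above, and the value (roughly 20 MB) is just an illustration:

ldapmodify -h localhost -D "cn=directory manager" -w secret <<EOF
dn: cn=example_net,cn=ldbm database,cn=plugins,cn=config
changetype: modify
replace: nsslapd-dncachememsize
nsslapd-dncachememsize: 20971520
EOF

# read the effective values back to see whether the server still overrides them
ldapsearch -h localhost -D "cn=directory manager" -w secret -s base \
  -b "cn=example_net,cn=ldbm database,cn=plugins,cn=config" \
  nsslapd-dncachememsize nsslapd-cachememsize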
Sometimes (a few times per minute) the LDAP server becomes unresponsive for one up to several seconds. I tested it with a loop like:

while true; do time ldapsearch -h localhost -o ldif-wrap=no -D "cn=directory manager" -w secret -s sub -b "dc=datenkollektiv.net,dc=net" 'objectclass=*' dn > /dev/null; done

Interesting: if one of these queries hangs for some time, every other query from other hosts also hangs for exactly the same time.

We captured an strace along with the queries (output here: https://nextcloud.datenkollektiv.net/s/XEHmCyscEt4j9cn), so for a query which took about 5.489 seconds I see e.g. (the timestamp line precedes the timing output):

in query.log:

12:55:39.745018658
real    0m5.489s
user    0m0.024s
sys     0m0.040s
12:55:45.237697886
real    0m1.625s
user    0m0.004s
sys     0m0.008s
12:55:46.869872903

in strace.log:

[pid 8088] 12:55:39.739539 poll([{fd=435, events=POLLOUT}], 1, 1800000 <unfinished ...>
[pid 8058] 12:55:39.739573 <... write resumed> ) = 1 <0.000087>
[pid 8088] 12:55:39.739723 <... poll resumed> ) = 1 ([{fd=435, revents=POLLOUT}]) <0.000168>
[pid 8058] 12:55:39.739754 write(426, "dn: cn=ntU"..., 344 <unfinished ...>
[pid 8088] 12:55:39.739824 sendto(435, "0<\2\1\2d7\0043u"..., 62, 0, NULL, 0 <unfinished ...>
[pid 8058] 12:55:39.739858 <... write resumed> ) = 344 <0.000086>
[pid 8088] 12:55:39.739917 <... sendto resumed> ) = 62 <0.000077>
[pid 8058] 12:55:39.739944 write(426, "\n", 1 <unfinished ...>
[pid 8088] 12:55:39.740032 clock_gettime(CLOCK_MONOTONIC, <unfinished ...>
[pid 8058] 12:55:39.740061 <... write resumed> ) = 1 <0.000100>
[pid 8088] 12:55:39.740191 <... clock_gettime resumed> {1083541, 905025629}) = 0 <0.000149>
[pid 8058] 12:55:39.740223 write(426, "dn: cn=ntU"..., 352 <unfinished ...>
[pid 8088] 12:55:39.740296 poll([{fd=435, events=POLLOUT}], 1, 1800000 <unfinished ...>
[pid 8058] 12:55:39.740331 <... write resumed> ) = 352 <0.000090>
[pid 8088] 12:55:39.740391 <... poll resumed> ) = 1 ([{fd=435, revents=POLLOUT}]) <0.000079>
[pid 8058] 12:55:39.740423 write(426, "\n", 1 <unfinished ...>
[pid 13776] 12:55:39.740502 <... select resumed> ) = 0 (Timeout) <0.129345>
[pid 8088] 12:55:39.740528 sendto(435, "0\f\2\1\2e\7\n\1\0"..., 14, 0, NULL, 0 <unfinished ...>
[pid 8058] 12:55:39.740562 <... write resumed> ) = 1 <0.000123>
[pid 13776] 12:55:39.740705 gettimeofday( <unfinished ...>
[pid 8088] 12:55:39.740739 <... sendto resumed> ) = 14 <0.000191>
[pid 8058] 12:55:39.740760 write(426, "dn: cn=num"..., 355 <unfinished ...>
[pid 13776] 12:55:39.740952 <... gettimeofday resumed> {1534244139, 740898}, NULL) = 0 <0.000229>
[pid 8088] 12:55:39.740986 setsockopt(435, SOL_TCP, TCP_CORK, [0], 4 <unfinished ...>
[pid 8058] 12:55:39.741019 <... write resumed> ) = 355 <0.000240>
[pid 13776] 12:55:39.742191 poll([{fd=436, events=POLLIN|POLLPRI}], 1, 0 <unfinished ...>
[pid 8088] 12:55:39.742258 <... setsockopt resumed> ) = 0 <0.001252>
[pid 8058] 12:55:39.742289 write(426, "\n", 1 <unfinished ...>
[pid 13776] 12:55:39.743439 <... poll resumed> ) = 0 (Timeout) <0.001206>
[pid 8088] 12:55:39.743487 write(409, "\0", 1 <unfinished ...>
[pid 8058] 12:55:39.743529 <... write resumed> ) = 1 <0.001219>
[pid 13776] 12:55:39.748812 select(0, NULL, NULL, NULL, {0, 256000} <unfinished ...>
[pid 13772] 12:55:39.748893 <... select resumed> ) = 0 (Timeout) <0.258082>
[pid 13772] 12:55:39.748934 gettimeofday({1534244139, 748952}, NULL) = 0 <0.000019>
[pid 13772] 12:55:39.749001 poll([{fd=434, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout) <0.000022>
[pid 13772] 12:55:39.749072 select(0, NULL, NULL, NULL, {0, 512000} <unfinished ...>
[pid 8088] 12:55:39.749148 <... write resumed> ) = 1 <0.005636>
[pid 8058] 12:55:39.749189 write(426, "dn: cn=obj"..., 345 <unfinished ...>
[pid 8088] 12:55:39.749254 futex(0x564c333401dc, FUTEX_WAIT_PRIVATE, 6381055, NULL <unfinished ...>
[pid 8058] 12:55:39.749314 <... write resumed> ) = 345 <0.000103>
[pid 8058] 12:55:39.749360 write(426, "\n", 1) = 1 <0.000036>
[pid 8058] 12:55:39.749638 write(426, "dn: cn=own"..., 332) = 332 <0.000035>
[pid 8058] 12:55:39.749729 write(426, "\n", 1) = 1 <0.000033>
[pid 8058] 12:55:39.749956 write(426, "dn: cn=par"..., 337) = 337 <0.000036>
[pid 8058] 12:55:39.750049 write(426, "\n", 1) = 1 <0.000035>
[pid 8058] 12:55:39.750214 write(426, "dn: cn=see"..., 336) = 336 <0.000034>
[pid 8058] 12:55:39.750303 write(426, "\n", 1) = 1 <0.000032>
[pid 8058] 12:55:39.750446 write(426, "dn: cn=sn,"..., 361) = 361 <0.000034>
[pid 8058] 12:55:39.750532 write(426, "\n", 1) = 1 <0.000032>
[pid 8058] 12:55:39.750663 write(426, "dn: cn=tar"..., 352) = 352 <0.000033>
[pid 8058] 12:55:39.750747 write(426, "\n", 1) = 1 <0.000032>
[pid 8058] 12:55:39.750877 write(426, "dn: cn=tel"..., 389) = 389 <0.000033>
[pid 8058] 12:55:39.750963 write(426, "\n", 1) = 1 <0.000035>
[pid 8058] 12:55:39.751111 write(426, "dn: cn=uid"..., 328) = 328 <0.000034>
[pid 8058] 12:55:39.751199 write(426, "\n", 1) = 1 <0.000032>
....

But this is not a regular pattern. We still don't have a clue whether the performance problem is caused by Xen virtualization, by a dirsrv performance issue due to our configuration, or by replication (three masters replicating to each other). Replication doesn't seem to be the problem, though, since the issue still exists with replication disabled. Before moving to another (more powerful) machine we didn't have these problems (but at that time with replicas only between two servers).

The environment:

Debian Jessie 8.0
Kernel 4.9.0-0
Xen version: Xen 4.8.4-pre
Setup for Kolab Groupware

Thanks for any ideas.

Jan