What does the test filter look like?
Can we see a sanitized sample of the access log showing the SRCH and its matching RESULT line?
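For reference, a sanitized pair would look roughly like this (the conn/op numbers, base and filter here are made up):
[15/Nov/2016:11:40:02 -0800] conn=1234 op=5 SRCH base="dc=example,dc=com" scope=2 filter="(uid=someuser)" attrs=ALL
[15/Nov/2016:11:40:09 -0800] conn=1234 op=5 RESULT err=0 tag=101 nentries=1 etime=7
The etime on the RESULT line shows how long the server itself spent on the operation, which helps separate server-side slowness from network or client issues.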
If you are using SSL, check the available kernel entropy:
cat /proc/sys/kernel/random/entropy_avail
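If the value stays very low while the slowdowns happen, entropy starvation could be a factor. One simple way to keep an eye on it (just a suggestion) is:
watch -n 5 cat /proc/sys/kernel/random/entropy_avail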
Is replication in use? (And are there any large attribute values?)
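If you're not sure about the agreements, something like this should list them and their last update status (assuming Directory Manager credentials and a local server):
ldapsearch -x -H ldap://localhost -D "cn=directory manager" -W -b cn=config "(objectClass=nsds5replicationagreement)" nsds5replicaLastUpdateStatus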
You may want to run the "dbmon.sh" script to monitor DB cache and entry cache usage. Try to capture a few sample lines showing dbcachefree and userroot:ent (if the database with the problem is userRoot) while the searches are taking too long, for example:
INCR=1 HOST=m2.example.com BINDDN="cn=directory manager" BINDPW="password" VERBOSE=2 /usr/sbin/dbmon.sh
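If you'd like to keep the samples for later comparison, you could redirect the output to a file, for example (the filename is just a suggestion):
INCR=1 HOST=m2.example.com BINDDN="cn=directory manager" BINDPW="password" VERBOSE=2 /usr/sbin/dbmon.sh | tee /var/tmp/dbmon-$(date +%F).log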
Also review the ns-slapd errors log and the system messages log for any unusual activity.
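Assuming a default install, that would be something along the lines of:
grep -i -e err -e warn /var/log/dirsrv/slapd-<instance>/errors
less /var/log/messages
(adjust the instance name and paths to match your setup).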
What is the ns-slapd memory footprint from restart until the responses become slow?
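A low-impact way to track that over time (just a sketch, assuming a single ns-slapd instance) is to record the resident size periodically, e.g. from cron:
ps -o etimes=,rss=,vsz= -p $(pidof ns-slapd) >> /var/tmp/ns-slapd-mem.log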
Is there any unusually high disk I/O? (Or a "bad" SSD?)
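Something like the following (it needs the sysstat package) would show per-device utilization and wait times while the slowdown is happening:
iostat -x 5 3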
Thanks,
M.
On Tue, Nov 15, 2016 at 11:40 AM, Gordon Messmer <gordon.messmer@xxxxxxxxx> wrote:
I'm trying to track down a problem we are seeing on two relatively lightly used instances on CentOS 7 (and previously on CentOS 6, which is no longer in use). Our servers have 3624 entries according to last night's export (we export userRoot daily). There are currently just over 400 connections established to each server.
We have a local cron job that runs every 5 minutes and performs a simple query. If it takes more than 7 seconds to get an answer, the attempt is aborted and another query is issued. If three consecutive tests fail, the directory server is restarted.
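Roughly speaking, each check is equivalent to something like the following (the base and filter here are placeholders, not our real ones):
timeout 7 ldapsearch -x -H ldap://localhost -b "dc=example,dc=com" -s base "(objectClass=*)" dn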
The issue we're seeing is that the longer the system is up, the more often the checks fail. Restarting the directory server does not resolve the problem. Our servers have currently been up for 108 days, and the service is being restarted several times a day as a result of the failed checks. Only rebooting the systems makes the problem subside.
CPU utilization seems relatively high for such a small directory, but it's not constant. I tried to manually capture a bit of data with strace during a period when CPU use was spiking. During a capture of maybe two seconds, I saw that most of the time was spent in futex. usecs/call was fairly high for calls to futex and select, as detailed below.
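For what it's worth, the summary below was captured with something along these lines, attached for a couple of seconds and then interrupted with Ctrl-C:
strace -c -f -p $(pidof ns-slapd)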
Since restarting the service doesn't fix the problem, it seems most likely that this is an OS bug, but I'm hoping that the list can help me identify other useful data to track down the problem. Does anyone have any suggestions for what I can capture now, while I can sometimes observe the problem? If I reboot, it'll take months before I can get any new data.
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
74.61 4.505251 3590 1255 340 futex
17.65 1.065548 6660 160 select
4.41 0.266344 88781 3 2 restart_syscall
3.07 0.185566 50 3718 poll
0.10 0.006185 2 3610 sendto
0.09 0.005189 5189 1 fsync
0.04 0.002134 37 58 write
0.03 0.001618 27 61 setsockopt
0.00 0.000111 3 36 recvfrom
0.00 0.000078 1 57 read
0.00 0.000014 14 1 fstat
0.00 0.000003 2 2 accept
0.00 0.000003 1 6 fcntl
0.00 0.000002 1 2 getsockname
0.00 0.000001 1 2 close
------ ----------- ----------- --------- --------- ----------------
100.00 6.038047 8972 342 total