On 11/15/2016 12:58 PM, Marc Sauton wrote:
What is the test filter like?
Can we see a sanitized sample of the access log showing the matching SRCH and RESULT lines?
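For example, something along these lines can pull the slow RESULT lines
(and their conn/op numbers, which point back to the matching SRCH lines)
out of the access log; the instance path here is just a placeholder:
# RESULT lines with a non-zero etime; adjust the path to your instance name
grep ' RESULT ' /var/log/dirsrv/slapd-INSTANCE/access | grep -v 'etime=0'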
If using SSL, review the output of
cat /proc/sys/kernel/random/entropy_avail
Is replication configured? (And are there any large attribute values?)
You may want to run the "dbmon.sh" script to monitor cache usage for
the db cache and entry cache. Try to capture a few sample lines showing
dbcachefree and userroot:ent (if the db with the problem is userroot)
while the searches are becoming too long, for example:
INCR=1 HOST=m2.example.com BINDDN="cn=directory manager" BINDPW="password" VERBOSE=2 /usr/sbin/dbmon.sh
Also review the ns-slapd errors log and the system messages log file for
any unusual activity.
What is the ns-slapd memory footprint from restart until responses
become slow?
Any unusually high disk I/O? (Or a "bad" SSD?)
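For example, both can be sampled periodically with something like this
(the pidof lookup and the 5-second interval are just examples):
# resident set size of ns-slapd in KB
ps -o rss= -p $(pidof ns-slapd)
# per-device disk utilization, three 5-second samples
iostat -dx 5 3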
It is also useful to get a few stack traces, which will give us detailed
information about what the server is doing. For example, if you can
"catch" the server while it is misbehaving, get a stack trace every
second for 10 seconds.
http://www.port389.org/docs/389ds/FAQ/faq.html#debugging-hangs
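Assuming gdb (which provides gstack) and the debuginfo packages are
installed, a rough sketch would be:
# one stack trace per second for 10 seconds, written to /tmp
for i in $(seq 1 10); do
    gstack $(pidof ns-slapd) > /tmp/ns-slapd.stack.$i
    sleep 1
done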
Thanks,
M.
On Tue, Nov 15, 2016 at 11:40 AM, Gordon Messmer
<gordon.messmer@xxxxxxxxx> wrote:
I'm trying to track down a problem we are seeing on two relatively
lightly used instances on CentOS 7 (and previously on CentOS 6,
which is no longer in use). Our servers have 3624 entries
according to last night's export (we export userRoot daily).
There are currently just over 400 connections established to each
server.
We have a local cron job that runs every 5 minutes and performs a
simple query. If it takes more than 7 seconds to get an answer,
the attempt is aborted and another query is issued. If three
consecutive tests fail, the directory server is restarted.
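(For context, the check is essentially something like the following; the
base DN and filter here are placeholders, not the real ones we use.)
# abort the client if no answer has come back within 7 seconds
timeout 7 ldapsearch -x -H ldap://localhost -b "dc=example,dc=com" "(uid=healthcheck)" dn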
The issue we're seeing is that the longer the system is up, the
more often the checks fail. Restarting the directory does not
resolve the problem. Our servers have currently been up for 108
days, and the checks are now restarting the service several times a
day. Only if we reboot the systems does the problem subside.
CPU utilization seems relatively high for such a small directory,
but it's not constant. I tried to manually capture a bit of data
with strace during a period when CPU use was spiking. During a
capture of maybe two seconds, I saw that most of the CPU time was
spent in futex. usecs/call was fairly high for calls to futex and
select, as detailed below.
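(For reference, the per-syscall summary below is the kind of output you
get from attaching with something like the following; my exact
invocation may have differed.)
# attach to the running server, Ctrl-C after a couple of seconds to
# print the per-syscall time summary
strace -c -f -p $(pidof ns-slapd)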
Since restarting the service doesn't fix the problem, it seems
most likely that this is an OS bug, but I'm hoping that the list
can help me identify other useful data to track down the problem.
Does anyone have any suggestions for what I can capture now, while
I can sometimes observe the problem? If I reboot, it'll take
months before I can get any new data.
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 74.61    4.505251        3590      1255       340 futex
 17.65    1.065548        6660       160           select
  4.41    0.266344       88781         3         2 restart_syscall
  3.07    0.185566          50      3718           poll
  0.10    0.006185           2      3610           sendto
  0.09    0.005189        5189         1           fsync
  0.04    0.002134          37        58           write
  0.03    0.001618          27        61           setsockopt
  0.00    0.000111           3        36           recvfrom
  0.00    0.000078           1        57           read
  0.00    0.000014          14         1           fstat
  0.00    0.000003           2         2           accept
  0.00    0.000003           1         6           fcntl
  0.00    0.000002           1         2           getsockname
  0.00    0.000001           1         2           close
------ ----------- ----------- --------- --------- ----------------
100.00    6.038047                  8972       342 total
_______________________________________________
389-users mailing list -- 389-users@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to 389-users-leave@xxxxxxxxxxxxxxxxxxxxxxx