On Wed, 2016-11-16 at 16:00 -0800, Gordon Messmer wrote:
> On 11/16/2016 01:23 PM, William Brown wrote:
> > What's your ioblocktimeout set to?
>
> nsslapd-ioblocktimeout: 1800000

Hmm, that's the default, but it's quite high; you could lower it if you
wanted. See:

https://access.redhat.com/documentation/en-US/Red_Hat_Directory_Server/10.1/html/Configuration_Command_and_File_Reference/Core_Server_Configuration_Reference.html#cnconfig-nsslapd_ioblocktimeout_IO_Block_Time_Out

> > How many connections are idle on the server?
>
> How would I check?

cn=monitor should show you if you search it.
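For example, a base search of cn=monitor returns the current and total
connection counts plus one "connection" line per open connection, which
you can use to spot idle ones. A sketch (the bind DN is an example,
adjust for your deployment):

# ldapsearch -x -D "cn=Directory Manager" -W -b "cn=monitor" -s base \
      currentconnections totalconnections connection

And if you did want to lower ioblocktimeout, something like this should
do it (the value is in milliseconds; 300000 here is just an example):

# ldapmodify -x -D "cn=Directory Manager" -W <<EOF
dn: cn=config
changetype: modify
replace: nsslapd-ioblocktimeout
nsslapd-ioblocktimeout: 300000
EOF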
> > Are you seeing OOM behaviour or memory not being released to the OS?
>
> No, the systems use very little memory:
>
> # free
>               total       used       free     shared  buff/cache  available
> Mem:        1883872     148932      72752      97156     1662188    1429468
> Swap:       2097148      65064    2032084
>
> No OOM actions are recorded.
>
> > What specs are your servers, ie cpu and memory? Is it ECC memory?
>
> These are virtual machines with 4 allocated cores and 2GB of RAM. The
> host systems are Intel(R) Xeon(R) CPU E5-2620 v3 with 64GB of ECC RAM.
> The two VMs running 389-ds are on different physical hosts, but have
> the same problems at roughly the same frequency, at roughly the same
> uptime.
>
> > What kind of disk are they on? Are there issues in dmesg?
>
> One physical system has a RAID10 mdraid array of SAS disks. The other
> has a RAID1 mdraid array of SAS disks. No errors have been recorded.
>
> The virtual machines are LVM-backed with standard (not sparse) LVs.

That all sounds great, no issues there.

> > Have you configured system activity reporter (sar), and have output
> > from the same time of disk io, memory usage, cpu etc?
>
> I believe that's set up by default, yes.
>
> https://paste.fedoraproject.org/483468/93401501/

It looks like there is some activity just as it stops:

12:00:01 AM     proc/s   cswch/s
12:10:01 AM       4.91    393.92
12:20:01 AM       4.93    385.64
12:30:01 AM       2.79    312.10
12:40:01 AM       1.57    320.06
12:50:01 AM       0.84    282.43
01:00:01 AM       0.23    217.87
01:10:01 AM       0.38    276.50

12:00:01 AM  pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
12:10:01 AM      0.27     51.65   2192.75      0.00   1020.59      0.00      0.00      0.00      0.00
12:20:01 AM      0.32     59.12   2193.29      0.00   1018.92      0.00      0.00      0.00      0.00
12:30:01 AM      0.04    124.12   1242.81      0.04    559.39      0.00      0.00      0.00      0.00
12:40:01 AM      1.77     23.27    685.10      0.00    324.02      0.00      0.00      0.00      0.00
12:50:01 AM      1.05     89.72    866.32      0.01    490.28      0.00      0.00      0.00      0.00

Then at 3:30 there is some more paging:

03:30:01 AM     29.34    171.17    472.71      0.11    289.21      0.00      0.00      0.00      0.00

There appears to be an increase in disk activity here too:

12:00:01 AM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
12:10:01 AM  dev252-0      7.86      0.55    103.30     13.21      8.90   1132.98     15.29     12.02
12:10:01 AM  dev253-0      0.05      0.52      0.00      9.75      0.01    238.28     20.59      0.11
12:10:01 AM  dev253-1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:10:01 AM  dev253-2      8.82      0.03    103.30     11.72      9.06   1027.71     13.63     12.01
12:20:01 AM  dev252-0      9.48      0.64    118.24     12.55     12.15   1281.81     15.56     14.75
12:20:01 AM  dev253-0      0.10      0.64      0.16      8.00      0.03    251.33     25.82      0.26
12:20:01 AM  dev253-1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:20:01 AM  dev253-2     10.50      0.00    118.08     11.25     12.30   1171.42     13.96     14.65
12:30:01 AM  dev252-0      4.39      0.08    248.25     56.57      5.20   1183.77     27.41     12.03
12:30:01 AM  dev253-0      0.20      0.08      3.30     16.87      0.04    199.68     31.60      0.63
12:30:01 AM  dev253-1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:30:01 AM  dev253-2      5.27      0.00    244.94     46.48      5.37   1019.43     22.10     11.65

But the same amount of activity is mirrored at 8:40pm, so it's probably
okay. In general, it doesn't look too busy, and the memory usage seems
fine. Does any other process run at 12:30, e.g. backups?

> The DS stopped responding at about 12:30AM in this readout (system
> time is in UTC).
>
> > What's your sysctl setup like?
>
> Standard for a CentOS 7 system, with these additions:
>
> net.ipv4.ip_local_port_range = 1024 65000
> net.ipv4.tcp_keepalive_time = 600
>
> > Have you increased file descriptors for Directory Server?
>
> I thought I had, but it looks like I haven't:
>
> # cat /proc/sys/fs/file-max
> 185059
> # grep nofile /etc/security/limits.conf
> #        - nofile - max number of open file descriptors
> # grep ulimit /etc/profile
>
> nsslapd-maxdescriptors: 1024
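That would be worth raising. A sketch of one way to do it on CentOS 7,
assuming your instances run under the dirsrv@ systemd template unit
(the limit values here are examples, not recommendations):

# mkdir -p /etc/systemd/system/dirsrv@.service.d
# cat > /etc/systemd/system/dirsrv@.service.d/limits.conf <<EOF
[Service]
LimitNOFILE=8192
EOF
# systemctl daemon-reload

# ldapmodify -x -D "cn=Directory Manager" -W <<EOF
dn: cn=config
changetype: modify
replace: nsslapd-maxdescriptors
nsslapd-maxdescriptors: 8192
EOF

Both take effect after a restart of the instance.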
> > Have you lowered the TCP close wait timeout?
>
> No.
>
> > When I hear of problems like this, I'm always inclined to
> > investigate the host first, as there is a surprising amount that can
> > affect DS from the host.
>
> I suspect so, too, since the problem correlates with the system
> uptime, not how long the daemon has been running. But beyond that I'm
> not sure how to track this down further.

It may appear to be uptime, but may not. It would be worth checking
ss/netstat when the issue next occurs to see how many sockets are
blocked in CLOSE_WAIT, because resource exhaustion there can make it
appear as though the server is slowing when really the threads are
starved of work. I'd need to study your pstack outputs to be sure.
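A quick way to count them when it happens (389/636 are the standard
LDAP/LDAPS ports; adjust if your instances listen elsewhere):

# ss -tan '( sport = :389 or sport = :636 )' | grep -c CLOSE-WAIT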
Have you done any other Directory Server tuning (threads, memory
cache)? I think the out-of-the-box defaults are conservative, but in
this case they probably aren't too bad for you.
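If you want to check what you currently have, the relevant attributes
live under cn=config (names as in the RHDS configuration reference; the
bind DN is again an example):

# ldapsearch -x -D "cn=Directory Manager" -W -b "cn=config" -s base \
      nsslapd-threadnumber
# ldapsearch -x -D "cn=Directory Manager" -W \
      -b "cn=ldbm database,cn=plugins,cn=config" \
      nsslapd-dbcachesize nsslapd-cachememsize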
Might also be worth checking your VM hosts at the time the issues
occur, to see if there's a spike on other systems or storage.

-- 
Sincerely,

William Brown

Software Engineer
Red Hat, Brisbane