On Wed, 2016-11-16 at 16:00 -0800, Gordon Messmer wrote:
> On 11/16/2016 01:23 PM, William Brown wrote:
> > What's your ioblocktimeout set to?
>
> nsslapd-ioblocktimeout: 1800000

Hmm, that's the default, but it's quite high; you could lower it if you
wanted. See:

https://access.redhat.com/documentation/en-US/Red_Hat_Directory_Server/10.1/html/Configuration_Command_and_File_Reference/Core_Server_Configuration_Reference.html#cnconfig-nsslapd_ioblocktimeout_IO_Block_Time_Out

> > How many connections are idle on the server?
>
> How would I check?

cn=monitor should show you if you search it.
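For example, a base search of cn=monitor returns the current and total
connection counts plus one "connection" line per open connection, which
you can use to spot idle ones. A sketch (the bind DN is an example,
adjust for your deployment):

# ldapsearch -x -D "cn=Directory Manager" -W -b "cn=monitor" -s base \
      currentconnections totalconnections connection

And if you did want to lower ioblocktimeout, something like this should
do it (the value is in milliseconds; 300000 here is just an example):

# ldapmodify -x -D "cn=Directory Manager" -W <<EOF
dn: cn=config
changetype: modify
replace: nsslapd-ioblocktimeout
nsslapd-ioblocktimeout: 300000
EOF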
> > Are you seeing OOM behaviour or memory not being released to the OS?
>
> No, the systems use very little memory:
>
> # free
>               total       used       free     shared  buff/cache  available
> Mem:        1883872     148932      72752      97156     1662188    1429468
> Swap:       2097148      65064    2032084
>
> No OOM actions are recorded.
>
> > What specs are your servers, ie cpu and memory? Is it ECC memory?
>
> These are virtual machines with 4 allocated cores and 2GB of RAM. The
> host systems are Intel(R) Xeon(R) CPU E5-2620 v3 with 64GB of ECC RAM.
> The two VMs running 389-ds are on different physical hosts, but have
> the same problems at roughly the same frequency, at roughly the same
> uptime.
>
> > What kind of disk are they on? Are there issues in dmesg?
>
> One physical system has a RAID10 mdraid array of SAS disks. The other
> has a RAID1 mdraid array of SAS disks. No errors have been recorded.
>
> The virtual machines are LVM-backed with standard (not sparse) LVs.

That all sounds great, no issues there.

> > Have you configured system activity reporter (sar), and have output
> > from the same time of disk io, memory usage, cpu etc?
>
> I believe that's set up by default, yes.
>
> https://paste.fedoraproject.org/483468/93401501/

It looks like there is some activity just as it stops:

12:00:01 AM     proc/s   cswch/s
12:10:01 AM       4.91    393.92
12:20:01 AM       4.93    385.64
12:30:01 AM       2.79    312.10
12:40:01 AM       1.57    320.06
12:50:01 AM       0.84    282.43
01:00:01 AM       0.23    217.87
01:10:01 AM       0.38    276.50

12:00:01 AM  pgpgin/s pgpgout/s   fault/s  majflt/s  pgfree/s pgscank/s pgscand/s pgsteal/s    %vmeff
12:10:01 AM      0.27     51.65   2192.75      0.00   1020.59      0.00      0.00      0.00      0.00
12:20:01 AM      0.32     59.12   2193.29      0.00   1018.92      0.00      0.00      0.00      0.00
12:30:01 AM      0.04    124.12   1242.81      0.04    559.39      0.00      0.00      0.00      0.00
12:40:01 AM      1.77     23.27    685.10      0.00    324.02      0.00      0.00      0.00      0.00
12:50:01 AM      1.05     89.72    866.32      0.01    490.28      0.00      0.00      0.00      0.00

Then at 3:30 there is some more paging:

03:30:01 AM     29.34    171.17    472.71      0.11    289.21      0.00      0.00      0.00      0.00

There appears to be an increase in disk activity here too:

12:00:01 AM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
12:10:01 AM  dev252-0      7.86      0.55    103.30     13.21      8.90   1132.98     15.29     12.02
12:10:01 AM  dev253-0      0.05      0.52      0.00      9.75      0.01    238.28     20.59      0.11
12:10:01 AM  dev253-1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:10:01 AM  dev253-2      8.82      0.03    103.30     11.72      9.06   1027.71     13.63     12.01
12:20:01 AM  dev252-0      9.48      0.64    118.24     12.55     12.15   1281.81     15.56     14.75
12:20:01 AM  dev253-0      0.10      0.64      0.16      8.00      0.03    251.33     25.82      0.26
12:20:01 AM  dev253-1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:20:01 AM  dev253-2     10.50      0.00    118.08     11.25     12.30   1171.42     13.96     14.65
12:30:01 AM  dev252-0      4.39      0.08    248.25     56.57      5.20   1183.77     27.41     12.03
12:30:01 AM  dev253-0      0.20      0.08      3.30     16.87      0.04    199.68     31.60      0.63
12:30:01 AM  dev253-1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:30:01 AM  dev253-2      5.27      0.00    244.94     46.48      5.37   1019.43     22.10     11.65

But the same amount of activity is mirrored at 8:40pm, so it's probably
okay. In general, it doesn't look too busy, and the memory usage seems
fine. Does any other process run at 12:30, e.g. backups?

> The DS stopped responding at about 12:30AM in this readout (system
> time is in UTC).
>
> > What's your sysctl setup like?
>
> Standard for a CentOS 7 system, with these additions:
>
> net.ipv4.ip_local_port_range = 1024 65000
> net.ipv4.tcp_keepalive_time = 600
>
> > Have you increased file descriptors for Directory Server?
>
> I thought I had, but it looks like I haven't:
>
> # cat /proc/sys/fs/file-max
> 185059
> # grep nofile /etc/security/limits.conf
> #        - nofile - max number of open file descriptors
> # grep ulimit /etc/profile
>
> nsslapd-maxdescriptors: 1024
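That would be worth raising. A sketch of one way to do it on CentOS 7,
assuming your instances run under the dirsrv@ systemd template unit
(the limit values here are examples, not recommendations):

# mkdir -p /etc/systemd/system/dirsrv@.service.d
# cat > /etc/systemd/system/dirsrv@.service.d/limits.conf <<EOF
[Service]
LimitNOFILE=8192
EOF
# systemctl daemon-reload

# ldapmodify -x -D "cn=Directory Manager" -W <<EOF
dn: cn=config
changetype: modify
replace: nsslapd-maxdescriptors
nsslapd-maxdescriptors: 8192
EOF

Both take effect after a restart of the instance.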
> > Have you lowered the TCP close wait timeout?
>
> No.
>
> > When I hear of problems like this, I'm always inclined to
> > investigate the host first, as there is a surprising amount that can
> > affect DS from the host.
>
> I suspect so, too, since the problem correlates with the system
> uptime, not how long the daemon has been running. But beyond that I'm
> not sure how to track this down further.

It may appear to be uptime, but may not. It would be worth checking
ss/netstat when the issue next occurs to see how many sockets are
blocked in CLOSE_WAIT, because resource exhaustion there can make it
appear as though the server is slowing when really the threads are
starved of work. I'd need to study your pstack outputs to be sure.
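A quick way to count them when it happens (389/636 are the standard
LDAP/LDAPS ports; adjust if your instances listen elsewhere):

# ss -tan '( sport = :389 or sport = :636 )' | grep -c CLOSE-WAIT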
Have you done any other Directory Server tuning (threads, memory
cache)? I think the out-of-the-box defaults are conservative, but in
this case they probably aren't too bad for you.
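If you want to check what you currently have, the relevant attributes
live under cn=config (names as in the RHDS configuration reference; the
bind DN is again an example):

# ldapsearch -x -D "cn=Directory Manager" -W -b "cn=config" -s base \
      nsslapd-threadnumber
# ldapsearch -x -D "cn=Directory Manager" -W \
      -b "cn=ldbm database,cn=plugins,cn=config" \
      nsslapd-dbcachesize nsslapd-cachememsize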
Might also be worth checking your VM hosts at the time the issues
occur, to see if there's a spike on other systems or storage.

-- 
Sincerely,

William Brown

Software Engineer
Red Hat, Brisbane