The MRDs were up but not responding, due to full log filesystems.
The xproxies would time out on an ldapsearch() call, and then
try to reconnect. They would then hang in an ldapbind() call.
Chris
Shufei Wen wrote:
Chris,
I might have missed your point. What is the conclusion on the root cause
of the problem? Were the MRDs up or down at the time?
First you said the xproxy processes hung when the MRDs were shut down, and
later you indicated that the MRD returns ACKs?
Thanks,
Shufei
Chris Eastlund wrote:
Steve,
The problem was with the MRD, which filled the log filesystem, which
halted the MRD. Both MRD machines filled their logs.
The proxy processes were hung waiting on the MRD. When the MRD was
shut down, the proxy processes all recorded an MRD bind failure and a
lot of sessions (5000) with sl=63000 or so. And then logins started
again.
The question is why any proxy process kept trying an MRD query
for 63,000 seconds, or 17.5 hours.
The ldap search call has a timeout (default for m2k of 90 seconds)
and gets tried twice by xproxy, and once (for timeout failures)
in the libstdxdir library. This should take 3 minutes, max.
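As a rough check of that arithmetic, here is a minimal sketch in Python.
The constant and function names are illustrative only and do not come
from xproxy or libstdxdir; the values are the ones quoted above, and it
assumes the library makes one attempt per xproxy attempt.

```python
# Illustrative names; values are the ones quoted in this message.
SEARCH_TIMEOUT_S = 90    # default m2k ldap search timeout
XPROXY_ATTEMPTS = 2      # xproxy tries the search twice
LIBRARY_ATTEMPTS = 1     # assumed: one try per xproxy attempt in the library
OBSERVED_HANG_S = 63_000 # approximate session length (sl) seen in the logs


def worst_case_search_seconds(timeout_s=SEARCH_TIMEOUT_S,
                              xproxy_attempts=XPROXY_ATTEMPTS,
                              library_attempts=LIBRARY_ATTEMPTS):
    """Upper bound on time spent in the search path if every try times out."""
    return timeout_s * xproxy_attempts * library_attempts


if __name__ == "__main__":
    bound = worst_case_search_seconds()
    print("expected worst case:", bound / 60, "minutes")
    print("observed hang:", OBSERVED_HANG_S / 3600, "hours")
    print("observed / expected:", OBSERVED_HANG_S / bound)
```

The gap between the two figures is the point: the search timeout alone
cannot account for a hang of that length.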
After the search fails, the proxy will attempt to close and reopen the
session. That's where I think things hang. The MRD system returns
an ACK, so the connection seems up and the connection timeout doesn't
apply.
When ldapsearch is run against such a listener, truss shows:
    connect()                    # returns EINPROGRESS
    pollsys()
    time()
    write(4,....)                # seems to be the login sequence
    pollsys(0xFFBFF0E8,5,0,0)
I think this means a poll() call with no timeout. I can't find a
pollsys() man page, as it's a Solaris internal call.
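To illustrate the suspected failure mode, here is a minimal sketch in
Python, not the xproxy or openldap code. It assumes a TCP listener that
completes the handshake but never services the connection, standing in
for the stalled MRD; the function names and the "bind request" payload
are made up for the example. With a finite timeout the poll returns
empty-handed; with None it would block indefinitely.

```python
import select
import socket


def wait_for_reply(sock, timeout_ms):
    """Return True if sock becomes readable within timeout_ms.

    Passing timeout_ms=None blocks until data arrives, which is the
    behaviour suspected in the trace above.
    """
    poller = select.poll()
    poller.register(sock, select.POLLIN)
    return bool(poller.poll(timeout_ms))


def demo(timeout_ms=200):
    # Stand-in for the stalled server: listen, but never accept() or reply.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    # The kernel completes the handshake, so the connect succeeds.
    client = socket.create_connection(listener.getsockname(), timeout=2)
    try:
        client.sendall(b"bind request")  # hypothetical payload
        return wait_for_reply(client, timeout_ms)
    finally:
        client.close()
        listener.close()


if __name__ == "__main__":
    print("reply received:", demo())
```

The connect and the write both succeed here, so a connection timeout
never fires; only a timeout on the wait for the reply bounds the call.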
There are web pages noting this problem from about 2002, and our
version of openldap is older than that.
Chris
Steve Prisco wrote:
including George
Steve
------------------------------------------------------------------------
*From:* Al Robinson [mailto:awr@xxxxxxxxxxxxxxxxxxx]
*Sent:* Thursday, January 08, 2009 3:39 PM
*To:* 'M2K Development Team'
*Cc:* Mail Testers; PRISCO, STEVE (ATTLABS)
*Subject:* PXC02 OS patch
Patrick was running a load test on lzfwpxc02 last night and it was
running fine until it wasn't.
It currently thinks all pop proxy processes on the blpop interface are
busy. At least that's what the logs are saying and the response from
mailman when a new connection is attempted.
The offered load was 2161 simultaneous sessions. For most of the night,
mailman reported hiwater at approximately 3500/5000. At 4:32 it reported
a hiwater of 4439/5000 followed by an XSFLOOD at 4:34. At this point, it
doesn't seem to respond any more. Subsequent XSTAT logs with the load
still active report hiwater of 0/5000 and 500k+ xnconns. The latest
XSTAT showed a hiwater of 0/5000 with 347 xnconns.
We need development to look at the server and determine if it is a
mailman problem or a problem with the OS patch.
I assume a core file will be needed, but we haven't touched the system
yet.
Al Robinson