Yeah, we're using SSL and TLS so ethereal/tcpdump isn't going to yield much
info. The process hung again and strace didn't provide too much information
other than this:

futex(0x20b9260, FUTEX_WAIT, 2, NULL)

Would that give you a place to start looking?

-richard

On 2/19/08 4:04 PM, "Rich Megginson" <rmeggins@xxxxxxxxxx> wrote:

> Richard Hesse wrote:
>> Not much new to report. The server hung again and the only thing in the
>> error log with connection tracing is this:
>>
>> [18/Feb/2008:13:14:03 +0000] - PR_Write(41818752) Netscape Portable Runtime
>> error -5961 (TCP connection reset by peer.)
>> [18/Feb/2008:13:14:03 +0000] - ber_flush failed, error 104 (Connection reset
>> by peer)
>>
>> Which doesn't look like much.
> Well, it tells me that the server was attempting to write to a socket,
> and got an error. -5961 is PR_CONNECT_RESET_ERROR which can occur if
> the system call returns either EPIPE or ECONNRESET. And error 104 is
> indeed ECONNRESET.
> /usr/include/asm-generic/errno.h:#define ECONNRESET 104
> /* Connection reset by peer */
>
> AFAICT, this can happen if the client shuts down the socket (for any
> number of reasons) but the server is still attempting to send data. In
> this case, the client will respond with a TCP RST. I'm not sure how or
> why this could happen. I'm open to other causes for ECONNRESET.
> What would be really, really interesting is if we could narrow this down
> to a particular client application and run ethereal on the connection.
>
> Are you using SSL?
>> As for network tuning, it's already been done.
>>
>> Max descriptors is set to 32768.
>>
>> Are there any gdb commands I can run while the server is in a hung state?
>>
> Sure. For whatever the cause of the ECONNRESET, it should not cause the
> server to hang, and it would be interesting to find out what it's
> doing. You'll have to install the fedora-ds-base-debuginfo package.
> Attach to the process -
>
> gdb /usr/sbin/ns-slapd <pid of process>
>
> Then, dump the thread stacks -
>
> (gdb) thread apply all bt
>
> If you want the output to go to a file, redirect gdb logging to a file
> first before doing the thread apply e.g.
>
> (gdb) set logging file stack.txt
> (gdb) set logging on
>
>> I'm going to try running strace while the process is working, and hope for a
>> hang. Maybe that will give us some more info.
>>
>> -richard
>>
>> On 2/19/08 10:23 AM, "Rich Megginson" <rmeggins@xxxxxxxxxx> wrote:
>>
>>> Richard Hesse wrote:
>>>
>>>> Yes, every host (except the ldap hosts) runs nscd. The ldap servers are not
>>>> configured to use directory data for anything.
>>>>
>>> I just don't know. I've not seen this before. I suppose you could try
>>> checking your kernel TCP/IP settings, and increasing the number of file
>>> descriptors used -
>>> http://directory.fedoraproject.org/wiki/Performance_Tuning
>>>
>>>> -richard
>>>>
>>>> On 2/15/08 2:11 PM, "Rich Megginson" <rmeggins@xxxxxxxxxx> wrote:
>>>>
>>>>> Richard Hesse wrote:
>>>>>
>>>>>> nsswitch posix users/groups,
>>>>>>
>>>>> Are you using nscd?
>>>>>
>>>>>> ssh, sudo, puppet (config management), and
>>>>>> internally written applications.
>>>>>>
>>>>>> -richard
>>>>>>
>>>>>> On 2/15/08 12:53 PM, "Rich Megginson" <rmeggins@xxxxxxxxxx> wrote:
>>>>>>
>>>>>>> What is the application which is generating this load?
--
Fedora-directory-users mailing list
Fedora-directory-users@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/fedora-directory-users
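
For anyone repeating the "error 104 is indeed ECONNRESET" lookup from the thread without grepping errno.h, here is a minimal Python sketch. The helper names errno_from_log and describe_errno are made up for illustration and are not part of Fedora DS or NSPR. It only handles the plain "error <n>" form seen in the ber_flush line, it deliberately skips negative NSPR codes such as -5961, and the number-to-name mapping is whatever the local platform defines (104 is the Linux value quoted above).

```python
import errno
import os
import re

# Matches the "error <n>" fragment in lines such as
# "ber_flush failed, error 104 (Connection reset by peer)".
# Negative NSPR codes like "error -5961" do not match, on purpose.
_ERRNO_RE = re.compile(r"\berror (\d+)\b")


def errno_from_log(line):
    """Return the first non-negative error number in a log line, or None."""
    match = _ERRNO_RE.search(line)
    return int(match.group(1)) if match else None


def describe_errno(code):
    """Return (symbolic name, message) for an errno value on this platform."""
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


if __name__ == "__main__":
    line = ("[18/Feb/2008:13:14:03 +0000] - ber_flush failed, "
            "error 104 (Connection reset by peer)")
    code = errno_from_log(line)
    if code is not None:
        name, message = describe_errno(code)
        print(code, name, message)
```

The symbolic name is the useful part: it is what you would search for in the server or NSPR sources, whereas the number differs between operating systems.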