process disappearing, replication failing



On 02/02/2011 08:48 PM, Andrew Kerr wrote:
> The one replica still running hasn't crashed since it stopped getting traffic, but it is still getting replicated to.  Too early to determine if that has any relevance.
> My single master is still on, and has been stable.  It gets the same portion and type of end-user traffic, plus some.  So something still makes me think this has something to do with the master sending bad/mishandled data to the replicas, or something along those lines.  Not based on anything other than educated guesswork though.
> The lines in the error log don't have anything unusual in them.  Same thing each time it dies.  The referrals error seems to be ongoing; I haven't looked into that yet, but assume it isn't related.
> [02/Feb/2011:09:52:26 -0500] NSMMReplicationPlugin - multimaster_be_state_change: replica dc=simplewire,dc=com is coming online; enabling replication
> [02/Feb/2011:09:52:26 -0500] NSMMReplicationPlugin - repl_set_mtn_referrals: could not set referrals for replica dc=simplewire,dc=com: 32
> [02/Feb/2011:09:52:26 -0500] NSMMReplicationPlugin - repl_set_mtn_referrals: could not set referrals for replica dc=simplewire,dc=com: 32
> [02/Feb/2011:10:05:38 -0500] NSMMReplicationPlugin - repl_set_mtn_referrals: could not set referrals for replica dc=simplewire,dc=com: 32
> 	389-Directory/ B2010.350.198
> 	vdc-prd-ldap-001.simplewire.com:389 (/etc/dirsrv/slapd-vdc-prd-ldap-001)
> [02/Feb/2011:11:50:36 -0500] - 389-Directory/ B2010.350.198 starting up
> [02/Feb/2011:11:50:36 -0500] - Detected Disorderly Shutdown last time Directory Server was running, recovering database.
> [02/Feb/2011:11:50:36 -0500] - slapd started.  Listening on All Interfaces port 389 for LDAP requests
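The restart pattern in the log excerpt above can be spotted mechanically. The following is only a rough sketch, assuming the errors log keeps the timestamp format shown in the excerpt; the function name and the sample list are made up for illustration and are not part of 389 Directory Server.

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the quoted errors log, e.g.
# [02/Feb/2011:11:50:36 -0500]
TS = re.compile(r"^\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")


def find_disorderly_shutdowns(lines):
    """Return the timestamps of startups that followed an unclean exit."""
    hits = []
    for line in lines:
        # Tolerate mail-style quote markers in front of the log line.
        text = line.lstrip("> \t")
        if "Detected Disorderly Shutdown" not in text:
            continue
        match = TS.match(text)
        if match:
            hits.append(datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z"))
    return hits


# Sample lines copied from the excerpt quoted in this message.
SAMPLE = [
    "> [02/Feb/2011:11:50:36 -0500] - 389-Directory/ B2010.350.198 starting up",
    "> [02/Feb/2011:11:50:36 -0500] - Detected Disorderly Shutdown last time "
    "Directory Server was running, recovering database.",
    "> [02/Feb/2011:11:50:36 -0500] - slapd started.  Listening on All "
    "Interfaces port 389 for LDAP requests",
]

for stamp in find_disorderly_shutdowns(SAMPLE):
    print("unclean shutdown detected at startup:", stamp.isoformat())
```

Run against a real errors log (for example by passing an open file object instead of SAMPLE), each hit marks a startup where the previous process did not exit cleanly.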

This sure looks like a crash.  If you are able, I would appreciate it if 
you could follow the steps to enable core files at 
> -----Original Message-----
> From: Rich Megginson [mailto:rmeggins at redhat.com]
> Sent: Wednesday, February 02, 2011 1:14 PM
> To: Andrew Kerr
> Cc: General discussion list for the 389 Directory server project.
> Subject: Re: process disappearing, replication failing
> On 02/02/2011 10:37 AM, Andrew Kerr wrote:
>> I reinstalled the two replicas that were saying "No such object" and now they work - same exact cut-and-paste process that didn't work before.
>> The good news is that I am back up and running (phew, what a morning!).
>> I left one replica on, disabled behind our load balancer, so it is getting replicated to but no production traffic - with the intent of helping figure out what the problem is before others find it.  I'll get a bug report filed since this seems like something new.
>> FYI, these are all virtual machines (on a mix of vmware, kvm, and xen depending on the datacenter) and have very minimal installs, running no other apps, with no selinux or anything either.
> Is the server still crashing?  If so, please post the last few
> lines of the errors log before the crash.
> See also here:
> http://directory.fedoraproject.org/wiki/FAQ#Debugging_Crashes
>> -----Original Message-----
>> From: 389-users-bounces at lists.fedoraproject.org [mailto:389-users-bounces at lists.fedoraproject.org] On Behalf Of Andrew Kerr
>> Sent: Wednesday, February 02, 2011 11:44 AM
>> To: Rich Megginson; General discussion list for the 389 Directory server project.
>> Subject: Re: process disappearing, replication failing
>> The process is completely gone.  Doesn't show up in ps, and the pid referenced in the pid file doesn't exist.
>> I do have a lot of lines like this in my access log:
>> [02/Feb/2011:10:05:06 -0500] conn=4479 op=-1 fd=161 closed - B1
>> On the positive side, I was able to get some of the replicas downgraded to 1.2.4.  I had been deleting the server from the site under NetscapeRoot and re-registering, but I hadn't re-created the replication agreement; I was just re-initializing the existing one.  Deleting it and creating a new one got rid of the error: "Unable to parse the response to the startReplication extended operation.  Replication is aborting".
>> Four of the six systems I put back to 1.2.4 (by removing the RPMs and blowing away all dirsrv relics left behind, reinstalling, and re-configuring).  Two of them I initialize and can see the directory, but when I do an ldapsearch remotely I get "result: 32 No such object".  More random/unpredictable behavior...
>> -----Original Message-----
>> From: Rich Megginson [mailto:rmeggins at redhat.com]
>> Sent: Wednesday, February 02, 2011 11:10 AM
>> To: General discussion list for the 389 Directory server project.
>> Cc: Andrew Kerr
>> Subject: Re: process disappearing, replication failing
>> On 02/02/2011 09:06 AM, Andrew Kerr wrote:
>>> I'm running a single master with 13 replicas, all CentOS 5.5.  The master, and a few of the slaves, are running  We were previously on 1.2.4, with most replicas still on that version.
>> You might be running into https://bugzilla.redhat.com/show_bug.cgi?id=668619
>> The symptom of that bug is your server will just stop responding to
>> requests, including server-to-server requests like replication.  Your
>> server will still be running.
>> Does ps -ef|grep slapd show your server process is running?
>> Do you see the messages like "op=-1 fd=66 closed - T2" in your access log?
>>> All of a sudden, the slapd process on the replicas has started to disappear.  Nothing in the error log with level at 8192.  It's just gone.  I can start it up and it'll last about 5 minutes.  Replication is what seems to be breaking - it seems to go away right after an update.
>>> I've tried rolling the replicas back to 1.2.4, but when I initialize the consumers I get "Unable to parse the response to the startReplication extended operation.  Replication is aborting".
>>> Any suggestions on where to go from this point?  It seems is HIGHLY unstable.  But it seems it can't initialize 1.2.4 replicas (??), or maybe it just doesn't work at all.
>>> I'm not sure what the safe way is to roll back the master from, can I use "yum downgrade" safely?  At least now my  master and the replicas on 1.2.4 are working, I don't want to risk completely taking down ldap.
>>> Is there a good stable version I ought to be at?  I upgraded from 1.2.4 because of a number of other bugs, although none of them as bad as seems to be.
>>> Thanks - any help is greatly appreciated.
>>> --
>>> 389 users mailing list
>>> 389-users at lists.fedoraproject.org
>>> https://admin.fedoraproject.org/mailman/listinfo/389-users
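The thread distinguishes between access log lines ending in "closed - T2" (named as a symptom of bug 668619) and the "closed - B1" lines actually seen here. A quick tally of close codes on op=-1 lines makes that comparison easier. This is only a sketch, assuming the access log format shown in the thread; the function name is invented, and two of the three sample lines are hypothetical.

```python
import re
from collections import Counter

# Matches the connection-close records quoted in the thread, e.g.
# conn=4479 op=-1 fd=161 closed - B1
CLOSE = re.compile(r"conn=(\d+) op=(-?\d+) fd=(\d+) closed - (\w+)")


def count_close_codes(lines):
    """Tally close codes for connections closed with op=-1."""
    counts = Counter()
    for line in lines:
        match = CLOSE.search(line)
        # Only op=-1 closes are of interest here; closes tied to a
        # numbered operation are skipped.
        if match and match.group(2) == "-1":
            counts[match.group(4)] += 1
    return counts


SAMPLE_ACCESS = [
    # Copied from the thread.
    "[02/Feb/2011:10:05:06 -0500] conn=4479 op=-1 fd=161 closed - B1",
    # Hypothetical: built around the "op=-1 fd=66 closed - T2" fragment.
    "[02/Feb/2011:10:05:07 -0500] conn=4480 op=-1 fd=66 closed - T2",
    # Hypothetical: a close tied to a numbered operation, ignored by the tally.
    "[02/Feb/2011:10:05:08 -0500] conn=4481 op=2 fd=70 closed - U1",
]

for code, total in sorted(count_close_codes(SAMPLE_ACCESS).items()):
    print(code, total)
```

A log dominated by T2 closes would point toward the bug mentioned above, while mostly B1 closes, as reported here, suggests looking elsewhere.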
