Rich, I also made the following observation: I see quite some open connections of the replication directory: connection: 84:20121114233917Z:5:5::cn=replication,cn=config connection: 85:20121113224528Z:3:3::cn= replication,cn=config connection: 88:20121112163205Z:3:3::cn= replication,cn=config connection: 89:20121113220034Z:4:4::cn= replication,cn=config connection: 90:20121115061343Z:209:209::cn= replication,cn=config connection: 91:20121114234513Z:23374:23373:r:cn= replication,cn=config connection: 92:20121114235823Z:9579:9579::cn= replication,cn=config connection: 93:20121115120212Z:1077:1076:r:cn= replication,cn=config I only see this on Server B. A and C are fine, meaning that the session all disappear. Let me know, if you have any other idea what might be going on. Thanks, -Reinhard From: 389-users-bounces@xxxxxxxxxxxxxxxxxxxxxxx [mailto:389-users-bounces@xxxxxxxxxxxxxxxxxxxxxxx]
On Behalf Of Reinhard Nappert There are a lot of RST for the time frame, especially on Server A.
I also see a lot TCP Retransmission for this particular replication session. I also see a couple of TCP Dup ACKs. Is this enough to cause the issue? Of course, I don’t have an idea, what is the cause of that! From: Rich Megginson [mailto:rmeggins@xxxxxxxxxx]
On 11/13/2012 11:53 AM, Reinhard Nappert wrote: How would you proceed to figure out what is going on there?
You see that I ran out of ideas! Thanks From: Rich Megginson [mailto:rmeggins@xxxxxxxxxx]
On 11/13/2012 11:21 AM, Reinhard Nappert wrote: The 3 servers do not crash. I am not sure about the network, though. My first assumption was that the firewall (between A and B) might cause the issue. The latest occurrence (the one, I described) had the firewall removed. I see quite some
TCP Retransmissions in the packet captures. Could that be the issue?
-Reinhard From: Rich Megginson [mailto:rmeggins@xxxxxxxxxx]
On 11/13/2012 11:02 AM, Reinhard Nappert wrote: Rich, Do you know what the cause of this issue is?
You would expect that you saw this issue in different deployments, but I only saw it in one instance.
If it turns out that the issue I see is identical the issue, you mentioned, I’d like to know, when it was fixed.
Thanks, -Reinhard From:
389-users-bounces@xxxxxxxxxxxxxxxxxxxxxxx [mailto:389-users-bounces@xxxxxxxxxxxxxxxxxxxxxxx]
On Behalf Of Reinhard Nappert I use 1.2.8.2 From: Rich Megginson [mailto:rmeggins@xxxxxxxxxx]
On 11/13/2012 09:24 AM, Reinhard Nappert wrote: Hi, I’ve encountered issues with a MMR setup, which looks like the following: A ------- B \ / \ / \ / C The replication works for approximately 24 hours. There are not many changes to the content anyway. After about 1 day, the attribute value of the type “nsds5replicaLastUpdateStatus” changes to “1 Can't acquire busy replica “ of the replication
agreement object from type “nsDS5ReplicationAgreement”. I see this message on C for the agreement “C-to-B”. The start-time of the last update is 01:08:33. When I check the status on B, it looks fine for “B-to-C” and “B-to-A”, however, the start-time of
the last update is stuck at 01:08:36 for “B-to-C”, whereas A gets updated afterwards as well. I don’t have the values for A! When, I check errors and access on the boxes, I see the following: Errors on A: [10/Nov/2012:01:19:31 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Warning: unable to receive endReplication extended operation response (Timed out) [10/Nov/2012:01:25:01 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Unable to receive the response for a startReplication extended operation to consumer (Can't contact LDAP server). Will retry later. [10/Nov/2012:01:25:05 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth resumed [10/Nov/2012:02:26:29 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later. [10/Nov/2012:02:31:55 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Unable to receive the response for a startReplication extended operation to consumer (Can't contact LDAP server). Will retry later. [10/Nov/2012:02:31:59 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth resumed [10/Nov/2012:02:43:36 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later. [10/Nov/2012:03:03:00 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later. [10/Nov/2012:03:08:24 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Unable to receive the response for a startReplication extended operation to consumer (Can't contact LDAP server). Will retry later. [10/Nov/2012:03:11:35 -0300] slapi_ldap_bind - Error: could not send bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress) [10/Nov/2012:03:11:35 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the LDAP server) ((null)) [10/Nov/2012:03:14:45 -0300] slapi_ldap_bind - Error: could not send bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress) [10/Nov/2012:03:14:52 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth resumed [10/Nov/2012:03:33:29 -0300] slapi_ldap_bind - Error: could not send bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress) [10/Nov/2012:03:33:29 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the LDAP server) ((null)) [10/Nov/2012:03:43:29 -0300] slapi_ldap_bind - Error: timeout after [0.0] seconds reading bind response for [cn=replication,cn=config] mech [SIMPLE] [10/Nov/2012:03:43:29 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth failed: LDAP error 85 (Timed out) ((null)) [10/Nov/2012:03:46:39 -0300] slapi_ldap_bind - Error: could not send bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress) [10/Nov/2012:03:46:39 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the LDAP server) ((null)) [10/Nov/2012:03:46:42 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth resumed [10/Nov/2012:05:12:02 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later. [10/Nov/2012:06:16:01 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later. [10/Nov/2012:06:21:27 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Unable to receive the response for a startReplication extended operation to consumer (Can't contact LDAP server). Will retry later. [10/Nov/2012:06:21:46 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth resumed [10/Nov/2012:06:39:23 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later. [10/Nov/2012:07:43:45 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later. [10/Nov/2012:19:38:22 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later. [10/Nov/2012:19:43:51 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Unable to receive the response for a startReplication extended operation to consumer (Can't contact LDAP server). Will retry later. [10/Nov/2012:19:43:54 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth resumed [10/Nov/2012:20:18:21 -0300] slapi_ldap_bind - Error: could not send bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress) [10/Nov/2012:20:18:21 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the LDAP server) ((null)) [10/Nov/2012:20:23:21 -0300] slapi_ldap_bind - Error: could not send bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress) [10/Nov/2012:20:28:21 -0300] slapi_ldap_bind - Error: could not send bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress) [10/Nov/2012:20:33:20 -0300] slapi_ldap_bind - Error: could not send bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress) [10/Nov/2012:20:38:22 -0300] slapi_ldap_bind - Error: could not send bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress) [10/Nov/2012:20:43:22 -0300] slapi_ldap_bind - Error: could not send bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress) [10/Nov/2012:20:48:20 -0300] slapi_ldap_bind - Error: could not send bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress) [10/Nov/2012:21:00:15 -0300] slapi_ldap_bind - Error: timeout after [0.0] seconds reading bind response for [cn=replication,cn=config] mech [SIMPLE] [10/Nov/2012:21:00:15 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth failed: LDAP error 85 (Timed out) ((null)) [10/Nov/2012:21:03:24 -0300] slapi_ldap_bind - Error: could not send bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress) [10/Nov/2012:21:03:24 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the LDAP server) ((null)) [10/Nov/2012:21:06:36 -0300] slapi_ldap_bind - Error: could not send bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress) [10/Nov/2012:21:16:39 -0300] slapi_ldap_bind - Error: timeout after [0.0] seconds reading bind response for [cn=replication,cn=config] mech [SIMPLE] [10/Nov/2012:21:16:39 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth failed: LDAP error 85 (Timed out) ((null)) [10/Nov/2012:21:24:14 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth resumed Errors on B: [10/Nov/2012:01:27:56 -0300] NSMMReplicationPlugin - agmt="cn=B-to-A" (A:389): Replication bind with SIMPLE auth resumed Errors on C: No entry for the time I also analyzed the access files and see the following: From access on A: I see replication sessions from B and C, established at 01:00:10 and closed at 01:01:14. Next replication session from C, established at 01:05:12 and closed at 01:06:12. There was NO session from B. B and C establish the next replication session at 01:07:52. During the session, a MOD is performed at 01:08:33. C closes this session at 01:09:35. The session from B stays open, and I see a ABANDOM request from A at 01:17:53. Eventually,
the session gets closed (no unbind) at 01:23:25. [10/Nov/2012:01:17:53 -0300] conn=1439 op=3 ABANDON targetop=NOTFOUND msgid=3 Replication sessions from C continue to work fine. There is one more replication session from B, established at 01:27:56 and closed at 01:28:38. After this session, B does not establish any replication session anymore. From access on B: I see replication sessions from A and C, established at 01:00:09 and closed at 01:01:14. Next replication session from C, established at 01:05:12 and closed at 01:06:12. There was NO session from A. A and C establish the next replication session at 01:07:52. During the session, a MOD is performed at 01:08:33. C closes this session at 01:09:35. The session from A stays open forever!!! Replication sessions from C continue to work fine. From access on C: I see replication sessions from A and B, established at 01:00:09 and closed at 01:01:14. I don’t see any replication session from the 01:05 time-slot A and B establish the next replication session at 01:07:53. During the session, a MOD is performed at 01:08:33. A closes this session at 01:09:33 and B closes session at 01:09:38 From now on, I only see replication sessions from A! Has anybody seen this kind of behavior? Any feedback is highly appreciated. This looks like
https://fedorahosted.org/389/ticket/374 Thanks, -Reinhard
-- 389 users mailing list 389-users@xxxxxxxxxxxxxxxxxxxxxxx https://admin.fedoraproject.org/mailman/listinfo/389-users
-- 389 users mailing list 389-users@xxxxxxxxxxxxxxxxxxxxxxx https://admin.fedoraproject.org/mailman/listinfo/389-users |
-- 389 users mailing list 389-users@xxxxxxxxxxxxxxxxxxxxxxx https://admin.fedoraproject.org/mailman/listinfo/389-users