[389-users] MMR issue ...

Rich Megginson rmeggins at redhat.com
Thu Nov 15 19:05:44 UTC 2012


On 11/15/2012 12:01 PM, Reinhard Nappert wrote:
>
> The system is cleaned up right now.  The next time, when it happens, I 
> hope I get this information.
>
> I am surprised that one session has more than 20.000 ops. There are 
> not many updates on the system…..
>

These are not just update operations, but all of the other search and 
replication extended operations.

> *From:*Rich Megginson [mailto:rmeggins at redhat.com]
> *Sent:* Thursday, November 15, 2012 1:36 PM
> *To:* Reinhard Nappert
> *Cc:* General discussion list for the 389 Directory server project.
> *Subject:* Re: [389-users] MMR issue ...
>
> On 11/15/2012 11:30 AM, Reinhard Nappert wrote:
>
> Rich,
>
> I also made the following observation:
>
> I see quite some open connections of the replication directory:
>
> connection: 84:20121114233917Z:5:5::cn=replication,cn=config
>
> connection: 85:20121113224528Z:3:3::cn= replication,cn=config
>
> connection: 88:20121112163205Z:3:3::cn= replication,cn=config
>
> connection: 89:20121113220034Z:4:4::cn= replication,cn=config
>
> connection: 90:20121115061343Z:209:209::cn= replication,cn=config
>
> connection: 91:20121114234513Z:23374:23373:r:cn= replication,cn=config
>
> connection: 92:20121114235823Z:9579:9579::cn= replication,cn=config
>
> connection: 93:20121115120212Z:1077:1076:r:cn= replication,cn=config
>
> I only see this on Server B. A and C are fine, meaning that the 
> session all disappear.
>
>
> If you do a netstat, what does it say are the states of these connections?
>
>
> Let me know, if you have any other idea what might be going on.
>
> Thanks,
>
> -Reinhard
>
> *From:*389-users-bounces at lists.fedoraproject.org 
> <mailto:389-users-bounces at lists.fedoraproject.org> 
> [mailto:389-users-bounces at lists.fedoraproject.org] *On Behalf Of 
> *Reinhard Nappert
> *Sent:* Tuesday, November 13, 2012 3:54 PM
> *To:* Rich Megginson
> *Cc:* General discussion list for the 389 Directory server project.
> *Subject:* Re: [389-users] MMR issue ...
>
> There are a lot of RST for the time frame, especially on Server A.
>
> I also see a lot TCP Retransmission for this particular replication 
> session. I also see a couple of TCP Dup ACKs. Is this enough to cause 
> the issue?
>
> Of course, I don’t have an idea, what is the cause of that!
>
> *From:*Rich Megginson [mailto:rmeggins at redhat.com]
> *Sent:* Tuesday, November 13, 2012 1:57 PM
> *To:* Reinhard Nappert
> *Cc:* General discussion list for the 389 Directory server project.
> *Subject:* Re: [389-users] MMR issue ...
>
> On 11/13/2012 11:53 AM, Reinhard Nappert wrote:
>
> How would you proceed to figure out what is going on there?
>
>
> Since it doesn't appear that the replication logs are giving enough 
> information, and you don't see any disconnects or TCP resets happening 
> in the packet capture, then I guess you have no choice but to 
> familiarize yourself with the source code and use gdb.
>
>
> You see that I ran out of ideas!
>
> Thanks
>
> *From:*Rich Megginson [mailto:rmeggins at redhat.com]
> *Sent:* Tuesday, November 13, 2012 1:32 PM
> *To:* Reinhard Nappert
> *Cc:* General discussion list for the 389 Directory server project.
> *Subject:* Re: [389-users] MMR issue ...
>
> On 11/13/2012 11:21 AM, Reinhard Nappert wrote:
>
> The 3 servers do not crash.
>
> I am not sure about the network, though. My first assumption was that 
> the firewall (between A and B) might cause the issue. The latest 
> occurrence (the one, I described) had the firewall removed. I see 
> quite some TCP Retransmissions in the packet captures. Could that be 
> the issue?
>
>
> That could be, although that would mean there are so many tcp 
> retransmissions that take such a long time to process that it causes 
> the application to think the network connection has timed out.
>
>
>
> -Reinhard
>
> *From:*Rich Megginson [mailto:rmeggins at redhat.com]
> *Sent:* Tuesday, November 13, 2012 1:15 PM
> *To:* General discussion list for the 389 Directory server project.
> *Cc:* Reinhard Nappert
> *Subject:* Re: [389-users] MMR issue ...
>
> On 11/13/2012 11:02 AM, Reinhard Nappert wrote:
>
> Rich,
>
> Do you know what the cause of this issue is?
>
>
> No, I don't know.
>
>
>
>
> You would expect that you saw this issue in different deployments, but 
> I only saw it in one instance.
>
> If it turns out that the issue I see is identical the issue, you 
> mentioned, I’d like to know, when it was fixed.
>
>
> Upon further investigation, this does not appear to be the same as 
> https://fedorahosted.org/389/ticket/374
>
> I'm not sure what the problem is.  I've seen timeouts when servers 
> crash or there are network issues.
>
>
>
>
> Thanks,
>
> -Reinhard
>
> *From:*389-users-bounces at lists.fedoraproject.org 
> <mailto:389-users-bounces at lists.fedoraproject.org> 
> [mailto:389-users-bounces at lists.fedoraproject.org] *On Behalf Of 
> *Reinhard Nappert
> *Sent:* Tuesday, November 13, 2012 12:22 PM
> *To:* Rich Megginson; General discussion list for the 389 Directory 
> server project.
> *Subject:* Re: [389-users] MMR issue ...
>
> I use 1.2.8.2
>
> *From:*Rich Megginson [mailto:rmeggins at redhat.com]
> *Sent:* Tuesday, November 13, 2012 12:18 PM
> *To:* General discussion list for the 389 Directory server project.
> *Cc:* Reinhard Nappert
> *Subject:* Re: [389-users] MMR issue ...
>
> On 11/13/2012 09:24 AM, Reinhard Nappert wrote:
>
> Hi,
>
> I’ve encountered issues with a MMR setup, which looks like the following:
>
>      A ------- B
>
>        \           /
>
>          \       /
>
>            \   /
>
>              C
>
> The replication works for approximately 24 hours. There are not many 
> changes to the content anyway. After about 1 day, the attribute  value 
> of the type “nsds5replicaLastUpdateStatus”  changes to “1 Can't 
> acquire busy replica “ of the replication agreement object from type 
> “nsDS5ReplicationAgreement”.  I see this message on C for the 
> agreement “C-to-B”.  The start-time of the last update is 01:08:33. 
>  When I check the status on B, it looks fine for “B-to-C” and 
> “B-to-A”, however, the start-time of the last update is stuck at 
> 01:08:36 for “B-to-C”, whereas A gets updated afterwards as well. I 
> don’t have the values for A!
>
> When, I check errors and access on the boxes, I see the following:
>
> Errors on A:
>
> [10/Nov/2012:01:19:31 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" 
> (B:389): Warning: unable to receive endReplication extended operation 
> response (Timed out)
>
> [10/Nov/2012:01:25:01 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" 
> (B:389): Unable to receive the response for a startReplication 
> extended operation to consumer (Can't contact LDAP server). Will retry 
> later.
>
> [10/Nov/2012:01:25:05 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" 
> (B:389): Replication bind with SIMPLE auth resumed
>
> [10/Nov/2012:02:26:29 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" 
> (B:389): Unable to receive the response for a startReplication 
> extended operation to consumer (Timed out). Will retry later.
>
> [10/Nov/2012:02:31:55 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" 
> (B:389): Unable to receive the response for a startReplication 
> extended operation to consumer (Can't contact LDAP server). Will retry 
> later.
>
> [10/Nov/2012:02:31:59 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" 
> (B:389): Replication bind with SIMPLE auth resumed
>
> [10/Nov/2012:02:43:36 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" 
> (B:389): Unable to receive the response for a startReplication 
> extended operation to consumer (Timed out). Will retry later.
>
> [10/Nov/2012:03:03:00 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" 
> (B:389): Unable to receive the response for a startReplication 
> extended operation to consumer (Timed out). Will retry later.
>
> [10/Nov/2012:03:08:24 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" 
> (B:389): Unable to receive the response for a startReplication 
> extended operation to consumer (Can't contact LDAP server). Will retry 
> later.
>
> [10/Nov/2012:03:11:35 -0300] slapi_ldap_bind - Error: could not send 
> bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 
> (Can't connect to the LDAP server) -5961 (TCP connection reset by 
> peer.) 115 (Operation now in progress)
>
> [10/Nov/2012:03:11:35 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" 
> (B:389): Replication bind with SIMPLE auth failed: LDAP error 91 
> (Can't connect to the LDAP server) ((null))
>
> [10/Nov/2012:03:14:45 -0300] slapi_ldap_bind - Error: could not send 
> bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 
> (Can't connect to the LDAP server) -5961 (TCP connection reset by 
> peer.) 115 (Operation now in progress)
>
> [10/Nov/2012:03:14:52 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" 
> (B:389): Replication bind with SIMPLE auth resumed
>
> [10/Nov/2012:03:33:29 -0300] slapi_ldap_bind - Error: could not send 
> bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 
> (Can't connect to the LDAP server) -5961 (TCP connection reset by 
> peer.) 115 (Operation now in progress)
>
> [10/Nov/2012:03:33:29 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" 
> (B:389): Replication bind with SIMPLE auth failed: LDAP error 91 
> (Can't connect to the LDAP server) ((null))
>
> [10/Nov/2012:03:43:29 -0300] slapi_ldap_bind - Error: timeout after 
> [0.0] seconds reading bind response for [cn=replication,cn=config] 
> mech [SIMPLE]
>
> [10/Nov/2012:03:43:29 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" 
> (B:389): Replication bind with SIMPLE auth failed: LDAP error 85 
> (Timed out) ((null))
>
> [10/Nov/2012:03:46:39 -0300] slapi_ldap_bind - Error: could not send 
> bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 
> (Can't connect to the LDAP server) -5961 (TCP connection reset by 
> peer.) 115 (Operation now in progress)
>
> [10/Nov/2012:03:46:39 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" 
> (B:389): Replication bind with SIMPLE auth failed: LDAP error 91 
> (Can't connect to the LDAP server) ((null))
>
> [10/Nov/2012:03:46:42 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" 
> (B:389): Replication bind with SIMPLE auth resumed
>
> [10/Nov/2012:05:12:02 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" 
> (B:389): Unable to receive the response for a startReplication 
> extended operation to consumer (Timed out). Will retry later.
>
> [10/Nov/2012:06:16:01 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" 
> (B:389): Unable to receive the response for a startReplication 
> extended operation to consumer (Timed out). Will retry later.
>
> [10/Nov/2012:06:21:27 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" 
> (B:389): Unable to receive the response for a startReplication 
> extended operation to consumer (Can't contact LDAP server). Will retry 
> later.
>
> [10/Nov/2012:06:21:46 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" 
> (B:389): Replication bind with SIMPLE auth resumed
>
> [10/Nov/2012:06:39:23 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" 
> (B:389): Unable to receive the response for a startReplication 
> extended operation to consumer (Timed out). Will retry later.
>
> [10/Nov/2012:07:43:45 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" 
> (B:389): Unable to receive the response for a startReplication 
> extended operation to consumer (Timed out). Will retry later.
>
> [10/Nov/2012:19:38:22 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" 
> (B:389): Unable to receive the response for a startReplication 
> extended operation to consumer (Timed out). Will retry later.
>
> [10/Nov/2012:19:43:51 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" 
> (B:389): Unable to receive the response for a startReplication 
> extended operation to consumer (Can't contact LDAP server). Will retry 
> later.
>
> [10/Nov/2012:19:43:54 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" 
> (B:389): Replication bind with SIMPLE auth resumed
>
> [10/Nov/2012:20:18:21 -0300] slapi_ldap_bind - Error: could not send 
> bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 
> (Can't connect to the LDAP server) -5961 (TCP connection reset by 
> peer.) 115 (Operation now in progress)
>
> [10/Nov/2012:20:18:21 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" 
> (B:389): Replication bind with SIMPLE auth failed: LDAP error 91 
> (Can't connect to the LDAP server) ((null))
>
> [10/Nov/2012:20:23:21 -0300] slapi_ldap_bind - Error: could not send 
> bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 
> (Can't connect to the LDAP server) -5961 (TCP connection reset by 
> peer.) 115 (Operation now in progress)
>
> [10/Nov/2012:20:28:21 -0300] slapi_ldap_bind - Error: could not send 
> bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 
> (Can't connect to the LDAP server) -5961 (TCP connection reset by 
> peer.) 115 (Operation now in progress)
>
> [10/Nov/2012:20:33:20 -0300] slapi_ldap_bind - Error: could not send 
> bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 
> (Can't connect to the LDAP server) -5961 (TCP connection reset by 
> peer.) 115 (Operation now in progress)
>
> [10/Nov/2012:20:38:22 -0300] slapi_ldap_bind - Error: could not send 
> bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 
> (Can't connect to the LDAP server) -5961 (TCP connection reset by 
> peer.) 115 (Operation now in progress)
>
> [10/Nov/2012:20:43:22 -0300] slapi_ldap_bind - Error: could not send 
> bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 
> (Can't connect to the LDAP server) -5961 (TCP connection reset by 
> peer.) 115 (Operation now in progress)
>
> [10/Nov/2012:20:48:20 -0300] slapi_ldap_bind - Error: could not send 
> bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 
> (Can't connect to the LDAP server) -5961 (TCP connection reset by 
> peer.) 115 (Operation now in progress)
>
> [10/Nov/2012:21:00:15 -0300] slapi_ldap_bind - Error: timeout after 
> [0.0] seconds reading bind response for [cn=replication,cn=config] 
> mech [SIMPLE]
>
> [10/Nov/2012:21:00:15 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" 
> (B:389): Replication bind with SIMPLE auth failed: LDAP error 85 
> (Timed out) ((null))
>
> [10/Nov/2012:21:03:24 -0300] slapi_ldap_bind - Error: could not send 
> bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 
> (Can't connect to the LDAP server) -5961 (TCP connection reset by 
> peer.) 115 (Operation now in progress)
>
> [10/Nov/2012:21:03:24 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" 
> (B:389): Replication bind with SIMPLE auth failed: LDAP error 91 
> (Can't connect to the LDAP server) ((null))
>
> [10/Nov/2012:21:06:36 -0300] slapi_ldap_bind - Error: could not send 
> bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 
> (Can't connect to the LDAP server) -5961 (TCP connection reset by 
> peer.) 115 (Operation now in progress)
>
> [10/Nov/2012:21:16:39 -0300] slapi_ldap_bind - Error: timeout after 
> [0.0] seconds reading bind response for [cn=replication,cn=config] 
> mech [SIMPLE]
>
> [10/Nov/2012:21:16:39 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" 
> (B:389): Replication bind with SIMPLE auth failed: LDAP error 85 
> (Timed out) ((null))
>
> [10/Nov/2012:21:24:14 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" 
> (B:389): Replication bind with SIMPLE auth resumed
>
> Errors on B:
>
> [10/Nov/2012:01:27:56 -0300] NSMMReplicationPlugin - agmt="cn=B-to-A" 
> (A:389): Replication bind with SIMPLE auth resumed
>
> Errors on C:
>
> No entry for the time
>
> I also analyzed the access files and see the following:
>
> From access on A:
>
> I see replication sessions from B and C, established at 01:00:10 and 
> closed at 01:01:14.
>
> Next replication session from C, established at 01:05:12 and closed at 
> 01:06:12. There was NO session from B.
>
> B and C establish the next replication session at 01:07:52. During the 
> session, a MOD is performed at 01:08:33. C closes this session at 
> 01:09:35. The session from B stays open, and I see a ABANDOM request 
> from A at 01:17:53. Eventually, the session gets closed (no unbind) at 
> 01:23:25.
>
> [10/Nov/2012:01:17:53 -0300] conn=1439 op=3 ABANDON targetop=NOTFOUND 
> msgid=3
>
> Replication sessions from C continue to work fine.
>
> There is one more replication session from B, established at 01:27:56 
> and closed at 01:28:38. After this session, B does not establish any 
> replication session anymore.
>
> From access on B:
>
> I see replication sessions from A and C, established at 01:00:09 and 
> closed at 01:01:14.
>
> Next replication session from C, established at 01:05:12 and closed at 
> 01:06:12. There was NO session from A.
>
> A and C establish the next replication session at 01:07:52. During the 
> session, a MOD is performed at 01:08:33. C closes this session at 
> 01:09:35. The session from A stays open forever!!!
>
> Replication sessions from C continue to work fine.
>
> From access on C:
>
> I see replication sessions from A and B, established at 01:00:09 and 
> closed at 01:01:14.
>
> I don’t see any replication session from the 01:05 time-slot
>
> A and B establish the next replication session at 01:07:53. During the 
> session, a MOD is performed at 01:08:33. A closes this session at 
> 01:09:33 and B closes session at 01:09:38
>
> From now on, I only see replication sessions from A!
>
> Has anybody seen this kind of behavior? Any feedback is highly 
> appreciated.
>
> This looks like https://fedorahosted.org/389/ticket/374
>
> What version of 389-ds-base?
>
> Thanks,
>
> -Reinhard
>
>
>
>
>
>
> --
> 389 users mailing list
> 389-users at lists.fedoraproject.org  <mailto:389-users at lists.fedoraproject.org>
> https://admin.fedoraproject.org/mailman/listinfo/389-users
>
>
>
>
>
>
> --
> 389 users mailing list
> 389-users at lists.fedoraproject.org  <mailto:389-users at lists.fedoraproject.org>
> https://admin.fedoraproject.org/mailman/listinfo/389-users
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.fedoraproject.org/pipermail/389-users/attachments/20121115/b25ac441/attachment.html>


More information about the 389-users mailing list