Hi,
I have the following 389 DS version deployed: 389-Directory/1.2.8.2 B2011.130.190
I have a 3 box multi-master replication setup in a ring:
   \ /     \ /     \ /     \ /     \ /
... C ----- A ----- B ----- C ----- A ...
   / \     / \     / \     / \     / \
The replication agreements between "A" and "C" and between "B" and "C" work fine, but I have an issue with the agreements between "A" and "B".
I see the following in the errors file:
Server A:
[19/Jul/2012:07:28:50 -0300] NSMMReplicationPlugin - conn=7835 op=160267 repl="o=base": Begin incremental protocol
[19/Jul/2012:07:28:50 -0300] - csngen_adjust_time: gen state before 5007e1610000:1342693727:0:2
[19/Jul/2012:07:28:50 -0300] - _csngen_adjust_local_time: gen state before 5007e1610000:1342693727:0:2
[19/Jul/2012:07:28:50 -0300] - _csngen_adjust_local_time: gen state after 5007e1640000:1342693730:0:2
[19/Jul/2012:07:28:50 -0300] NSMMReplicationPlugin - conn=7835 op=160267 repl="o=BASE": Replica in use locking_purl=conn=7831 id=3
[19/Jul/2012:07:28:50 -0300] NSMMReplicationPlugin - conn=7835 op=160267 replica="o=BASE": Unable to acquire replica: error: replica busy locked by conn=7831 id=3 for incremental update
[19/Jul/2012:07:28:50 -0300] NSMMReplicationPlugin - conn=7835 op=160267 repl="o=umc": StartNSDS90ReplicationRequest: response=1 rc=0
This kind of error is logged about once per second; only the local_time value (for example 5007e1610000:1342693727:0:2) differs between occurrences.
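For anyone trying to read the "gen state" values in the Server A log, here is a minimal sketch. It assumes the four colon-separated fields are a CSN (8 hex digits of Unix time followed by a 4-hex-digit sequence number), the sampled local time, a local offset, and a remote offset. These field names are inferred from the logged numbers, not taken from documentation, and `parse_gen_state` is a hypothetical helper.

```python
from datetime import datetime, timezone

def parse_gen_state(state: str) -> dict:
    """Split a 'gen state' value such as '5007e1610000:1342693727:0:2'.

    The field names below are assumptions inferred from the logged values.
    """
    csn_hex, sampled, local_off, remote_off = state.split(":")
    return {
        "csn_time": int(csn_hex[:8], 16),  # first 8 hex digits, read as seconds since epoch
        "seq": int(csn_hex[8:], 16),       # remaining hex digits, read as a sequence number
        "sampled_time": int(sampled),
        "local_offset": int(local_off),
        "remote_offset": int(remote_off),
    }

# The two values logged by _csngen_adjust_local_time on Server A
before = parse_gen_state("5007e1610000:1342693727:0:2")
after = parse_gen_state("5007e1640000:1342693730:0:2")

for label, s in (("before", before), ("after", after)):
    utc = datetime.fromtimestamp(s["csn_time"], tz=timezone.utc)
    print(label, s["csn_time"], utc.isoformat(), "remote_offset =", s["remote_offset"])
```

Under that reading, the CSN time in both values equals the sampled time plus the two offsets (1342693727 + 0 + 2 = 1342693729), and it advances by 3 seconds between "before" and "after".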
Server B:
[19/Jul/2012:13:28:48 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (A:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[19/Jul/2012:13:34:17 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (A:389): Unable to receive the response for a startReplication extended operation to consumer (Can't contact LDAP server). Will retry later.
[19/Jul/2012:13:44:25 -0300] slapi_ldap_bind - Error: timeout after [0.0] seconds reading bind response for [cn=replication,cn=config] mech [SIMPLE]
[19/Jul/2012:13:44:25 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (A:389): Replication bind with SIMPLE auth failed: LDAP error 85 (Timed out) ((null))
[19/Jul/2012:13:44:25 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (A:389): Replication bind with SIMPLE auth resumed
Sometimes, I also see the following error:
[20/Jul/2012:11:28:39 -0300] slapi_ldap_bind - Error: could not send bind request for id [cn= replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[20/Jul/2012:11:28:39 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (A:389): Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the LDAP server) ((null))
[20/Jul/2012:11:30:30 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (A:389): Replication bind with SIMPLE auth resumed
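To see how often each bind failure occurs on Server B, the errors file can be split into entries and the LDAP error codes tallied. The following is a rough sketch, assuming every entry begins with a bracketed timestamp in the format shown above. `split_entries` and `count_ldap_errors` are hypothetical helper names, and the sample text reuses three entries from the log excerpt.

```python
import re
from collections import Counter

# An entry starts with a timestamp such as [20/Jul/2012:11:28:39 -0300].
# The lookahead splits before each timestamp without consuming it (Python 3.7+).
ENTRY_START = re.compile(r"(?=\[\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4}\])")
LDAP_ERROR = re.compile(r"LDAP error (\d+) \(([^)]*)\)")

def split_entries(text):
    """Return the individual log entries found in a block of log text."""
    return [part.strip() for part in ENTRY_START.split(text) if part.strip()]

def count_ldap_errors(entries):
    """Count occurrences of each (code, description) pair across entries."""
    counts = Counter()
    for entry in entries:
        for code, description in LDAP_ERROR.findall(entry):
            counts[(int(code), description)] += 1
    return counts

SAMPLE = (
    "[19/Jul/2012:13:44:25 -0300] NSMMReplicationPlugin - agmt=\"cn=A-to-B\" (A:389): "
    "Replication bind with SIMPLE auth failed: LDAP error 85 (Timed out) ((null)) "
    "[20/Jul/2012:11:28:39 -0300] NSMMReplicationPlugin - agmt=\"cn=A-to-B\" (A:389): "
    "Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the LDAP server) ((null)) "
    "[20/Jul/2012:11:30:30 -0300] NSMMReplicationPlugin - agmt=\"cn=A-to-B\" (A:389): "
    "Replication bind with SIMPLE auth resumed"
)

entries = split_entries(SAMPLE)
counts = count_ldap_errors(entries)
for (code, description), n in sorted(counts.items()):
    print(code, description, n)
```

Run against the full errors file, a tally like this would show whether the timeouts (85) or the connection failures (91) dominate, which may help narrow down where the A-to-B link is failing.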
I don't see any indication that Server B was down at that time.
I did see the Bug 571677 (https://bugzilla.redhat.com/show_bug.cgi?id=571677), but there was no deletion of a replicaconflict object.
Has anybody encountered this kind of issue? My next question would be how to recover the MMR environment.
Thanks, -Reinhard
Has somebody seen this problem as well?
-Reinhard
From: 389-users-bounces@lists.fedoraproject.org [mailto:389-users-bounces@lists.fedoraproject.org] On Behalf Of Reinhard Nappert Sent: Friday, August 03, 2012 2:51 PM To: 389-users@lists.fedoraproject.org Subject: [389-users] MMR issue
Hi
I must say these LDAP replication connections look quite unusual. Can you provide more information about:
- The type of replication servers? I guess some servers are masters and some are maybe slaves?
- Do the errors occur when you try to initiate replication manually?
Some errors suggest that there may be other replication/LDAP operations in progress, in which case the target server sends a message about the lock:
[19/Jul/2012:07:28:50 -0300] NSMMReplicationPlugin - conn=7835 op=160267 repl="o=BASE": Replica in use locking_purl=conn=7831 id=3
[19/Jul/2012:07:28:50 -0300] NSMMReplicationPlugin - conn=7835 op=160267 replica="o=BASE": Unable to acquire replica: error: replica busy locked by conn=7831 id=3 for incremental update
[19/Jul/2012:07:28:50 -0300] NSMMReplicationPlugin - conn=7835 op=160267 repl="o=base": StartNSDS90ReplicationRequest: response=1 rc=0
Other errors suggest that there may be no connection between the servers. Maybe the target server is too busy to respond, or maybe there is a network/firewall problem:
[19/Jul/2012:13:28:48 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (A:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.
[19/Jul/2012:13:34:17 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (A:389): Unable to receive the response for a startReplication extended operation to consumer (Can't contact LDAP server). Will retry later.
(...)
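A quick way to test the network/firewall theory is to attempt a plain TCP connection to the LDAP port from the other server. Below is a minimal sketch: `can_connect` is a hypothetical helper, port 389 is taken from the "(A:389)" in the log, and the demonstration runs against a throwaway listener on localhost rather than the real hosts.

```python
import socket

def can_connect(host: str, port: int = 389, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False

# Demonstration against a temporary local listener (stands in for a directory server).
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]

open_ok = can_connect("127.0.0.1", demo_port, timeout=2.0)
listener.close()
closed_ok = can_connect("127.0.0.1", demo_port, timeout=2.0)

print("listener up:", open_ok)
print("listener closed:", closed_ok)
```

On the real deployment one would call `can_connect` from Server B with Server A's host name, and the other way round. A successful connect only shows that the port is reachable at that moment; it would not reveal a firewall that drops idle or long-lived connections.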
Please provide information about the replication types. Try a manually initiated replication and monitor the logs carefully; this may provide more information. If you want to push updates from one server to the others, please consider using multi-master connections and a hub server (see the Red Hat docs for more details).
Greg.
2012/8/7 Reinhard Nappert rnappert@juniper.net
-- 389 users mailing list 389-users@lists.fedoraproject.org https://admin.fedoraproject.org/mailman/listinfo/389-users
All of those servers are masters. This is a multi-master environment.
Your point about a firewall sitting between the servers is a good one! I don't have any access to the deployment, though. It is worth investigating.
Thanks -Reinhard
From: 389-users-bounces@lists.fedoraproject.org [mailto:389-users-bounces@lists.fedoraproject.org] On Behalf Of Grzegorz Dwornicki Sent: Tuesday, August 07, 2012 1:12 PM To: General discussion list for the 389 Directory server project. Subject: Re: [389-users] MMR issue