[389-users] 389ds v.1.3.2.24 replication deadlocks/retry count exceeded

Rich Megginson rmeggins at redhat.com
Wed Nov 12 17:59:25 UTC 2014


On 11/12/2014 10:55 AM, Ivanov Andrey (M.) wrote:
>
>
>
>
>
> ------------------------------------------------------------------------
>
>     *De: *"Ivanov Andrey (M.)" <andrey.ivanov at polytechnique.fr>
>     *À: *"General discussion list for the 389 Directory server
>     project." <389-users at lists.fedoraproject.org>
>     *Envoyé: *Mercredi 12 Novembre 2014 18:52:44
>     *Objet: *[389-users] 389ds v.1.3.2.24 replication deadlocks/retry
>     count        exceeded
>
>     Hi,
>
>     I've continued testing 389ds v.1.3.2.24 on CentOS 7. I really have
>     an impression that everything works fine (plugins etc) but the
>     replication seems to be a little fragile. Both of the tickets i've
>     already opened concern replication partially or completely
>     (https://fedorahosted.org/389/ticket/47942 and
>     https://fedorahosted.org/389/ticket/47950).
>
>     Here is another issue with replication :
>     i have two servers with multi-master agreements on each of them
>     (the same configuration as in ticket
>     https://fedorahosted.org/389/ticket/47942).
>
>     We add/delete a lot of groups (943, to be exact). Each group may
>     contain a large number of referenced entries, up to ~250
>     (uniqueMember: dn). MemberOf plugin is activated and works fine.
>     Referential integrity plugin is also activated but of course it is
>     of any sense only when deleting groups (or renaming them). It goes
>     on for a long time (20-30 minutes or more). Some time after the
>     beginning of the operations (typically 5-8 minutes) we have
>     replication erros and inconsistency of the replica concerning the
>     entries mentioned in error log.
>
>     When adding and deleting groups the supplier is ok. Howevere the
>     consumer has several (from one to four or five) groupe
>     deletions/adds that are not replicated. The error on the supplier:
>
>     [12/Nov/2014:16:46:42 +0100] NSMMReplicationPlugin -
>     agmt="cn=Replication from ldap-edev.polytechnique.fr to
>     ldap-model.polytechnique.fr" (ldap-model:636): Consumer failed to
>     replay change (uniqueid fa90219d-6a8211e4-a42c901a-94623bee, CSN
>     546380d6000000020000): Operations error (1). Will retry later.
>     [12/Nov/2014:16:47:55 +0100] NSMMReplicationPlugin -
>     agmt="cn=Replication from ldap-edev.polytechnique.fr to
>     ldap-model.polytechnique.fr" (ldap-model:636): Consumer failed to
>     replay change (uniqueid 1e5367ae-6a8311e4-a42c901a-94623bee, CSN
>     54638125000000020000): Operations error (1). Will retry later.
>     [12/Nov/2014:16:53:14 +0100] NSMMReplicationPlugin -
>     agmt="cn=Replication from ldap-edev.polytechnique.fr to
>     ldap-model.polytechnique.fr" (ldap-model:636): Consumer failed to
>     replay change (uniqueid f4e70b85-6a8311e4-a42c901a-94623bee, CSN
>     54638260000000020000): Operations error (1). Will retry later.
>     [12/Nov/2014:16:55:12 +0100] NSMMReplicationPlugin -
>     agmt="cn=Replication from ldap-edev.polytechnique.fr to
>     ldap-model.polytechnique.fr" (ldap-model:636): Consumer failed to
>     replay change (uniqueid 3c6d978a-6a8411e4-a42c901a-94623bee, CSN
>     546382d6000400020000): Operations error (1). Will retry later.
>     [12/Nov/2014:16:56:31 +0100] NSMMReplicationPlugin -
>     agmt="cn=Replication from ldap-edev.polytechnique.fr to
>     ldap-model.polytechnique.fr" (ldap-model:636): Consumer failed to
>     replay change (uniqueid 6030dd93-6a8411e4-a42c901a-94623bee, CSN
>     54638325000000020000): Operations error (1). Will retry later.
>     [12/Nov/2014:16:57:22 +0100] NSMMReplicationPlugin -
>     agmt="cn=Replication from ldap-edev.polytechnique.fr to
>     ldap-model.polytechnique.fr" (ldap-model:636): Consumer failed to
>     replay change (uniqueid 83f42395-6a8411e4-a42c901a-94623bee, CSN
>     5463835d000000020000): Operations error (1). Will retry later.
>
>     The corresponding errors on the consumer seem to hint deadlocks in
>     these cases:
>     [12/Nov/2014:16:46:41 +0100] NSMMReplicationPlugin - changelog
>     program - _cl5WriteOperationTxn: retry (49) the transaction
>     (csn=546380d6000000020000) failed (rc=-30993 (BDB0068
>     DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
>     [12/Nov/2014:16:46:41 +0100] NSMMReplicationPlugin - changelog
>     program - _cl5WriteOperationTxn: failed to write entry with csn
>     (546380d6000000020000); db error - -30993 BDB0068
>     DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
>     [12/Nov/2014:16:46:41 +0100] NSMMReplicationPlugin -
>     write_changelog_and_ruv: can't add a change for
>     cn=LAN452ESP-2014,ou=2014,ou=Cours,ou=Enseignement,ou=Groupes,dc=id,dc=polytechnique,dc=edu
>     (uniqid: fa90219d-6a8211e4-a42c901a-94623bee, optype: 16) to
>     changelog csn 546380d6000000020000
>     [12/Nov/2014:16:47:54 +0100] NSMMReplicationPlugin - changelog
>     program - _cl5WriteOperationTxn: retry (49) the transaction
>     (csn=54638125000000020000) failed (rc=-30993 (BDB0068
>     DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
>     [12/Nov/2014:16:47:54 +0100] NSMMReplicationPlugin - changelog
>     program - _cl5WriteOperationTxn: failed to write entry with csn
>     (54638125000000020000); db error - -30993 BDB0068
>     DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
>     [12/Nov/2014:16:47:54 +0100] NSMMReplicationPlugin -
>     write_changelog_and_ruv: can't add a change for
>     cn=LAN472EFLE-2014,ou=2014,ou=Cours,ou=Enseignement,ou=Groupes,dc=id,dc=polytechnique,dc=edu
>     (uniqid: 1e5367ae-6a8311e4-a42c901a-94623bee, optype: 16) to
>     changelog csn 54638125000000020000
>     [12/Nov/2014:16:53:13 +0100] NSMMReplicationPlugin - changelog
>     program - _cl5WriteOperationTxn: retry (49) the transaction
>     (csn=54638260000000020000) failed (rc=-30993 (BDB0068
>     DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
>     [12/Nov/2014:16:53:13 +0100] NSMMReplicationPlugin - changelog
>     program - _cl5WriteOperationTxn: failed to write entry with csn
>     (54638260000000020000); db error - -30993 BDB0068
>     DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
>     [12/Nov/2014:16:53:13 +0100] NSMMReplicationPlugin -
>     write_changelog_and_ruv: can't add a change for
>     cn=MAT471-2014,ou=2014,ou=Cours,ou=Enseignement,ou=Groupes,dc=id,dc=polytechnique,dc=edu
>     (uniqid: f4e70b85-6a8311e4-a42c901a-94623bee, optype: 16) to
>     changelog csn 54638260000000020000
>     [12/Nov/2014:16:55:11 +0100] NSMMReplicationPlugin - changelog
>     program - _cl5WriteOperationTxn: retry (49) the transaction
>     (csn=546382d6000400020000) failed (rc=-30993 (BDB0068
>     DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
>     [12/Nov/2014:16:55:11 +0100] NSMMReplicationPlugin - changelog
>     program - _cl5WriteOperationTxn: failed to write entry with csn
>     (546382d6000400020000); db error - -30993 BDB0068
>     DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
>     [12/Nov/2014:16:55:11 +0100] NSMMReplicationPlugin -
>     write_changelog_and_ruv: can't add a change for
>     cn=MEC592-2014,ou=2014,ou=Cours,ou=Enseignement,ou=Groupes,dc=id,dc=polytechnique,dc=edu
>     (uniqid: 3c6d978a-6a8411e4-a42c901a-94623bee, optype: 16) to
>     changelog csn 546382d6000400020000
>     [12/Nov/2014:16:56:29 +0100] NSMMReplicationPlugin - changelog
>     program - _cl5WriteOperationTxn: retry (49) the transaction
>     (csn=54638325000000020000) failed (rc=-30993 (BDB0068
>     DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
>     [12/Nov/2014:16:56:29 +0100] NSMMReplicationPlugin - changelog
>     program - _cl5WriteOperationTxn: failed to write entry with csn
>     (54638325000000020000); db error - -30993 BDB0068
>     DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
>     [12/Nov/2014:16:56:29 +0100] NSMMReplicationPlugin -
>     write_changelog_and_ruv: can't add a change for
>     cn=PHY566-2014,ou=2014,ou=Cours,ou=Enseignement,ou=Groupes,dc=id,dc=polytechnique,dc=edu
>     (uniqid: 6030dd93-6a8411e4-a42c901a-94623bee, optype: 16) to
>     changelog csn 54638325000000020000
>     [12/Nov/2014:16:57:20 +0100] NSMMReplicationPlugin - changelog
>     program - _cl5WriteOperationTxn: retry (49) the transaction
>     (csn=5463835d000000020000) failed (rc=-30993 (BDB0068
>     DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
>     [12/Nov/2014:16:57:20 +0100] NSMMReplicationPlugin - changelog
>     program - _cl5WriteOperationTxn: failed to write entry with csn
>     (5463835d000000020000); db error - -30993 BDB0068
>     DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
>     [12/Nov/2014:16:57:20 +0100] NSMMReplicationPlugin -
>     write_changelog_and_ruv: can't add a change for
>     cn=PHY651K-2014,ou=2014,ou=Cours,ou=Enseignement,ou=Groupes,dc=id,dc=polytechnique,dc=edu
>     (uniqid: 83f42395-6a8411e4-a42c901a-94623bee, optype: 16) to
>     changelog csn 5463835d000000020000
>

See https://bugzilla.redhat.com/show_bug.cgi?id=979169

>
>
>     Each time the failed group entries are different.
>
>     The replication agreements look liek this :
>     0 cn=Replication from ldap-edev.polytechnique.fr to
>     ldap-model.polytechnique.fr,cn=replica,cn=dc\\3Did\\2Cdc\\3Dpolytechnique\\2Cdc\\3Dedu,cn=mapping
>     tree,cn=config
>     objectClass: top
>     objectClass: nsDS5ReplicationAgreement
>     cn: Replication from ldap-edev.polytechnique.fr to
>     ldap-model.polytechnique.fr
>     description: Replication agreement from server
>     ldap-edev.polytechnique.fr to server ldap-model.polytechnique.fr
>     nsDS5ReplicaHost: ldap-model.polytechnique.fr
>     nsDS5ReplicaRoot: dc=id,dc=polytechnique,dc=edu
>     nsDS5ReplicaPort: 636
>     nsDS5ReplicaTransportInfo: SSL
>     nsDS5ReplicaBindDN: cn=RepliX,cn=config
>     nsDS5ReplicaBindMethod: simple
>     nsDS5ReplicatedAttributeList: (objectclass=*) $ EXCLUDE entryusn
>     memberOf
>     nsDS5ReplicatedAttributeListTotal: (objectclass=*) $ EXCLUDE entryusn
>     nsds5ReplicaStripAttrs: modifiersName modifyTimestamp
>     internalModifiersName internalModifyTimestamp internalCreatorsname
>     nsds5replicaBusyWaitTime: 3
>     nsds5replicaTimeout: 30
>     nsDS5ReplicaCredentials: {DES}...
>     nsds5ReplicaEnabled: on
>     nsds50ruv: {replicageneration} 54636df3000000020000
>     nsds50ruv: {replica 1 ldap://ldap-model.polytechnique.fr:389}
>     54637b3c000000010000 54637dce000000010000
>     nsds50ruv: {replica 2 ldap://ldap-edev.polytechnique.fr:389}
>     54636ffe000000020000 54637d2d000000020000
>     nsruvReplicaLastModified: {replica 1
>     ldap://ldap-model.polytechnique.fr:389} 00000000
>     nsruvReplicaLastModified: {replica 2
>     ldap://ldap-edev.polytechnique.fr:389} 00000000
>     nsds5replicareapactive: 0
>     nsds5replicaLastUpdateStart: 20141112173922Z
>     nsds5replicaLastUpdateEnd: 20141112173922Z
>     nsds5replicaChangesSentSinceStartup: 2:1899/21484
>     nsds5replicaLastUpdateStatus: 0 Replica acquired successfully:
>     Incremental update succeeded
>     nsds5replicaUpdateInProgress: FALSE
>     nsds5replicaLastInitStart: 0
>     nsds5replicaLastInitEnd: 0
>
>
>     The DNA plug-in is disabled (excluding
>     https://fedorahosted.org/389/ticket/47410)
>
>
>     I've made a ticket concerning  this problem
>     (https://fedorahosted.org/389/ticket/47410)
>
> The ticket for this problem is of course 
> https://fedorahosted.org/389/ticket/47954, not the one i've written in 
> the original mail.
>
>
>
>
>
>
>     Another problem when doing these tests, with replication
>     agreements suspended is the following message in the error log:
>
>     12/Nov/2014:16:15:35 +0100] - Retry count exceeded in modify
>     [12/Nov/2014:16:15:38 +0100] - Retry count exceeded in modify
>     [12/Nov/2014:16:15:40 +0100] - Retry count exceeded in modify
>     [12/Nov/2014:16:15:43 +0100] - Retry count exceeded in modify
>     [12/Nov/2014:16:15:45 +0100] - Retry count exceeded in modify
>     [12/Nov/2014:16:15:47 +0100] - Retry count exceeded in modify
>     [12/Nov/2014:16:15:50 +0100] - Retry count exceeded in modify
>     [12/Nov/2014:16:15:52 +0100] - Retry count exceeded in modify
>     [12/Nov/2014:16:15:55 +0100] - Retry count exceeded in modify
>     [12/Nov/2014:16:15:57 +0100] - Retry count exceeded in modify
>     [12/Nov/2014:16:15:59 +0100] - Retry count exceeded in modify
>     [12/Nov/2014:16:16:02 +0100] - Retry count exceeded in modify
>     [12/Nov/2014:16:16:04 +0100] - Retry count exceeded in modify
>     [12/Nov/2014:16:16:06 +0100] - Retry count exceeded in modify
>     [12/Nov/2014:16:16:09 +0100] - Retry count exceeded in modify
>     [12/Nov/2014:16:16:11 +0100] - Retry count exceeded in modify
>     [12/Nov/2014:16:16:14 +0100] - Retry count exceeded in modify
>     [12/Nov/2014:16:16:17 +0100] - Retry count exceeded in modify
>     [12/Nov/2014:16:16:19 +0100] - Retry count exceeded in modify
>     [12/Nov/2014:16:16:22 +0100] - Retry count exceeded in modify
>     [12/Nov/2014:16:16:24 +0100] - Retry count exceeded in modify
>     ...
>     (it goes on for approx 5 minutes)
>     [12/Nov/2014:16:20:31 +0100] - Retry count exceeded in modify
>     [12/Nov/2014:16:20:33 +0100] - Retry count exceeded in modify
>     [12/Nov/2014:16:20:36 +0100] - Retry count exceeded in modify
>     [12/Nov/2014:16:20:42 +0100] - Retry count exceeded in modify
>
>
>     During these error messages the server doe snot consume CPU but it
>     is does not allow any changes to continue. Do i need to open
>     another ticket for this second problem?
>
>     Thank you!
>
>
>     Regards,
>     Andrey
>
>     --
>     389 users mailing list
>     389-users at lists.fedoraproject.org
>     https://admin.fedoraproject.org/mailman/listinfo/389-users
>
>
>
>
> --
> 389 users mailing list
> 389-users at lists.fedoraproject.org
> https://admin.fedoraproject.org/mailman/listinfo/389-users

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.fedoraproject.org/pipermail/389-users/attachments/20141112/c9e15f4d/attachment.html>


More information about the 389-users mailing list