[389-users] 389ds v.1.3.2.24 replication deadlocks/retry count exceeded

Ivanov Andrey (M.) andrey.ivanov at polytechnique.fr
Wed Nov 12 17:55:58 UTC 2014


----- Mail original -----

> De: "Ivanov Andrey (M.)" <andrey.ivanov at polytechnique.fr>
> À: "General discussion list for the 389 Directory server project."
> <389-users at lists.fedoraproject.org>
> Envoyé: Mercredi 12 Novembre 2014 18:52:44
> Objet: [389-users] 389ds v.1.3.2.24 replication deadlocks/retry count
> exceeded

> Hi,

> I've continued testing 389ds v.1.3.2.24 on CentOS 7. I really have an
> impression that everything works fine (plugins etc) but the replication
> seems to be a little fragile. Both of the tickets i've already opened
> concern replication partially or completely
> (https://fedorahosted.org/389/ticket/47942 and
> https://fedorahosted.org/389/ticket/47950).

> Here is another issue with replication :
> i have two servers with multi-master agreements on each of them (the same
> configuration as in ticket https://fedorahosted.org/389/ticket/47942).

> We add/delete a lot of groups (943, to be exact). Each group may contain a
> large number of referenced entries, up to ~250 (uniqueMember: dn). MemberOf
> plugin is activated and works fine. Referential integrity plugin is also
> activated but of course it is of any sense only when deleting groups (or
> renaming them). It goes on for a long time (20-30 minutes or more). Some
> time after the beginning of the operations (typically 5-8 minutes) we have
> replication erros and inconsistency of the replica concerning the entries
> mentioned in error log.

> When adding and deleting groups the supplier is ok. Howevere the consumer has
> several (from one to four or five) groupe deletions/adds that are not
> replicated. The error on the supplier:

> [12/Nov/2014:16:46:42 +0100] NSMMReplicationPlugin - agmt="cn=Replication
> from ldap-edev.polytechnique.fr to ldap-model.polytechnique.fr"
> (ldap-model:636): Consumer failed to replay change (uniqueid
> fa90219d-6a8211e4-a42c901a-94623bee, CSN 546380d6000000020000): Operations
> error (1). Will retry later.
> [12/Nov/2014:16:47:55 +0100] NSMMReplicationPlugin - agmt="cn=Replication
> from ldap-edev.polytechnique.fr to ldap-model.polytechnique.fr"
> (ldap-model:636): Consumer failed to replay change (uniqueid
> 1e5367ae-6a8311e4-a42c901a-94623bee, CSN 54638125000000020000): Operations
> error (1). Will retry later.
> [12/Nov/2014:16:53:14 +0100] NSMMReplicationPlugin - agmt="cn=Replication
> from ldap-edev.polytechnique.fr to ldap-model.polytechnique.fr"
> (ldap-model:636): Consumer failed to replay change (uniqueid
> f4e70b85-6a8311e4-a42c901a-94623bee, CSN 54638260000000020000): Operations
> error (1). Will retry later.
> [12/Nov/2014:16:55:12 +0100] NSMMReplicationPlugin - agmt="cn=Replication
> from ldap-edev.polytechnique.fr to ldap-model.polytechnique.fr"
> (ldap-model:636): Consumer failed to replay change (uniqueid
> 3c6d978a-6a8411e4-a42c901a-94623bee, CSN 546382d6000400020000): Operations
> error (1). Will retry later.
> [12/Nov/2014:16:56:31 +0100] NSMMReplicationPlugin - agmt="cn=Replication
> from ldap-edev.polytechnique.fr to ldap-model.polytechnique.fr"
> (ldap-model:636): Consumer failed to replay change (uniqueid
> 6030dd93-6a8411e4-a42c901a-94623bee, CSN 54638325000000020000): Operations
> error (1). Will retry later.
> [12/Nov/2014:16:57:22 +0100] NSMMReplicationPlugin - agmt="cn=Replication
> from ldap-edev.polytechnique.fr to ldap-model.polytechnique.fr"
> (ldap-model:636): Consumer failed to replay change (uniqueid
> 83f42395-6a8411e4-a42c901a-94623bee, CSN 5463835d000000020000): Operations
> error (1). Will retry later.

> The corresponding errors on the consumer seem to hint deadlocks in these
> cases:
> [12/Nov/2014:16:46:41 +0100] NSMMReplicationPlugin - changelog program -
> _cl5WriteOperationTxn: retry (49) the transaction (csn=546380d6000000020000)
> failed (rc=-30993 (BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a
> deadlock))
> [12/Nov/2014:16:46:41 +0100] NSMMReplicationPlugin - changelog program -
> _cl5WriteOperationTxn: failed to write entry with csn
> (546380d6000000020000); db error - -30993 BDB0068 DB_LOCK_DEADLOCK: Locker
> killed to resolve a deadlock
> [12/Nov/2014:16:46:41 +0100] NSMMReplicationPlugin - write_changelog_and_ruv:
> can't add a change for
> cn=LAN452ESP-2014,ou=2014,ou=Cours,ou=Enseignement,ou=Groupes,dc=id,dc=polytechnique,dc=edu
> (uniqid: fa90219d-6a8211e4-a42c901a-94623bee, optype: 16) to changelog csn
> 546380d6000000020000
> [12/Nov/2014:16:47:54 +0100] NSMMReplicationPlugin - changelog program -
> _cl5WriteOperationTxn: retry (49) the transaction (csn=54638125000000020000)
> failed (rc=-30993 (BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a
> deadlock))
> [12/Nov/2014:16:47:54 +0100] NSMMReplicationPlugin - changelog program -
> _cl5WriteOperationTxn: failed to write entry with csn
> (54638125000000020000); db error - -30993 BDB0068 DB_LOCK_DEADLOCK: Locker
> killed to resolve a deadlock
> [12/Nov/2014:16:47:54 +0100] NSMMReplicationPlugin - write_changelog_and_ruv:
> can't add a change for
> cn=LAN472EFLE-2014,ou=2014,ou=Cours,ou=Enseignement,ou=Groupes,dc=id,dc=polytechnique,dc=edu
> (uniqid: 1e5367ae-6a8311e4-a42c901a-94623bee, optype: 16) to changelog csn
> 54638125000000020000
> [12/Nov/2014:16:53:13 +0100] NSMMReplicationPlugin - changelog program -
> _cl5WriteOperationTxn: retry (49) the transaction (csn=54638260000000020000)
> failed (rc=-30993 (BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a
> deadlock))
> [12/Nov/2014:16:53:13 +0100] NSMMReplicationPlugin - changelog program -
> _cl5WriteOperationTxn: failed to write entry with csn
> (54638260000000020000); db error - -30993 BDB0068 DB_LOCK_DEADLOCK: Locker
> killed to resolve a deadlock
> [12/Nov/2014:16:53:13 +0100] NSMMReplicationPlugin - write_changelog_and_ruv:
> can't add a change for
> cn=MAT471-2014,ou=2014,ou=Cours,ou=Enseignement,ou=Groupes,dc=id,dc=polytechnique,dc=edu
> (uniqid: f4e70b85-6a8311e4-a42c901a-94623bee, optype: 16) to changelog csn
> 54638260000000020000
> [12/Nov/2014:16:55:11 +0100] NSMMReplicationPlugin - changelog program -
> _cl5WriteOperationTxn: retry (49) the transaction (csn=546382d6000400020000)
> failed (rc=-30993 (BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a
> deadlock))
> [12/Nov/2014:16:55:11 +0100] NSMMReplicationPlugin - changelog program -
> _cl5WriteOperationTxn: failed to write entry with csn
> (546382d6000400020000); db error - -30993 BDB0068 DB_LOCK_DEADLOCK: Locker
> killed to resolve a deadlock
> [12/Nov/2014:16:55:11 +0100] NSMMReplicationPlugin - write_changelog_and_ruv:
> can't add a change for
> cn=MEC592-2014,ou=2014,ou=Cours,ou=Enseignement,ou=Groupes,dc=id,dc=polytechnique,dc=edu
> (uniqid: 3c6d978a-6a8411e4-a42c901a-94623bee, optype: 16) to changelog csn
> 546382d6000400020000
> [12/Nov/2014:16:56:29 +0100] NSMMReplicationPlugin - changelog program -
> _cl5WriteOperationTxn: retry (49) the transaction (csn=54638325000000020000)
> failed (rc=-30993 (BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a
> deadlock))
> [12/Nov/2014:16:56:29 +0100] NSMMReplicationPlugin - changelog program -
> _cl5WriteOperationTxn: failed to write entry with csn
> (54638325000000020000); db error - -30993 BDB0068 DB_LOCK_DEADLOCK: Locker
> killed to resolve a deadlock
> [12/Nov/2014:16:56:29 +0100] NSMMReplicationPlugin - write_changelog_and_ruv:
> can't add a change for
> cn=PHY566-2014,ou=2014,ou=Cours,ou=Enseignement,ou=Groupes,dc=id,dc=polytechnique,dc=edu
> (uniqid: 6030dd93-6a8411e4-a42c901a-94623bee, optype: 16) to changelog csn
> 54638325000000020000
> [12/Nov/2014:16:57:20 +0100] NSMMReplicationPlugin - changelog program -
> _cl5WriteOperationTxn: retry (49) the transaction (csn=5463835d000000020000)
> failed (rc=-30993 (BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a
> deadlock))
> [12/Nov/2014:16:57:20 +0100] NSMMReplicationPlugin - changelog program -
> _cl5WriteOperationTxn: failed to write entry with csn
> (5463835d000000020000); db error - -30993 BDB0068 DB_LOCK_DEADLOCK: Locker
> killed to resolve a deadlock
> [12/Nov/2014:16:57:20 +0100] NSMMReplicationPlugin - write_changelog_and_ruv:
> can't add a change for
> cn=PHY651K-2014,ou=2014,ou=Cours,ou=Enseignement,ou=Groupes,dc=id,dc=polytechnique,dc=edu
> (uniqid: 83f42395-6a8411e4-a42c901a-94623bee, optype: 16) to changelog csn
> 5463835d000000020000

> Each time the failed group entries are different.

> The replication agreements look liek this :
> 0 cn=Replication from ldap-edev.polytechnique.fr to
> ldap-model.polytechnique.fr,cn=replica,cn=dc\\3Did\\2Cdc\\3Dpolytechnique\\2Cdc\\3Dedu,cn=mapping
> tree,cn=config
> objectClass: top
> objectClass: nsDS5ReplicationAgreement
> cn: Replication from ldap-edev.polytechnique.fr to
> ldap-model.polytechnique.fr
> description: Replication agreement from server ldap-edev.polytechnique.fr to
> server ldap-model.polytechnique.fr
> nsDS5ReplicaHost: ldap-model.polytechnique.fr
> nsDS5ReplicaRoot: dc=id,dc=polytechnique,dc=edu
> nsDS5ReplicaPort: 636
> nsDS5ReplicaTransportInfo: SSL
> nsDS5ReplicaBindDN: cn=RepliX,cn=config
> nsDS5ReplicaBindMethod: simple
> nsDS5ReplicatedAttributeList: (objectclass=*) $ EXCLUDE entryusn memberOf
> nsDS5ReplicatedAttributeListTotal: (objectclass=*) $ EXCLUDE entryusn
> nsds5ReplicaStripAttrs: modifiersName modifyTimestamp internalModifiersName
> internalModifyTimestamp internalCreatorsname
> nsds5replicaBusyWaitTime: 3
> nsds5replicaTimeout: 30
> nsDS5ReplicaCredentials: {DES}...
> nsds5ReplicaEnabled: on
> nsds50ruv: {replicageneration} 54636df3000000020000
> nsds50ruv: {replica 1 ldap://ldap-model.polytechnique.fr:389}
> 54637b3c000000010000 54637dce000000010000
> nsds50ruv: {replica 2 ldap://ldap-edev.polytechnique.fr:389}
> 54636ffe000000020000 54637d2d000000020000
> nsruvReplicaLastModified: {replica 1 ldap://ldap-model.polytechnique.fr:389}
> 00000000
> nsruvReplicaLastModified: {replica 2 ldap://ldap-edev.polytechnique.fr:389}
> 00000000
> nsds5replicareapactive: 0
> nsds5replicaLastUpdateStart: 20141112173922Z
> nsds5replicaLastUpdateEnd: 20141112173922Z
> nsds5replicaChangesSentSinceStartup: 2:1899/21484
> nsds5replicaLastUpdateStatus: 0 Replica acquired successfully: Incremental
> update succeeded
> nsds5replicaUpdateInProgress: FALSE
> nsds5replicaLastInitStart: 0
> nsds5replicaLastInitEnd: 0

> The DNA plug-in is disabled (excluding
> https://fedorahosted.org/389/ticket/47410)

> I've made a ticket concerning this problem
> (https://fedorahosted.org/389/ticket/47410)

The ticket for this problem is of course https://fedorahosted.org/389/ticket/47954, not the one i've written in the original mail. 

> Another problem when doing these tests, with replication agreements suspended
> is the following message in the error log:

> 12/Nov/2014:16:15:35 +0100] - Retry count exceeded in modify
> [12/Nov/2014:16:15:38 +0100] - Retry count exceeded in modify
> [12/Nov/2014:16:15:40 +0100] - Retry count exceeded in modify
> [12/Nov/2014:16:15:43 +0100] - Retry count exceeded in modify
> [12/Nov/2014:16:15:45 +0100] - Retry count exceeded in modify
> [12/Nov/2014:16:15:47 +0100] - Retry count exceeded in modify
> [12/Nov/2014:16:15:50 +0100] - Retry count exceeded in modify
> [12/Nov/2014:16:15:52 +0100] - Retry count exceeded in modify
> [12/Nov/2014:16:15:55 +0100] - Retry count exceeded in modify
> [12/Nov/2014:16:15:57 +0100] - Retry count exceeded in modify
> [12/Nov/2014:16:15:59 +0100] - Retry count exceeded in modify
> [12/Nov/2014:16:16:02 +0100] - Retry count exceeded in modify
> [12/Nov/2014:16:16:04 +0100] - Retry count exceeded in modify
> [12/Nov/2014:16:16:06 +0100] - Retry count exceeded in modify
> [12/Nov/2014:16:16:09 +0100] - Retry count exceeded in modify
> [12/Nov/2014:16:16:11 +0100] - Retry count exceeded in modify
> [12/Nov/2014:16:16:14 +0100] - Retry count exceeded in modify
> [12/Nov/2014:16:16:17 +0100] - Retry count exceeded in modify
> [12/Nov/2014:16:16:19 +0100] - Retry count exceeded in modify
> [12/Nov/2014:16:16:22 +0100] - Retry count exceeded in modify
> [12/Nov/2014:16:16:24 +0100] - Retry count exceeded in modify
> ...
> (it goes on for approx 5 minutes)
> [12/Nov/2014:16:20:31 +0100] - Retry count exceeded in modify
> [12/Nov/2014:16:20:33 +0100] - Retry count exceeded in modify
> [12/Nov/2014:16:20:36 +0100] - Retry count exceeded in modify
> [12/Nov/2014:16:20:42 +0100] - Retry count exceeded in modify

> During these error messages the server doe snot consume CPU but it is does
> not allow any changes to continue. Do i need to open another ticket for this
> second problem?

> Thank you!

> Regards,
> Andrey

> --
> 389 users mailing list
> 389-users at lists.fedoraproject.org
> https://admin.fedoraproject.org/mailman/listinfo/389-users
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.fedoraproject.org/pipermail/389-users/attachments/20141112/e1c6b357/attachment.html>


More information about the 389-users mailing list