[389-users] changelog deadlock replication failures with DNA

thierry bordaz tbordaz at redhat.com
Mon Jun 17 09:21:14 UTC 2013


Hi Mahadevan,

    I think you hit a new bug that I was able to reproduce.

    The problem is an incorrect handling of operation return code when
    there is an error in SLAPI_PLUGIN_BE_TXN_POST_ADD_FN.
    When the operation succeeds in the DB, DS (as a master) updates the
    CL and RUV in SLAPI_PLUGIN_BE_TXN_POST_ADD_FN phase.
    At this moment the operation is successful. However the update of
    CL/RUV may fail (your case). It correctly aborts the update of the
    CL AND the txn related to the operation in the DB (like if the
    operation never happened)... but it looks like there is a bug where
    it keeps the operation returned code.
    The operation return code should be a failure, but it is successful.

    I reproduced the problem with a debug version where I simulate a
    DB_DEADLOCK failure

        [17/Jun/2013:11:07:21 +0200] - slapd started. Listening on All
        Interfaces port 24502 for LDAP requests
        [17/Jun/2013:11:08:23 +0200] NSMMReplicationPlugin - changelog
        program - _cl5WriteOperationTxn: failed to write entry with csn
        (51bed207000000010000); db error - -30993 BDB0068
        DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
        [17/Jun/2013:11:08:23 +0200] NSMMReplicationPlugin -
        write_changelog_and_ruv: can't add a change for uid=td,dc=com
        (uniqid: 74589a01-d72d11e2-b4c2bdbf-b0a22af5, optype: 16) to
        changelog csn 51bed207000000010000


        [tbordaz at pctbordaz userRoot]$ tail -20
        /var/log/dirsrv/slapd-master_cl/access
        ...
        [17/Jun/2013:11:08:23 +0200] conn=2 op=1 ADD dn="uid=td,dc=com"
        [17/Jun/2013:11:08:23 +0200] conn=2 op=1 RESULT err=0 tag=105
        nentries=0 etime=0 csn=51bed207000000010000


    I do not know why your environment is prone to trigger db_deadlock
    (lot of replica agreements, VM, slow disks...).
    I think the best way to progress is that you fill a ticket/bug so
    that we may track the issue. Note this bug is possibly affecting all
    operations (ADD/MOD/MODRDN/DEL)

best regards
thierry
On 06/15/2013 01:03 AM, Mahadevan, Venkat wrote:
>
> > If the operation fails to write into the changelog, the operation 
> fails. In your case, it means that the ldapclient should receive an 
> error.
> > So it is like if the operation never happened and is not replicated.
>
> Hi Thierry,
>
> That’s what I would expect to, but that does not seem to be the case. 
> In my access log on the master server below,
>
> the client successfully adds the entry to the master server and 
> receives an error code return of 0:
>
> [12/Jun/2013:13:36:59 -0700] conn=19048 op=0 BIND dn="cn=Directory 
> Manager" method=128 version=3
>
> [12/Jun/2013:13:36:59 -0700] conn=19048 op=0 RESULT err=0 tag=97 
> nentries=0 etime=0 dn="cn=directory manager"
>
> [12/Jun/2013:13:36:59 -0700] conn=19048 op=1 ADD 
> dn="uid=jmeter325,dc=tst,dc=id,dc=ubc,dc=ca"
>
> [12/Jun/2013:13:37:04 -0700] conn=19048 op=1 RESULT err=0 tag=105 
> nentries=0 etime=5 csn=51b8dbec002b02bd0000
>
> In the error log at the same time the above happens:
>
> [12/Jun/2013:13:37:02 -0700] NSMMReplicationPlugin - 
> replica_replace_ruv_tombstone: failed to update replication update 
> vector for replica dc=tst,dc=id,dc=ubc,dc=ca: LDAP error - 51
>
> [12/Jun/2013:13:37:04 -0700] NSMMReplicationPlugin - changelog program 
> - _cl5WriteOperationTxn: retry (49) the transaction 
> (csn=51b8dbec002b02bd0000) failed (rc=-30994 (DB_LOCK_DEADLOCK: Locker 
> killed to resolve a deadlock))
>
> [12/Jun/2013:13:37:04 -0700] NSMMReplicationPlugin - changelog program 
> - _cl5WriteOperationTxn: failed to write entry with csn 
> (51b8dbec002b02bd0000); db error - -30994 DB_LOCK_DEADLOCK: Locker 
> killed to resolve a deadlock
>
> [12/Jun/2013:13:37:04 -0700] NSMMReplicationPlugin - 
> write_changelog_and_ruv: can't add a change for 
> uid=jmeter325,dc=tst,dc=id,dc=ubc,dc=ca (uniqid: 
> c0a7827c-d39f11e2-879add5e-44f44922, optype: 16) to changelog csn 
> 51b8dbec002b02bd0000
>
> The net result is that the entry is successfully added to the master 
> server but will never replication to any of the consumers.
>
> cheers,
>
> VM
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.fedoraproject.org/pipermail/389-users/attachments/20130617/a7d63c1b/attachment.html>


More information about the 389-users mailing list