Hi Mahadevan,
I think you hit a new bug that I was able to reproduce.
The problem is an incorrect handling of operation return code when
there is an error in SLAPI_PLUGIN_BE_TXN_POST_ADD_FN.
When the operation succeeds in the DB, DS (as a master) updates the
CL and RUV in SLAPI_PLUGIN_BE_TXN_POST_ADD_FN phase.
At this moment the operation is successful. However the update of
CL/RUV may fail (your case). It correctly aborts the update of the
CL AND the txn related to the operation in the DB (like if the
operation never happened)... but it looks like there is a bug where
it keeps the operation returned code.
The operation return code should be a failure, but it is successful.
I reproduced the problem with a debug version where I simulate a
DB_DEADLOCK failure
[17/Jun/2013:11:07:21 +0200] - slapd started. Listening on All
Interfaces port 24502 for LDAP requests
[17/Jun/2013:11:08:23 +0200] NSMMReplicationPlugin - changelog
program - _cl5WriteOperationTxn: failed to write entry with csn
(51bed207000000010000); db error - -30993 BDB0068
DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
[17/Jun/2013:11:08:23 +0200] NSMMReplicationPlugin -
write_changelog_and_ruv: can't add a change for uid=td,dc=com
(uniqid: 74589a01-d72d11e2-b4c2bdbf-b0a22af5, optype: 16) to
changelog csn 51bed207000000010000
[tbordaz@pctbordaz userRoot]$ tail -20
/var/log/dirsrv/slapd-master_cl/access
...
[17/Jun/2013:11:08:23 +0200] conn=2 op=1 ADD dn="uid=td,dc=com"
[17/Jun/2013:11:08:23 +0200] conn=2 op=1 RESULT err=0 tag=105
nentries=0 etime=0 csn=51bed207000000010000
I do not know why your environment is prone to trigger db_deadlock
(lot of replica agreements, VM, slow disks...).
I think the best way to progress is that you fill a ticket/bug so
that we may track the issue. Note this bug is possibly affecting all
operations (ADD/MOD/MODRDN/DEL)
best regards
thierry
On 06/15/2013 01:03 AM, Mahadevan, Venkat wrote:
> If the operation fails to write into the changelog, the operation
fails. In your case, it means that the ldapclient should receive an
error.
> So it is like if the operation never happened and is not replicated.
Hi Thierry,
That’s what I would expect to, but that does not seem to be the case.
In my access log on the master server below,
the client successfully adds the entry to the master server and
receives an error code return of 0:
[12/Jun/2013:13:36:59 -0700] conn=19048 op=0 BIND dn="cn=Directory
Manager" method=128 version=3
[12/Jun/2013:13:36:59 -0700] conn=19048 op=0 RESULT err=0 tag=97
nentries=0 etime=0 dn="cn=directory manager"
[12/Jun/2013:13:36:59 -0700] conn=19048 op=1 ADD
dn="uid=jmeter325,dc=tst,dc=id,dc=ubc,dc=ca"
[12/Jun/2013:13:37:04 -0700] conn=19048 op=1 RESULT err=0 tag=105
nentries=0 etime=5 csn=51b8dbec002b02bd0000
In the error log at the same time the above happens:
[12/Jun/2013:13:37:02 -0700] NSMMReplicationPlugin -
replica_replace_ruv_tombstone: failed to update replication update
vector for replica dc=tst,dc=id,dc=ubc,dc=ca: LDAP error - 51
[12/Jun/2013:13:37:04 -0700] NSMMReplicationPlugin - changelog program
- _cl5WriteOperationTxn: retry (49) the transaction
(csn=51b8dbec002b02bd0000) failed (rc=-30994 (DB_LOCK_DEADLOCK: Locker
killed to resolve a deadlock))
[12/Jun/2013:13:37:04 -0700] NSMMReplicationPlugin - changelog program
- _cl5WriteOperationTxn: failed to write entry with csn
(51b8dbec002b02bd0000); db error - -30994 DB_LOCK_DEADLOCK: Locker
killed to resolve a deadlock
[12/Jun/2013:13:37:04 -0700] NSMMReplicationPlugin -
write_changelog_and_ruv: can't add a change for
uid=jmeter325,dc=tst,dc=id,dc=ubc,dc=ca (uniqid:
c0a7827c-d39f11e2-879add5e-44f44922, optype: 16) to changelog csn
51b8dbec002b02bd0000
The net result is that the entry is successfully added to the master
server but will never replication to any of the consumers.
cheers,
VM