[389-users] Failed to send extended operation: LDAP error -1 (Can't contact LDAP server)

Graham Leggett minfrin at sharp.fm
Sun May 4 20:18:56 UTC 2014


On 04 May 2014, at 9:35 PM, David Boreham <david_list at boreham.org> wrote:

> ottomh I can't think what's up with your servers, but for sure there's something very odd going on (beyond just "not configured correctly" I suspect). The errors where the replica initialization reports that the database index files have been deleted underneath it are particularly wacky.
> 
> Anyway, I'd recommend that you turn up the logging level on the servers. This will hopefully reveal more about what's going wrong. Also look carefully in the log for the _first_ sign of trouble. That will likely be the root cause. I suspect you have a lot of fog of war showing up that's generated by some underlying prime error, that aught to have appeared first in the timeline.
> 
> It should be possible to add an N+1th replica to an N-node deployment. Replication agreements are peer-to-peer, so you just add a new replication agreement from each of the servers you want to feed changes to the N+1th (typically all of them).
> 
> In the log messages, where you were wondering which consumer server is throwing the error, the name of the replication agreement is typically printed. So the server it is trying to talk to is the one that's the target for the replication agreement mentioned.

On the serverc machine, when serverb has been instructed to initialise serverc, we see the following with replication debugging enabled:

[04/May/2014:23:03:36 +0200] NSMMReplicationPlugin - conn=1 op=3 Acquired consumer connection extension
[04/May/2014:23:03:36 +0200] NSMMReplicationPlugin - conn=1 op=3 repl="o=Foo,c=za": Begin total protocol
[04/May/2014:23:03:36 +0200] - csngen_adjust_time: gen state before 5366aac20001:1399237314:0:0
[04/May/2014:23:03:36 +0200] - _csngen_adjust_local_time: gen state before 5366aac20001:1399237314:0:0
[04/May/2014:23:03:36 +0200] - _csngen_adjust_local_time: gen state after 5366ab280000:1399237416:0:0
[04/May/2014:23:03:36 +0200] - csngen_adjust_time: gen state after 536708fc0001:1399237416:0:24020
[04/May/2014:23:03:36 +0200] NSMMReplicationPlugin - conn=1 op=3 repl="o=Foo,c=ZA": Acquired replica
[04/May/2014:23:03:36 +0200] NSMMReplicationPlugin - multimaster_be_state_change: replica o=Foo,c=za is going offline; disabling replication
[04/May/2014:23:03:36 +0200] NSMMReplicationPlugin - conn=0 op=0 repl="o=Foo,c=ZA": Replica in use locking_purl=conn=1 id=3
[04/May/2014:23:03:36 +0200] NSMMReplicationPlugin - replica_disable_replication: replica o=Foo,c=za is acquired
[04/May/2014:23:03:36 +0200] - WARNING: Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
[04/May/2014:23:03:36 +0200] NSMMReplicationPlugin - conn=1 op=3 repl="o=Foo,c=za": StartNSDS90ReplicationRequest: response=0 rc=0
[04/May/2014:23:03:36 +0200] NSMMReplicationPlugin - conn=1 op=3 Relinquishing consumer connection extension

Nothing in the above seems to indicate an error that I can see, but we now see this two seconds later:

[04/May/2014:23:03:38 +0200] - ERROR bulk import abandoned
[04/May/2014:23:03:38 +0200] - import userRoot: Aborting all Import threads...
[04/May/2014:23:03:43 +0200] - import userRoot: Import threads aborted.
[04/May/2014:23:03:43 +0200] - import userRoot: Closing files...
[04/May/2014:23:03:43 +0200] - libdb: userRoot/uid.db4: unable to flush: No such file or directory
[04/May/2014:23:03:43 +0200] - libdb: userRoot/mailAlternateAddress.db4: unable to flush: No such file or directory
[04/May/2014:23:03:43 +0200] - libdb: userRoot/cn.db4: unable to flush: No such file or directory
[04/May/2014:23:03:43 +0200] - libdb: userRoot/mail.db4: unable to flush: No such file or directory
[04/May/2014:23:03:43 +0200] - libdb: userRoot/givenName.db4: unable to flush: No such file or directory
[04/May/2014:23:03:43 +0200] - libdb: userRoot/aci.db4: unable to flush: No such file or directory
[04/May/2014:23:03:43 +0200] - libdb: userRoot/entryrdn.db4: unable to flush: No such file or directory
[04/May/2014:23:03:43 +0200] - libdb: userRoot/id2entry.db4: unable to flush: No such file or directory
[04/May/2014:23:03:43 +0200] - libdb: userRoot/sn.db4: unable to flush: No such file or directory
[04/May/2014:23:03:43 +0200] - libdb: userRoot/parentid.db4: unable to flush: No such file or directory
[04/May/2014:23:03:43 +0200] - libdb: userRoot/objectclass.db4: unable to flush: No such file or directory
[04/May/2014:23:03:43 +0200] - libdb: userRoot/nsuniqueid.db4: unable to flush: No such file or directory
[04/May/2014:23:03:43 +0200] - libdb: userRoot/mailHost.db4: unable to flush: No such file or directory
[04/May/2014:23:03:43 +0200] - libdb: userRoot/uniquemember.db4: unable to flush: No such file or directory
[04/May/2014:23:03:43 +0200] - import userRoot: Import failed.
[04/May/2014:23:03:43 +0200] NSMMReplicationPlugin - Aborting total update in progress for replicated area o=Foo,c=ZA connid=1
[04/May/2014:23:03:43 +0200] - process_bulk_import_op: NULL target sdn
[04/May/2014:23:03:43 +0200] NSMMReplicationPlugin - conn=1 op=-1 repl="o=Foo,c=ZA": Released replica held by locking_purl=conn=1 id=3

It appears that despite the initial sync as having failed, we ignore the above error and pretend all is ok, I suspect this is why we're getting the weird messages below.

[04/May/2014:23:03:45 +0200] NSMMReplicationPlugin - conn=2 op=3 Acquired consumer connection extension
[04/May/2014:23:03:45 +0200] NSMMReplicationPlugin - conn=2 op=3 repl="o=Foo,c=za": Begin incremental protocol
[04/May/2014:23:03:45 +0200] - csngen_adjust_time: gen state before 536708fc0001:1399237416:0:24020
[04/May/2014:23:03:45 +0200] - _csngen_adjust_local_time: gen state before 536708fc0001:1399237416:0:24020
[04/May/2014:23:03:45 +0200] - _csngen_adjust_local_time: gen state after 536709050000:1399237425:0:24020
[04/May/2014:23:03:45 +0200] - csngen_adjust_time: gen state after 536709050000:1399237425:0:24020
[04/May/2014:23:03:45 +0200] NSMMReplicationPlugin - conn=2 op=3 repl="o=Foo,c=ZA": Acquired replica
[04/May/2014:23:03:45 +0200] NSMMReplicationPlugin - conn=2 op=3 repl="o=Foo,c=za": StartNSDS90ReplicationRequest: response=0 rc=0
[04/May/2014:23:03:45 +0200] NSMMReplicationPlugin - conn=2 op=3 Relinquishing consumer connection extension
[04/May/2014:23:03:46 +0200] NSMMReplicationPlugin - conn=2 op=4 Acquired consumer connection extension
[04/May/2014:23:03:46 +0200] NSMMReplicationPlugin - replica_replace_ruv_tombstone: failed to update replication update vector for replica o=Foo,c=ZA: LDAP error - 1

Turning on connection debugging shows a whole lot of data transfer, then an anomaly just before the ERROR bulk import abandoned:

[04/May/2014:23:14:33 +0200] - listener got signaled
[04/May/2014:23:14:33 +0200] - activity on 64r
[04/May/2014:23:14:33 +0200] - read activity on 64
[04/May/2014:23:14:33 +0200] - listener got signaled
[04/May/2014:23:14:33 +0200] - activity on 64r
[04/May/2014:23:14:33 +0200] - read activity on 64
[04/May/2014:23:14:33 +0200] - ber_get_next failed for connection 2
[04/May/2014:23:14:33 +0200] - conn 2 leaving turbo mode due to 3
[04/May/2014:23:14:33 +0200] - listener got signaled
[04/May/2014:23:14:33 +0200] - ERROR bulk import abandoned

We got "ber_get_next failed for connection 2" without being told what exactly failed. Will dig on the other side to see if there are any clues.

Regards,
Graham
--




More information about the 389-users mailing list