[389-users] Failed to send extended operation: LDAP error -1 (Can't contact LDAP server)

Mon May 5 15:24:26 UTC 2014

On 05/05/2014 08:55 AM, Graham Leggett wrote:
> On 05 May 2014, at 11:37 AM, Graham Leggett <minfrin at sharp.fm> wrote:
>
>>> It should be possible to add an N+1th replica to an N-node deployment. Replication agreements are peer-to-peer, so you just add a new replication agreement from each of the servers you want to feed changes to the N+1th (typically all of them).
>> What I've learned so far:
>>
>> - servera has "syntax checking" switched off, and contains data with syntax errors. The data is 15 years old.
>> - serverb has "syntax checking" switched on, but has successfully been able to replicate in the past. Now replication is broken with serverb.
>> - serverc has "syntax checking" switched on, and has never been able to replicate. Serverc is brand new.
>>
>> What appears to be happening is that during the replication process, an LDAP operation that is accepted on servera is being rejected by serverc. The replication process is brittle, and has not been coded to handle any kind of error during the replication process, and so fails abruptly with "ERROR bulk import abandoned" and no further explanation. The error that triggered the abort is only visible by turning trace logging on.
> With a higher level of trace logging I have learned some more.
>
> One of the objects being replicated is a large group containing about 21000 uniqueMembers. When it comes to replicate this object, the replication pauses for about 6 seconds or so, and at that point it times out, responding with the following misleading error message:
>
> [05/May/2014:15:33:36 +0100] NSMMReplicationPlugin - agmt="cn=Agreement serverc.example.com" (serverc:636): Failed to send extended operation: LDAP error -1 (Can't contact LDAP server)

You should only get this error message at the start or end of a
replication session, because the session is framed by extended
operations.  Normal update traffic uses regular operations, not
extended operations.

What could be happening is this:
The supplier is attempting to send an update that is "too big" for the
default maxbersize setting (nsslapd-maxbersize in cn=config) on the
consumer.
If this happens during a replica init, you will see the "ERROR: bulk
import abandoned" message in the consumer error log.
If it happens during an incremental update, you will see some sort of
error.
When this happens, the consumer immediately closes the connection
(to avoid a DoS attack).
The supplier still attempts to send the End Session extended
operation, and this fails with -1 (Can't contact LDAP server) because
the consumer has already closed the connection.
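
To make the ordering of events concrete, here is a toy model of the sequence described above. It is only a sketch: the class and function names are hypothetical and do not correspond to anything in the 389 Directory Server code, the 64-byte size for the extended operations is arbitrary, and it assumes the supplier does not notice the closed connection until its next send.

```python
# Toy model of the failure sequence. All names here are hypothetical;
# this is not how 389 Directory Server is implemented internally.

class ConnectionClosed(Exception):
    """Raised when the supplier writes to a connection the consumer closed."""


class Consumer:
    def __init__(self, max_ber_size):
        self.max_ber_size = max_ber_size
        self.open = True
        self.error_log = []

    def receive(self, kind, size):
        if not self.open:
            raise ConnectionClosed("Can't contact LDAP server")
        if size > self.max_ber_size:
            # Oversized message: log it and drop the connection at once.
            self.error_log.append(
                f"{kind} of {size} bytes exceeds limit {self.max_ber_size}"
            )
            self.open = False


def run_session(consumer, update_sizes):
    """Return the supplier-side log for one replication session."""
    log = []

    def send(kind, size):
        try:
            consumer.receive(kind, size)
            log.append(f"{kind}: ok")
            return True
        except ConnectionClosed as exc:
            log.append(f"{kind}: LDAP error -1 ({exc})")
            return False

    if send("start session extop", 64):
        for size in update_sizes:
            if not send("update", size):
                break
        # The supplier always tries to end the session, even after a failure.
        send("end session extop", 64)
    return log


if __name__ == "__main__":
    consumer = Consumer(max_ber_size=2097152)
    for line in run_session(consumer, [1000, 3000000]):
        print(line)
    print("consumer log:", consumer.error_log)
```

In this model the oversized update itself looks fine from the supplier side, and the only error the supplier reports is on the End Session extended operation, which is why the message is misleading.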

See https://fedorahosted.org/389/ticket/47606
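
If you want a rough idea of whether the large group could be hitting the limit, something like the following sketch may help. The average DN length and the per-value overhead are guesses you need to supply yourself (replication metadata can add to each value), the 2097152 figure is the documented default for nsslapd-maxbersize and should be checked against your own cn=config, and the function names are made up for illustration.

```python
# Rough size estimate for a large group entry, plus an LDIF snippet to
# raise nsslapd-maxbersize on the consumer. The input numbers are
# assumptions; measure your own data before relying on the result.

ASSUMED_DEFAULT_MAXBERSIZE = 2097152  # bytes; verify on your server


def estimate_entry_bytes(member_count, avg_dn_bytes, per_value_overhead):
    """Crude estimate: every member value costs its DN plus some overhead."""
    return member_count * (avg_dn_bytes + per_value_overhead)


def exceeds_limit(estimate, limit=ASSUMED_DEFAULT_MAXBERSIZE):
    return estimate > limit


def maxbersize_ldif(new_limit):
    """Build an LDIF modify that could be fed to ldapmodify."""
    return (
        "dn: cn=config\n"
        "changetype: modify\n"
        "replace: nsslapd-maxbersize\n"
        f"nsslapd-maxbersize: {new_limit}\n"
    )


if __name__ == "__main__":
    # 21000 uniqueMember values; the other two numbers are placeholders.
    estimate = estimate_entry_bytes(21000, avg_dn_bytes=60, per_value_overhead=4)
    print("estimated bytes:", estimate)
    print("over assumed default:", exceeds_limit(estimate))
    print(maxbersize_ldif(2 * ASSUMED_DEFAULT_MAXBERSIZE))
```

The estimate is deliberately crude. If it comes out anywhere near the limit, raising nsslapd-maxbersize on the consumer and retrying the init is a cheap way to confirm or rule out this explanation.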

>
> serverc is in Johannesburg, on a far slower connection than servera in DFW and serverb in London. It appears there is some kind of timeout that kicks in and causes the replication to suddenly be abandoned without warning.
>
> Does anyone know what timeout is used during replication and how you set this timeout?
>
> Regards,
> Graham