[389-users] more MMR issues

Rich Megginson rmeggins at redhat.com
Fri Nov 13 23:59:13 UTC 2009


Robert Viduya wrote:
> I didn't get a response to my previous request for help and our 
> situation degenerated (we lost 3 of our 4 masters) to the point where 
> I felt we had to do a clean rebuild. We did that late last week into 
> the weekend and had set up 2 masters and assorted hubs and slaves. 
> We used a clean ldif file to import into the first master, so no 
> previous replica IDs were carried over from the previous environment.
>
> We are running directory version 1.2.2 on RHEL5.4, both 64-bit.
>
> Things were running fine until this morning, when one of our masters 
> started reporting errors. We found this in its errorlog:
Are there any errors before that?
>
> [10/Nov/2009:08:56:27 -0500] NSMMReplicationPlugin - 
> multimaster_be_state_change: replica 
> ou=people,dc=gted,dc=gatech,dc=edu is going offline; disabling 
> replication
> [10/Nov/2009:08:59:29 -0500] - WARNING: Import is running with 
> nsslapd-db-private-import-mem on; No other process is allowed to 
> access the database
> [10/Nov/2009:08:59:33 -0500] - ERROR bulk import abandoned
> [10/Nov/2009:08:59:34 -0500] - import people: Aborting all import 
> threads...
> [10/Nov/2009:08:59:42 -0500] - import people: Import threads aborted.
> [10/Nov/2009:08:59:43 -0500] - import people: Closing files...
> [10/Nov/2009:08:59:43 -0500] - import people: Import failed.
> [10/Nov/2009:09:01:51 -0500] NSMMReplicationPlugin - 
> replica_replace_ruv_tombstone: failed to update replication update 
> vector for replica ou=people,dc=gted,dc=gatech,dc=edu: LDAP error - 1
> [10/Nov/2009:09:01:57 -0500] NSMMReplicationPlugin - 
> replica_replace_ruv_tombstone: failed to update replication update 
> vector for replica ou=people,dc=gted,dc=gatech,dc=edu: LDAP error - 1
> [10/Nov/2009:09:02:01 -0500] NSMMReplicationPlugin - 
> replica_replace_ruv_tombstone: failed to update replication update 
> vector for replica ou=people,dc=gted,dc=gatech,dc=edu: LDAP error - 1
> [10/Nov/2009:09:02:21 -0500] NSMMReplicationPlugin - 
> replica_replace_ruv_tombstone: failed to update replication update 
> vector for replica ou=people,dc=gted,dc=gatech,dc=edu: LDAP error - 1
> [10/Nov/2009:09:02:26 -0500] NSMMReplicationPlugin - 
> replica_replace_ruv_tombstone: failed to update replication update 
> vector for replica ou=people,dc=gted,dc=gatech,dc=edu: LDAP error - 1
> [10/Nov/2009:09:02:32 -0500] NSMMReplicationPlugin - 
> replica_replace_ruv_tombstone: failed to update replication update 
> vector for replica ou=people,dc=gted,dc=gatech,dc=edu: LDAP error - 1
>
>
> That last line repeats until we brought the server down. The log 
> _looks_ like someone/something triggered an import operation, but 
> no-one did, on either master.
>
> The errorlog on the other master shows the following:
Are there any errors before this?
>
> [10/Nov/2009:08:39:29 -0500] - repl5_inc_waitfor_async_results timed 
> out waiting for responses: 38 46
> [10/Nov/2009:08:39:54 -0500] NSMMReplicationPlugin - agmt="cn=people 
> rewbell gertrude" (gertrude:636): Warning: unable to receive 
> endReplication extended operation response (Bad parameter to an ldap 
> routine)
What's in the consumer access log at or around [10/Nov/2009:08:39:29 
-0500] and [10/Nov/2009:08:39:54 -0500] ?
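If it helps, here is a rough sketch of one way to pull the relevant lines out of the consumer's access log. It assumes the access log lines start with the same bracketed timestamp format as the error log lines quoted here. The helper names (parse_ts, lines_near) and the 30 second window are made up for illustration and are not part of the server or its tools.

```python
import re
from datetime import datetime, timedelta

# Leading timestamp in the form seen in the quoted logs,
# e.g. "[10/Nov/2009:08:39:29 -0500]". Assumed to match the access log too.
TS_RE = re.compile(r"^\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")
TS_FMT = "%d/%b/%Y:%H:%M:%S %z"


def parse_ts(text):
    """Parse a timestamp such as '10/Nov/2009:08:39:29 -0500'."""
    return datetime.strptime(text, TS_FMT)


def lines_near(lines, targets, window_seconds=30):
    """Yield log lines stamped within window_seconds of any target time.

    lines   -- iterable of log lines (an open file works)
    targets -- timestamp strings in the same format as the log
    """
    window = timedelta(seconds=window_seconds)
    wanted = [parse_ts(t) for t in targets]
    for line in lines:
        m = TS_RE.match(line)
        if not m:
            continue  # skip lines that do not begin with a timestamp
        ts = parse_ts(m.group(1))
        if any(abs(ts - w) <= window for w in wanted):
            yield line.rstrip("\n")
```

To use it, open the access log on gertrude (the path depends on your instance) and pass the file object as lines, with the two timestamps above as targets.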
> [10/Nov/2009:08:40:04 -0500] NSMMReplicationPlugin - agmt="cn=people 
> rewbell gertrude" (gertrude:636): Unable to receive the response for a 
> startReplication extended operation to consumer (Bad parameter to an 
> ldap routine). Will retry later.
> [10/Nov/2009:08:40:08 -0500] NSMMReplicationPlugin - agmt="cn=people 
> rewbell gertrude" (gertrude:636): Unable to receive the response for a 
> startReplication extended operation to consumer (Bad parameter to an 
> ldap routine). Will retry later.
> [10/Nov/2009:08:40:14 -0500] NSMMReplicationPlugin - agmt="cn=people 
> rewbell gertrude" (gertrude:636): Unable to receive the response for a 
> startReplication extended operation to consumer (Bad parameter to an 
> ldap routine). Will retry later.
> [10/Nov/2009:08:40:38 -0500] NSMMReplicationPlugin - agmt="cn=people 
> rewbell gertrude" (gertrude:636): Unable to receive the response for a 
> startReplication extended operation to consumer (Bad parameter to an 
> ldap routine). Will retry later.
> [10/Nov/2009:08:43:05 -0500] NSMMReplicationPlugin - agmt="cn=people 
> rewbell gertrude" (gertrude:636): Unable to receive the response for a 
> startReplication extended operation to consumer (Bad parameter to an 
> ldap routine). Will retry later.
> [10/Nov/2009:08:44:50 -0500] - repl5_inc_waitfor_async_results timed 
> out waiting for responses: 6 8
> [10/Nov/2009:08:47:08 -0500] NSMMReplicationPlugin - agmt="cn=people 
> rewbell gertrude" (gertrude:636): Unable to receive the response for a 
> startReplication extended operation to consumer (Bad parameter to an 
> ldap routine). Will retry later.
> [10/Nov/2009:08:47:08 -0500] NSMMReplicationPlugin - agmt="cn=people 
> rewbell gertrude" (gertrude:636): Incremental protocol: event 
> backoff_timer_expired should not occur in state start_backoff
> [10/Nov/2009:08:47:12 -0500] NSMMReplicationPlugin - agmt="cn=people 
> rewbell gertrude" (gertrude:636): Unable to receive the response for a 
> startReplication extended operation to consumer (Bad parameter to an 
> ldap routine). Will retry later.
> [10/Nov/2009:08:47:18 -0500] NSMMReplicationPlugin - agmt="cn=people 
> rewbell gertrude" (gertrude:636): Incremental update failed and 
> requires administrator action
Was there any administrator action taken here?
> [10/Nov/2009:08:55:01 -0500] - repl5_inc_waitfor_async_results timed 
> out waiting for responses: 13 14
> [10/Nov/2009:08:55:01 -0500] - repl5_inc_waitfor_async_results timed 
> out waiting for responses: 59 81
> [10/Nov/2009:08:55:14 -0500] NSMMReplicationPlugin - agmt="cn=people 
> rewbell gertrude" (gertrude:636): Warning: unable to receive 
> endReplication extended operation response (Bad parameter to an ldap 
> routine)
> [10/Nov/2009:08:55:24 -0500] NSMMReplicationPlugin - agmt="cn=people 
> rewbell gertrude" (gertrude:636): Unable to receive the response for a 
> startReplication extended operation to consumer (Bad parameter to an 
> ldap routine). Will retry later.
> [10/Nov/2009:08:55:28 -0500] NSMMReplicationPlugin - agmt="cn=people 
> rewbell gertrude" (gertrude:636): Unable to receive the response for a 
> startReplication extended operation to consumer (Bad parameter to an 
> ldap routine). Will retry later.
> [10/Nov/2009:08:55:34 -0500] NSMMReplicationPlugin - agmt="cn=people 
> rewbell gertrude" (gertrude:636): Unable to receive the response for a 
> startReplication extended operation to consumer (Bad parameter to an 
> ldap routine). Will retry later.
> [10/Nov/2009:08:55:46 -0500] NSMMReplicationPlugin - agmt="cn=people 
> rewbell gertrude" (gertrude:636): Unable to receive the response for a 
> startReplication extended operation to consumer (Bad parameter to an 
> ldap routine). Will retry later.
> [10/Nov/2009:08:56:10 -0500] NSMMReplicationPlugin - agmt="cn=people 
> rewbell gertrude" (gertrude:636): Unable to receive the response for a 
> startReplication extended operation to consumer (Bad parameter to an 
> ldap routine). Will retry later.
> [10/Nov/2009:08:56:58 -0500] NSMMReplicationPlugin - agmt="cn=people 
> rewbell gertrude" (gertrude:636): Unable to receive the response for a 
> startReplication extended operation to consumer (Bad parameter to an 
> ldap routine). Will retry later.
> [10/Nov/2009:08:58:34 -0500] NSMMReplicationPlugin - agmt="cn=people 
> rewbell gertrude" (gertrude:636): Replication bind with SIMPLE auth 
> resumed
> [10/Nov/2009:09:01:47 -0500] NSMMReplicationPlugin - agmt="cn=people 
> rewbell gertrude" (gertrude:636): Consumer failed to replay change 
> (uniqueid 51dccc08-9efe11de-8efe8516-22c1043e, CSN 
> 4af96f8a000200370000): Operations error. Will retry later.
> [10/Nov/2009:09:01:47 -0500] NSMMReplicationPlugin - agmt="cn=people 
> rewbell gertrude" (gertrude:636): Consumer failed to replay change 
> (uniqueid 5ad5610c-1dd211b2-80b9be51-952a0000, CSN 
> 4af96f8b000000370000): Operations error. Will retry later.
> [10/Nov/2009:09:01:47 -0500] NSMMReplicationPlugin - agmt="cn=people 
> rewbell gertrude" (gertrude:636): Consumer failed to replay change 
> (uniqueid 213cd58e-cd7b11de-b535d108-950067b1, CSN 
> 4af96fcf000000370000): Operations error. Will retry later.
>
> Again, that last line repeats until we brought down the errant server.
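The CSNs in those "Consumer failed to replay change" messages can tell you which replica originated each change. The sketch below splits a CSN string into its fields, assuming the usual 20 hex digit layout: 8 digits of timestamp (seconds since the epoch), 4 of sequence number, 4 of replica ID, and 4 of sub-sequence number. The CSN class and parse_csn function are illustrative names only, not anything shipped with the server.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class CSN(NamedTuple):
    timestamp: datetime  # when the change was originated (UTC)
    seqnum: int          # orders changes made within the same second
    replica_id: int      # ID of the master that originated the change
    subseqnum: int


def parse_csn(text):
    """Split a 20 hex digit CSN string into its four fields.

    Assumed layout: 8 digits time, 4 seqnum, 4 replica ID, 4 subseqnum.
    """
    if len(text) != 20:
        raise ValueError("expected 20 hex digits, got %r" % text)
    seconds = int(text[0:8], 16)
    return CSN(
        timestamp=datetime.fromtimestamp(seconds, tz=timezone.utc),
        seqnum=int(text[8:12], 16),
        replica_id=int(text[12:16], 16),
        subseqnum=int(text[16:20], 16),
    )
```

Under that layout, all three CSNs quoted above carry 0037 in the replica ID field, that is, replica ID 55, which can be compared against the nsDS5ReplicaId configured on each master.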
>
> We've seen this behavior a few times since upgrading. One of our 
> masters somehow thinks it's supposed to do an import and trashes its 
> copy of the data. No person had triggered an import or a 
> supplier->consumer initialization. Are there conditions where the 
> directory server itself would trigger such an operation autonomously?
I've looked at the code again. The only automatic state transition is 
from the init state to the incremental update state (that is, after 
doing a replica init, the supplier will automatically begin sending 
updates). Even if it exits the incremental protocol, it should start 
another incremental update, not a new init.
