[389-users] Repair replication

Fri Apr 20 19:24:44 UTC 2012

Hi All,

I wanted to update this issue as I've made some progress but replication is
still not working as it should.  I've removed the previous communication as
it was getting very long and I began to receive 'message too large'
responses from the list server.  The history of this post can be read in
the archives:
http://lists.fedoraproject.org/pipermail/389-users/2012-April/thread.html.

So, I've tried to simplify my efforts by removing the consumer replication
agreements for now to focus on getting the multi-master replication working
first.  To briefly review, I inherited two multi-master systems (A & B) and
A has been the only system running for many years.

To get replication working I've done the following:

1.  Initialize master B from a nightly backup of master A with:

     ./bak2db bak/directory -n <my_suffix>

     - I see this in the error log:

[20/Apr/2012:10:30:31 -0700] -   Add Attribute readonly Value off

[20/Apr/2012:10:30:31 -0700] -   Add Attribute nsslapd-directory Value
/data/LDAP/slapd-<master A server name>/db/<my_suffix>
[20/Apr/2012:10:30:31 -0700] -   Del Attribute nsslapd-directory Value
/data/LDAP/slapd-<master B server name>/db/<my_suffix>

[20/Apr/2012:10:30:31 -0700] - WARNING!!: current Instance Config is
different from backed up configuration; The backup is restored.
[20/Apr/2012:10:30:31 -0700] - dblayer_restore: Removing staging area
/opt/fedora-ds/slapd-<master B server name>/db/../fribak.

*Is there any problem with the lines above that change the
'nsslapd-directory' attribute from its original, correct master B path to
the path of master A as part of the initialization?  Or is this reset to
the correct path for master B?  If I need to reset some attributes, how can
I view the current nsslapd-directory attribute from the command line with
ldapsearch?*

2. Start slapd daemon on master B.

     From error log:

[20/Apr/2012:10:30:40 -0700] - Fedora-Directory/7.1 B2005.146.2010 starting
up
[20/Apr/2012:10:30:40 -0700] NSMMReplicationPlugin -
replica_check_for_data_reload: Warning: data for replica o=<my_suffix> was
reloaded and it no longer matches the data in the changelog (replica data >
changelog). Recreating the changelog file. This could affect replication
with replica's consumers in which case the consumers should be
reinitialized.

3. Create replication agreements between master A and B on both systems.
4. Run an initialization from the DS console on master B to master A.
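
On the question in step 1 about viewing nsslapd-directory: a minimal sketch,
assuming the attribute lives on the backend instance entry
cn=<backend>,cn=ldbm database,cn=plugins,cn=config, that the bind DN is
"cn=Directory Manager", and that the Mozilla LDAP tools bundled with Fedora
DS are in use ("-w -" prompts for the password).  The helper name
build_ldapsearch_argv and the backend name "userRoot" are hypothetical; the
script only prints the command and does not contact a server.

```python
import shlex


def build_ldapsearch_argv(backend, host="localhost", port=389,
                          bind_dn="cn=Directory Manager",
                          openldap_client=False):
    """Build an ldapsearch command that reads nsslapd-directory for one backend."""
    # Assumption: backend instance entries sit under the ldbm database plugin.
    base_dn = "cn={0},cn=ldbm database,cn=plugins,cn=config".format(backend)
    argv = ["ldapsearch", "-h", host, "-p", str(port), "-D", bind_dn]
    if openldap_client:
        # OpenLDAP's ldapsearch: -x selects a simple bind, -W prompts for the password.
        argv += ["-x", "-W"]
    else:
        # Mozilla LDAP tools: "-w -" prompts for the bind password.
        argv += ["-w", "-"]
    # Base-scope search on the backend entry, returning only nsslapd-directory.
    argv += ["-b", base_dn, "-s", "base", "(objectclass=*)", "nsslapd-directory"]
    return argv


# "userRoot" is a placeholder; substitute the backend name passed to bak2db -n.
argv = build_ldapsearch_argv("userRoot")
print(" ".join(shlex.quote(arg) for arg in argv))
```

Running the printed command on master B should show which database path the
restored configuration currently points at.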

Here is what I see from the logs:

error log on master B:

[20/Apr/2012:10:30:40 -0700] - slapd started.  Listening on All Interfaces
port 389 for LDAP requests
[20/Apr/2012:10:31:05 -0700] NSMMReplicationPlugin - Beginning total update
of replica "agmt="cn=<my_suffix>_to_<master_A>" (<master_A>:389)".
[20/Apr/2012:10:31:11 -0700] NSMMReplicationPlugin - Finished total update
of replica "agmt="cn=<my_suffix>_to_<master_A>" (<master_A>:389)". Sent
1718 entries.

*The above appears to have sent 1718 entries to master A, and 'replication
status' on master B says "incremental update succeeded".*

error log on master A:

[20/Apr/2012:10:30:40 -0700] NSMMReplicationPlugin - conn=1578 op=3
repl="o=my_suffix": Begin incremental protocol
[20/Apr/2012:10:30:40 -0700] NSMMReplicationPlugin - conn=1578 op=3
repl="o=my_suffix": Acquired replica
[20/Apr/2012:10:30:40 -0700] NSMMReplicationPlugin - conn=1578 op=3
repl="o=my_suffix": StartNSDS50ReplicationRequest: response=0 rc=0
[20/Apr/2012:10:30:40 -0700] NSMMReplicationPlugin - conn=1578 op=5
repl="o=my_suffix": Released replica
[20/Apr/2012:10:31:03 -0700] NSMMReplicationPlugin - conn=1579 op=3
repl="o=my_suffix": Begin total protocol
[20/Apr/2012:10:31:03 -0700] NSMMReplicationPlugin - conn=1579 op=3
repl="o=my_suffix": Acquired replica
[20/Apr/2012:10:31:03 -0700] NSMMReplicationPlugin -
multimaster_be_state_change: replica o=my_suffix is going offline;
disabling replication
[20/Apr/2012:10:31:04 -0700] NSMMReplicationPlugin -
agmt="cn=my_suffix_to_master_B" (master_B:389): State: backoff -> backoff
[20/Apr/2012:10:31:04 -0700] NSMMReplicationPlugin -
agmt="cn=my_suffix_to_master_B" (master_B:389): State: backoff -> backoff
[20/Apr/2012:10:31:04 -0700] NSMMReplicationPlugin -
agmt="cn=my_suffix_to_master_B" (master_B:389): No linger to cancel on the
connection
[20/Apr/2012:10:31:04 -0700] NSMMReplicationPlugin -
agmt="cn=my_suffix_to_master_B" (master_B:389): Disconnected from the
consumer
[20/Apr/2012:10:31:05 -0700] NSMMReplicationPlugin -
agmt="cn=my_suffix_to_master_B" (master_B:389): repl5_inc_stop: protocol
stopped after 0 seconds
[20/Apr/2012:10:31:05 -0700] NSMMReplicationPlugin - conn=0 op=0
repl="o=my_suffix": Replica in use locking_purl=conn=1579 id=3
[20/Apr/2012:10:31:05 -0700] NSMMReplicationPlugin -
replica_disable_replication: replica o=my_suffix is acquired
[20/Apr/2012:10:31:05 -0700] - WARNING: Import is running with
nsslapd-db-private-import-mem on; No other process is allowed to access the
database
[20/Apr/2012:10:31:05 -0700] NSMMReplicationPlugin - conn=1579 op=3
repl="o=my_suffix": StartNSDS50ReplicationRequest: response=0 rc=0
[20/Apr/2012:10:31:09 -0700] - import my_suffix: Workers finished; cleaning
up...
[20/Apr/2012:10:31:09 -0700] - import my_suffix: Workers cleaned up.
[20/Apr/2012:10:31:09 -0700] - import my_suffix: Indexing complete.
Post-processing...
[20/Apr/2012:10:31:09 -0700] - import my_suffix: Flushing caches...
[20/Apr/2012:10:31:09 -0700] - import my_suffix: Closing files...
[20/Apr/2012:10:31:10 -0700] - import my_suffix: Import complete.
Processed 1718 entries in 5 seconds. (343.60 entries/sec)

*The above log info looks as if it did 'acquire' replication from master B
and processed 1718 entries.*
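
The comparison between the two logs can be sketched as a small script: pull
the "Sent N entries" count from the supplier log and the "Processed N
entries" count from the consumer log, then check that they agree.  The
function name entry_counts is hypothetical, and the sample strings are the
two log lines quoted in this message (wrapped as they appear here, which is
why the patterns allow any whitespace between words).

```python
import re

# \s+ also matches the line breaks introduced by mail wrapping.
SENT_RE = re.compile(r"Sent\s+(\d+)\s+entries")
PROCESSED_RE = re.compile(r"Processed\s+(\d+)\s+entries")


def entry_counts(supplier_log, consumer_log):
    """Return (sent, processed) entry counts, or None where no count is found."""
    sent = SENT_RE.search(supplier_log)
    processed = PROCESSED_RE.search(consumer_log)
    return (int(sent.group(1)) if sent else None,
            int(processed.group(1)) if processed else None)


master_b_log = '''[20/Apr/2012:10:31:11 -0700] NSMMReplicationPlugin - Finished total update
of replica "agmt="cn=<my_suffix>_to_<master_A>" (<master_A>:389)". Sent
1718 entries.'''

master_a_log = '''[20/Apr/2012:10:31:10 -0700] - import my_suffix: Import complete.
Processed 1718 entries in 5 seconds. (343.60 entries/sec)'''

sent, processed = entry_counts(master_b_log, master_a_log)
print("sent:", sent, "processed:", processed, "match:", sent == processed)
```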

[20/Apr/2012:10:31:10 -0700] NSMMReplicationPlugin -
multimaster_be_state_change: replica o=my_suffix is coming online; enabling
replication
[20/Apr/2012:10:31:10 -0700] NSMMReplicationPlugin -
_replica_configure_ruv: No ruv tombstone found for replica o=my_suffix.
Created a new one
[20/Apr/2012:10:31:10 -0700] NSMMReplicationPlugin - replica_reload_ruv:
Warning: new data for replica o=my_suffix does not match the data in the
changelog.
 Recreating the changelog file. This could affect replication with
replica's  consumers in which case the consumers should be reinitialized.
[20/Apr/2012:10:31:11 -0700] NSMMReplicationPlugin - conn=0 op=0
repl="o=my_suffix": Released replica
[20/Apr/2012:10:31:11 -0700] NSMMReplicationPlugin -
replica_enable_replication: replica o=my_suffix is relinquished
[20/Apr/2012:10:31:11 -0700] NSMMReplicationPlugin -
agmt="cn=my_suffix_to_master_B" (master_B:389): No linger to cancel on the
connection
[20/Apr/2012:10:31:11 -0700] NSMMReplicationPlugin -
agmt="cn=my_suffix_to_master_B" (master_B:389): Disconnected from the
consumer
[20/Apr/2012:10:31:11 -0700] NSMMReplicationPlugin -
agmt="cn=my_suffix_to_master_B" (master_B:389): State: start ->
ready_to_acquire_replica
[20/Apr/2012:10:31:11 -0700] NSMMReplicationPlugin - changelog program -
cl5DeleteDBSync: file for replica at (o=my_suffix) not found
[20/Apr/2012:10:31:11 -0700] NSMMReplicationPlugin -
agmt="cn=my_suffix_to_master_B" (master_B:389): State:
ready_to_acquire_replica -> wait_for_changes
[20/Apr/2012:10:31:11 -0700] NSMMReplicationPlugin - changelog program -
_cl5NewDBFile: semaphore
/opt/fedora-ds/slapd-<master_A>/changelogdb/1da9fe82-1dd211b2-80bc8f56-47cc0000.sema
[20/Apr/2012:10:31:11 -0700] NSMMReplicationPlugin - changelog program -
_cl5NewDBFile: maxConcurrentWrites=2
[20/Apr/2012:10:31:11 -0700] NSMMReplicationPlugin - changelog program -
_cl5GetEntryCount: 0 changes for replica 1da9fe82-1dd211b2-80bc8f56-47cc0000
[20/Apr/2012:10:31:11 -0700] NSMMReplicationPlugin - conn=1579 op=1722
repl="o=my_suffix": Replica not in use

*The above is the last of the log entries referring to this replication.
Is there anything odd?*

The replication agreements are set to 'always keep directories in sync',
and since this manual initialization from the console the logs have gone
back to showing the following (every 5 min or so):

master A error log:

Unable to acquire replica: permission denied. The bind dn "cn=replication
manager,cn=config" does not have permission to supply replication updates
to the replica. Will retry later.

master B error log:

Unable to acquire replica: error: permission denied
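
The master A message names the bind DN that was refused.  A minimal sketch
of one check, assuming (this is not confirmed by the logs above) that a
replica accepts updates only from the DNs listed in nsDS5ReplicaBindDN on
its replica configuration entry: compare the bind DN used by the agreement
with that list.  The function names are hypothetical, the DN normalization
is deliberately naive (it ignores escaped commas), and the values in
replica_bind_dns are made up for illustration rather than read from either
server.

```python
def normalize_dn(dn):
    """Lowercase a DN and strip whitespace around each RDN (naive: no escape handling)."""
    return ",".join(part.strip().lower() for part in dn.split(","))


def supplier_allowed(supplier_bind_dn, replica_bind_dns):
    """Return True if the supplier's bind DN appears in the replica's allowed list."""
    wanted = normalize_dn(supplier_bind_dn)
    return any(normalize_dn(dn) == wanted for dn in replica_bind_dns)


# Bind DN taken from the master A error message.
supplier_dn = "cn=replication manager,cn=config"

# Hypothetical nsDS5ReplicaBindDN values, as they might be read from master B.
replica_bind_dns = ["cn=Replication Manager, cn=config"]

print(supplier_allowed(supplier_dn, replica_bind_dns))
print(supplier_allowed(supplier_dn, ["cn=some other user,cn=config"]))
```

If the real values on the consumer do not include the supplier's bind DN,
that mismatch would be worth ruling out before looking at the database path.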

*It seems as if the attempt to sync between master A & B is always from A
to B.  Is this normal?  Could this have anything to do with the
'nsslapd-directory' attribute?*

As always, any help is greatly appreciated.

Thanks in advance,

Herb