[389-users] replication from 1.2.8.3 to 1.2.10.4

Wed Jul 11 23:17:32 UTC 2012

On 07/11/2012 11:12 AM, Robert Viduya wrote:
> Is replication from a 1.2.8.3 server to a 1.2.10.4 server known to work or not work?  We're having changelog issues.
>
> Background:
>
> We have an ldap service consisting of 3 masters, 2 hubs and 16 slaves.  All were running 1.2.8.3 since last summer with no issues.  This summer, we decided to bring them all up to the latest stable release, 1.2.10.4.  We can't afford a lot of downtime for the service as a whole, but with the redundancy level we have, we can take down a machine or two at a time without user impact.
>
> We started with one slave, did a clean install of 1.2.10.4 on it, set up replication agreements from our 1.2.8.3 hubs to it and watched it for a week or so.  Everything looked fine, so we started rolling through the rest of the slave servers, got them all running 1.2.10.4 and so far haven't seen any problems.
>
> A couple of days ago, I did one of our two hubs.  The first time I bring up the daemon after doing the initial import of our ldap data everything seems fine.  However, we start seeing errors the first time we restart:
>
> [11/Jul/2012:10:43:58 -0400] - slapd shutting down - signaling operation threads
> [11/Jul/2012:10:43:58 -0400] - slapd shutting down - waiting for 2 threads to terminate
> [11/Jul/2012:10:44:01 -0400] - slapd shutting down - closing down internal subsystems and plugins
> [11/Jul/2012:10:44:02 -0400] - Waiting for 4 database threads to stop
> [11/Jul/2012:10:44:04 -0400] - All database threads now stopped
> [11/Jul/2012:10:44:04 -0400] - slapd stopped.
> [11/Jul/2012:10:45:00 -0400] - 389-Directory/1.2.10.4 B2012.101.2023 starting up
> [11/Jul/2012:10:45:07 -0400] NSMMReplicationPlugin - ruv_compare_ruv: the max CSN [4ffdca7e000000330000] from RUV [changelog max RUV] is larger than the max CSN [4ffb605d000000330000] from RUV [database RUV] for element [{replica 51} 4ffb602b000300330000 4ffdca7e000000330000]
> [11/Jul/2012:10:45:07 -0400] NSMMReplicationPlugin - replica_check_for_data_reload: Warning: data for replica ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu does not match the data in the changelog. Recreating the changelog file. This could affect replication with replica's consumers in which case the consumers should be reinitialized.
> [11/Jul/2012:10:45:07 -0400] NSMMReplicationPlugin - ruv_compare_ruv: the max CSN [4ffdca70000000340000] from RUV [changelog max RUV] is larger than the max CSN [4ffb7098000100340000] from RUV [database RUV] for element [{replica 52} 4ffb6ea2000000340000 4ffdca70000000340000]
> [11/Jul/2012:10:45:07 -0400] NSMMReplicationPlugin - replica_check_for_data_reload: Warning: data for replica ou=people,dc=gted,dc=gatech,dc=edu does not match the data in the changelog. Recreating the changelog file. This could affect replication with replica's consumers in which case the consumers should be reinitialized.
> [11/Jul/2012:10:45:08 -0400] - slapd started.  Listening on All Interfaces port 389 for LDAP requests
> [11/Jul/2012:10:45:08 -0400] - Listening on All Interfaces port 636 for LDAPS requests

The difference is that hubs maintain a changelog but dedicated consumers do not, which would explain why the upgraded slaves did not show these errors.

Were either of the replicas with ID 51 or 52 removed/deleted at some 
point in the past?
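
For reading the ruv_compare_ruv messages above, it may help to split the CSNs into their fields. The sketch below assumes the usual 20-hex-digit layout (8 digits of timestamp, 4 of sequence number, 4 of replica ID, 4 of subsequence number). The function and field names are illustrative and not taken from the server code. It shows that the ...0033... and ...0034... CSNs in the log belong to replica IDs 51 and 52, and that the changelog's max CSN sorts after the database's.

```python
from collections import namedtuple

# Field names are illustrative. Assumed layout of a CSN string:
# 8 hex digits timestamp, 4 sequence number, 4 replica ID, 4 subsequence number.
CSN = namedtuple("CSN", "timestamp seqnum rid subseqnum")


def parse_csn(text):
    """Split a 20-character hex CSN string into its four numeric fields."""
    if len(text) != 20:
        raise ValueError("expected 20 hex digits, got %r" % (text,))
    return CSN(
        timestamp=int(text[0:8], 16),
        seqnum=int(text[8:12], 16),
        rid=int(text[12:16], 16),
        subseqnum=int(text[16:20], 16),
    )


def changelog_is_ahead(changelog_max, database_max):
    """True if the changelog's max CSN sorts after the database's max CSN."""
    # Tuples compare field by field: timestamp first, then seqnum, and so on.
    return parse_csn(changelog_max) > parse_csn(database_max)


if __name__ == "__main__":
    # CSN pair taken from the first "replica 51" message in the log above.
    cl = parse_csn("4ffdca7e000000330000")
    db = parse_csn("4ffb605d000000330000")
    print("replica id:", cl.rid)
    print("changelog ahead of database:", cl > db)
```
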

>
> The _second_ restart is even worse, we get more error messages (see below) and then the daemon dies

Dies?  Exits?  Crashes?  Core files?  Do you see any ns-slapd segfault 
messages in /var/log/messages?  When you restart the directory server 
after it dies, do you see "Disorderly Shutdown" messages in the 
directory server errors log?
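
A quick way to check both logs is to scan them for those strings. This is only a rough sketch: the errors log path and the INSTANCE placeholder are assumptions to be adjusted for your installation, and the search patterns are simply the phrases mentioned above.

```python
import re

# Hypothetical paths: substitute your own instance name and log locations.
ERRORS_LOG = "/var/log/dirsrv/slapd-INSTANCE/errors"
SYSLOG = "/var/log/messages"


def find_lines(lines, pattern):
    """Return the lines matching the regular expression (case-insensitive)."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [line.rstrip("\n") for line in lines if rx.search(line)]


def scan_file(path, pattern):
    """Scan one file; return matching lines, or an empty list if unreadable."""
    try:
        with open(path, errors="replace") as fh:
            return find_lines(fh, pattern)
    except OSError as exc:
        print("could not read %s: %s" % (path, exc))
        return []


if __name__ == "__main__":
    for hit in scan_file(ERRORS_LOG, r"disorderly shutdown"):
        print("errors log:", hit)
    for hit in scan_file(SYSLOG, r"ns-slapd.*segfault"):
        print("syslog:", hit)
```
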

> after it says it's listening on its ports:
>
> [11/Jul/2012:10:45:32 -0400] - slapd shutting down - signaling operation threads
> [11/Jul/2012:10:45:32 -0400] - slapd shutting down - waiting for 29 threads to terminate
> [11/Jul/2012:10:45:34 -0400] - slapd shutting down - closing down internal subsystems and plugins
> [11/Jul/2012:10:45:35 -0400] - Waiting for 4 database threads to stop
> [11/Jul/2012:10:45:36 -0400] - All database threads now stopped
> [11/Jul/2012:10:45:36 -0400] - slapd stopped.
> [11/Jul/2012:10:46:11 -0400] - 389-Directory/1.2.10.4 B2012.101.2023 starting up
> [11/Jul/2012:10:46:11 -0400] NSMMReplicationPlugin - ruv_compare_ruv: RUV [changelog max RUV] does not contain element [{replica 68 ldap://gtedm3.iam.gatech.edu:389} 4be339e6000000440000 4ffdc9a1000000440000] which is present in RUV [database RUV]
> [11/Jul/2012:10:46:11 -0400] NSMMReplicationPlugin - ruv_compare_ruv: RUV [changelog max RUV] does not contain element [{replica 71 ldap://gtedm4.iam.gatech.edu:389} 4be6031e000000470000 4ffdc9a8000000470000] which is present in RUV [database RUV]
> [11/Jul/2012:10:46:11 -0400] NSMMReplicationPlugin - ruv_compare_ruv: the max CSN [4ffb62a2000100330000] from RUV [changelog max RUV] is larger than the max CSN [4ffb605d000000330000] from RUV [database RUV] for element [{replica 51} 4ffb605d000000330000 4ffb62a2000100330000]
> [11/Jul/2012:10:46:11 -0400] NSMMReplicationPlugin - replica_check_for_data_reload: Warning: data for replica ou=accounts,ou=gtaccounts,ou=departments,dc=gted,dc=gatech,dc=edu does not match the data in the changelog. Recreating the changelog file. This could affect replication with replica's consumers in which case the consumers should be reinitialized.
> [11/Jul/2012:10:46:11 -0400] NSMMReplicationPlugin - ruv_compare_ruv: RUV [changelog max RUV] does not contain element [{replica 69 ldap://gtedm3.iam.gatech.edu:389} 4be339e4000000450000 4ffdc9a2000000450000] which is present in RUV [database RUV]
> [11/Jul/2012:10:46:11 -0400] NSMMReplicationPlugin - ruv_compare_ruv: RUV [changelog max RUV] does not contain element [{replica 72 ldap://gtedm4.iam.gatech.edu:389} 4be6031d000000480000 4ffdc9a9000300480000] which is present in RUV [database RUV]
> [11/Jul/2012:10:46:11 -0400] NSMMReplicationPlugin - ruv_compare_ruv: the max CSN [4ffb78bc000000340000] from RUV [changelog max RUV] is larger than the max CSN [4ffb7098000100340000] from RUV [database RUV] for element [{replica 52} 4ffb7098000100340000 4ffb78bc000000340000]
> [11/Jul/2012:10:46:11 -0400] NSMMReplicationPlugin - replica_check_for_data_reload: Warning: data for replica ou=people,dc=gted,dc=gatech,dc=edu does not match the data in the changelog. Recreating the changelog file. This could affect replication with replica's consumers in which case the consumers should be reinitialized.
> [11/Jul/2012:10:46:11 -0400] - slapd started.  Listening on All Interfaces port 389 for LDAP requests
> [11/Jul/2012:10:46:11 -0400] - Listening on All Interfaces port 636 for LDAPS requests
>
> At this point, the only way I've found to get it back is to clean out the changelog and db directories and re-import the ldap data from scratch.  Essentially we can't restart without having to re-import.  I've done this a couple of times already and it's entirely reproducible.

So every time you shut down the server and attempt to restart it, it 
doesn't start until you re-import?

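
To compare restarts, it can help to reduce the errors log to a list of which replica IDs are missing from the changelog RUV and which have a changelog max CSN ahead of the database. The sketch below is written only against the two message shapes quoted in this thread; other server versions may word these messages differently, and the function name is made up for illustration.

```python
import re
import sys

# Patterns match only the two ruv_compare_ruv message shapes quoted in this thread.
MISSING = re.compile(
    r"ruv_compare_ruv: RUV \[(?P<ruv>[^\]]+)\] does not contain element "
    r"\[\{replica (?P<rid>\d+)(?: (?P<url>\S+))?\}"
)
AHEAD = re.compile(
    r"ruv_compare_ruv: the max CSN \[(?P<a>[0-9a-f]{20})\] from RUV \[[^\]]+\] "
    r"is larger than the max CSN \[(?P<b>[0-9a-f]{20})\] from RUV \[[^\]]+\] "
    r"for element \[\{replica (?P<rid>\d+)"
)


def summarize(lines):
    """Return (missing, ahead) lists of replica mismatches found in log lines."""
    missing, ahead = [], []
    for line in lines:
        m = MISSING.search(line)
        if m:
            # (replica ID, replica URL or None, name of the RUV lacking the element)
            missing.append((int(m.group("rid")), m.group("url"), m.group("ruv")))
            continue
        m = AHEAD.search(line)
        if m:
            # (replica ID, larger max CSN, smaller max CSN)
            ahead.append((int(m.group("rid")), m.group("a"), m.group("b")))
    return missing, ahead


if __name__ == "__main__":
    # Usage: python this_script.py < errors
    missing, ahead = summarize(sys.stdin)
    for rid, url, ruv in missing:
        print("replica %d (%s) missing from %s" % (rid, url, ruv))
    for rid, a, b in ahead:
        print("replica %d: max CSN %s is ahead of %s" % (rid, a, b))
```
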
>
> I've checked and ensured that there are no obsolete masters that need to be CLEANRUVed.  I've also noticed that the errors _seem_ to be only affecting our second and third suffixes.  We have three suffixes defined, but I haven't seen any error messages for the first one.
>
> Has anyone seen anything like this?  We're not sure if this is a general 1.2.10.4 issue or if it only occurs when replicating from 1.2.8.3 to 1.2.10.4.  If it's the former, we cannot proceed with getting the rest of the servers up to 1.2.10.4.  If it's the latter, then we need to expedite getting everything up to 1.2.10.4.

These do not seem like issues related to replicating from 1.2.8 to 
1.2.10.  Have you tried a simple test of setting up two 1.2.10 masters 
and attempting to replicate your data between them?

> --
> 389 users mailing list
> 389-users at lists.fedoraproject.org
> https://admin.fedoraproject.org/mailman/listinfo/389-users