Hello,
I'm dealing with the update from 389-ds 1.3.9.1 to 1.4.3.22 on three multimaster servers:
- srv1
- srv2
- srv3
srv1 replicates to srv2 and srv3. srv2 replicates to srv1 and srv3. srv3 replicates to srv1 and srv2.
Suppose I reinstall srv3 with 389-ds 1.4.3.22 and initialize it from srv1. This succeeds as expected, and the replica is fine.
Then I reinstall srv2 and initialize it from srv3. This also succeeds as expected, but right at the end of the initialization, the agreement from srv3 to srv1 stops working.
In the console, the status of the agreement from srv3 to srv1 shows "Error (18) Can't acquire replica (Incremental update transient warning. Backing off, will retry update later.)". In the logs I see errors like:
repl_plugin_name_cl - agmt="srv3 to srv1" (srv1:389): CSN 596c6868000075320000 not found, we aren't as up to date, or we purged
clcache_load_buffer - Can't locate CSN 596c6868000075320000 in the changelog (DB rc=-30988). If replication stops, the consumer may need to be reinitialized.
The changelog Maximum Age is "7d", equal to the default nsDS5ReplicaPurgeDelay for the suffix.
This always happens, for every suffix.
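(A quick way to double-check the purge delay for a suffix is to read it from the replica entry; the bind credentials, supplier host, and suffix below are placeholders:

ldapsearch -x -D "cn=Directory Manager" -W -H ldap://<supplier> \
    -b "cn=replica,cn=<suffix>,cn=mapping tree,cn=config" -s base \
    nsDS5ReplicaPurgeDelay

If the attribute is absent, the 7-day default applies.)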
To resolve the issue I have to re-initialize srv1 from srv3 again and, once that initialization finishes, re-initialize srv2 from srv3.
To summarize:
1) install srv2: OK
2) initialize srv1 to srv3: OK
3) initialize srv2 to srv3: the agreement srv1 to srv3 stops working
4) initialize srv1 to srv3 again
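(For reference, a re-initialization like the one in step 4 can be triggered on the supplier by writing nsds5BeginReplicaRefresh to the agreement entry; the supplier host, agreement name, suffix, and credentials below are placeholders:

ldapmodify -x -D "cn=Directory Manager" -W -H ldap://<supplier> <<EOF
dn: cn=<agreement name>,cn=replica,cn=<suffix>,cn=mapping tree,cn=config
changetype: modify
replace: nsds5BeginReplicaRefresh
nsds5BeginReplicaRefresh: start
EOF

The attribute disappears from the entry once the total update completes.)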
I would like to know how to configure the Directory Server in order to avoid the above scenario. The problem is very similar to
https://access.redhat.com/solutions/2690611
but that document says that the problem was already fixed in 389-ds-base-1.3.5.10-15.el7_3 or later.
Could you help me?
Thank you very much Marco
Does srv2 run 1.4.3.22?
You could try deleting the BDB region files: first stop the LDAP service, then delete the files /var/lib/dirsrv/slapd-xx/db/__db.00*
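A minimal sketch of that procedure for a systemd-managed instance (replace "xx" with the real instance name):

systemctl stop dirsrv@xx
# remove only the shared-memory region files, not the database files
rm -f /var/lib/dirsrv/slapd-xx/db/__db.00*
systemctl start dirsrv@xx

The region files are recreated automatically on the next start.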
Or try a more recent release, 1.4.4.15 or 1.4.4.16.
Thanks, M.
Does srv2 run 1.4.3.22?
Yes, it is a fresh installation.
You could try deleting the BDB region files: first stop the LDAP service, then delete the files /var/lib/dirsrv/slapd-xx/db/__db.00*
Uhm... I keep these files in a RAM filesystem. srv2 and srv3 are fresh installations and have just received their first initialization.
Or try a more recent release, 1.4.4.15 or 1.4.4.16.
I'll see if I have a chance to try.
Thank you very much. Kind Regards, Marco
Even if I fix the above issue as described, when I restart a replica the suppliers stop sending updates and complain with:
"The remote replica has a different database generation ID than the local database. You may have to reinitialize the remote replica, or the local replica."
So, after a restart, I have to reinitialize the restarted server in order for it to receive updates :(
If I reboot the replica instead of just restarting dirsrv (so the __db.00* files in /dev/shm get deleted), the problem is still the same.
Kind Regards Marco
Gasp, I suspect the problem is here. In the agreements I see:
dn: cn=it 2--\3E1,cn=replica,cn=c\3Dit,cn=mapping tree,cn=config
objectClass: top
objectClass: nsds5replicationagreement
cn: it 2-->1
nsDS5ReplicaRoot: c=it
description: it 2-->1
nsDS5ReplicaHost: srv1.example.com
nsDS5ReplicaPort: 389
nsDS5ReplicaBindMethod: simple
nsDS5ReplicaTransportInfo: LDAP
nsDS5ReplicaBindDN: cn=replication manager,cn=config
nsds50ruv: {replicageneration} 60704f730000c3500000
nsds50ruv: {replica 50001 ldap://srv1.example.com:389} 607424dd0000c3510000 60ba18fb0000c3510000
nsds50ruv: {replica 50000 ldap://srv.example.com:389} 6074264a0000c3500000 60ba190f0000c3500000
nsds50ruv: {replica 50002 ldap://srv2.example.com:389} 607426410000c3520000 60ba19050000c3520000
nsruvReplicaLastModified: {replica 50001 ldap://srv1.example.com:389} 00000000
nsruvReplicaLastModified: {replica 50000 ldap://srv.example.com:389} 00000000
nsruvReplicaLastModified: {replica 50002 ldap://srv2.example.com:389} 00000000
nsds5replicareapactive: 0
nsds5replicaLastUpdateStart: 20210604124542Z
nsds5replicaLastUpdateEnd: 20210604124542Z
nsds5replicaChangesSentSinceStartup:: NTAwMDI6NC8wIA==
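For reference, agreement entries like this one, together with their cached RUVs (nsds50ruv), can be listed from any supplier with an ordinary subtree search over the mapping tree; the bind credentials and host below are placeholders:

ldapsearch -x -D "cn=Directory Manager" -W -H ldap://srv3.example.com \
    -b "cn=mapping tree,cn=config" "(objectClass=nsds5replicationagreement)" \
    nsds50ruv nsruvReplicaLastModified nsDS5ReplicaHost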
Replica ID 50000 corresponds to the server srv3.example.com, the first host installed in the set of three multimaster servers. The load balancer host is srv.example.com. As suggested by dscreate, I put the balancer host in the "full_machine_name" parameter on all LDAP servers. For a reason I don't understand, the full_machine_name (the load balancer host) has been written into the RUV instead of the FQDN of the machine actually running the dirsrv instance; in this case, srv.example.com instead of srv3.example.com.
I suspect that if I reinstall all servers with their own hostname in "full_machine_name", I will resolve my issue.
Any idea?
Thank you very much
On 6/7/21 9:39 AM, Marco Favero wrote:
Replica ID 50000 corresponds to the server srv3.example.com, the first host installed in the set of three multimaster servers. The load balancer host is srv.example.com. As suggested by dscreate, I put the balancer host in the "full_machine_name" parameter on all LDAP servers. For a reason I don't understand, the full_machine_name (the load balancer host) has been written into the RUV instead of the FQDN of the machine actually running the dirsrv instance; in this case, srv.example.com instead of srv3.example.com.
Hi Marco,
The hostname in the RUV (nsds50ruv) comes from the 'nsslapd-localhost' attribute in the 'cn=config' entry (dse.ldif). I am unsure of the impact of this erroneous value (srv.example.com instead of srv3.example.com) in the RUV.
IMHO what matters for the RA to start a replication session is nsDS5ReplicaHost and the replicageneration. Of course it would be better if the hosts in the RUV elements were valid, but I am not sure that explains why srv1->srv3 stopped working.
If you can reproduce the problem, I would recommend enabling replication logging (nsslapd-errorlog-level: 8192) on both sides (srv1 and srv3) and reproducing the failure of the RA. Then isolate, from the access logs and error logs, the replication session that fails.
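For example, something along these lines on each of srv1 and srv3 (nsslapd-errorlog-level lives on cn=config; the Directory Manager credentials are placeholders):

ldapmodify -x -D "cn=Directory Manager" -W -H ldap://srv1.example.com <<EOF
dn: cn=config
changetype: modify
replace: nsslapd-errorlog-level
nsslapd-errorlog-level: 8192
EOF

Remember to set it back to its previous value afterwards; replication logging is very verbose.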
regards thierry
Hi Thierry,
Thank you for these hints. I resolved the issue by setting "full_machine_name" to the FQDN of the host.
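For anyone hitting the same thing: full_machine_name lives in the [general] section of the dscreate INF file, so a corrected install uses something along these lines (instance name, password, and file path are placeholders, not my real values):

[general]
full_machine_name = srv3.example.com

[slapd]
instance_name = srv3
root_password = <directory manager password>

followed by: dscreate from-file /root/srv3.inf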
But I still have another issue very similar to https://access.redhat.com/solutions/2690611
Since I realized that I actually have RHDS 11.4, I will try asking that support channel too.
Thank you Warm Regards Marco
On 6/17/21 10:55 AM, Marco Favero wrote:
Hi Thierry,
Thank you for these hints. I resolved the issue by setting "full_machine_name" to the FQDN of the host.
Hi Marco,
Good to know you fixed the issue. If I read you correctly, you fixed it by setting nsDS5ReplicaHost to the FQDN of the consumer host in the supplier->consumer replication agreement. What is surprising is that it was working before with a non-FQDN and suddenly stopped working.
But I still have another issue very similar to https://access.redhat.com/solutions/2690611
With recent versions, this problem is either transient (a supplier does not know a CSN shown by the consumer, but another supplier that does know it will eventually update the consumer) or permanent (the consumer was offline for longer than the changelog max age), in which case you may need to reinit the consumer.
regards thierry
On 6/17/21 10:55 AM, Marco Favero wrote:
Hi Marco,
Good to know you fixed the issue. If I read you correctly, you fixed it by setting nsDS5ReplicaHost to the FQDN of the consumer host in the supplier->consumer replication agreement. What is surprising is that it was working before with a non-FQDN and suddenly stopped working.
Hi Thierry, not really, sorry, maybe I didn't explain it well. I set "full_machine_name" in dscreate to the FQDN of the host running 389-ds instead of the FQDN of the balancer. It corresponds to the nsslapd-localhost parameter, I suppose.
With recent versions, this problem is either transient (a supplier does not know a CSN shown by the consumer, but another supplier that does know it will eventually update the consumer) or permanent (the consumer was offline for longer than the changelog max age), in which case you may need to reinit the consumer.
I still have this issue. Are there known conditions that trigger it? It's as you describe: the only way out of that situation is re-initialization.
I have three servers in multimaster with each other:
- rh1
- rh2
- rh5
rh1 also has a scheduled agreement to dr-rh1, so dr-rh1 is a consumer of rh1.
Everything is working fine. But when I initialize rh1 from rh5, the replica rh1 --> dr-rh1 stops working and says "Error (18) Can't acquire replica (Incremental update transient warning. Backing off, will retry update later.)". The log complains that it can't find a CSN. All the other replicas are fine.
So I have to reinitialize rh1 --> dr-rh1.
Thank you very much Warm Regards Marco
On 6/17/21 12:58 PM, Marco Favero wrote:
I have three servers in multimaster with each other:
- rh1
- rh2
- rh5
rh1 also has a scheduled agreement to dr-rh1, so dr-rh1 is a consumer of rh1.
Everything is working fine. But when I initialize rh1 from rh5, the replica rh1 --> dr-rh1 stops working and says "Error (18) Can't acquire replica (Incremental update transient warning. Backing off, will retry update later.)". The log complains that it can't find a CSN. All the other replicas are fine.
When you do the reinit rh5->rh1, the changelog of rh1 gets reset. If for some reason dr-rh1 was behind rh5, it is normal that rh1 can no longer update dr-rh1. Did you set up an RA rh5->dr-rh1?
thanks thierry
Ah, I don't have an RA rh5-->dr-rh1.
So, I could set up an RA from every multimaster to dr-rh1 to avoid this kind of problem.
I'm not sure I understand, though. I do have a real-time RA from rh5 to rh1, and from rh1 to rh5. So, if I initialize rh1 from rh5, rh1 should still replicate to dr-rh1, because rh5 is always in sync with rh1...
Thank you Marco
On 6/17/21 2:11 PM, Marco Favero wrote:
Ah, I don't have an RA rh5-->dr-rh1.
So, I could set up an RA from every multimaster to dr-rh1 to avoid this kind of problem.
I'm not sure I understand, though. I do have a real-time RA from rh5 to rh1, and from rh1 to rh5. So, if I initialize rh1 from rh5, rh1 should still replicate to dr-rh1, because rh5 is always in sync with rh1...
Administrative tasks are sensitive, and some side effects can impact replication. When you reinit rh5->rh1, rh1 can still replicate to dr-rh1. However, a side effect of the reinit is that it clears the changelog of rh1. So if dr-rh1 is behind, then rh1, having lost the history of old updates (changelog reset), is able to connect to dr-rh1 but can no longer find, in its cleared changelog, the old update it should start replication from.
The workaround is to give rh5 a chance to directly update dr-rh1.
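A sketch of such an agreement on rh5, reusing the attribute names from the agreement shown earlier in this thread (the agreement DN, suffix, host names, and credentials are placeholders for the real environment):

ldapadd -x -D "cn=Directory Manager" -W -H ldap://rh5.example.com <<EOF
dn: cn=rh5 to dr-rh1,cn=replica,cn=<suffix>,cn=mapping tree,cn=config
objectClass: top
objectClass: nsds5replicationagreement
cn: rh5 to dr-rh1
description: rh5 to dr-rh1
nsDS5ReplicaRoot: <suffix>
nsDS5ReplicaHost: dr-rh1.example.com
nsDS5ReplicaPort: 389
nsDS5ReplicaBindDN: cn=replication manager,cn=config
nsDS5ReplicaCredentials: <replication manager password>
nsDS5ReplicaBindMethod: simple
nsDS5ReplicaTransportInfo: LDAP
EOF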
thanks thierry