Hello,
Some basic questions about the changelog:
1. What’s the location of the changelog where I can look up a CSN?
2. How do I see the setting for the max life of a CSN?
3. How do I view a particular CSN (i.e. its contents)?
Thanks, Sergei
On 11/03/2017 11:48 AM, Sergei Gerasenko wrote:
Hello,
Some basic questions about the changelog:
- What’s the location of the changelog where I can look up a CSN?
Typically it's something like:
/var/lib/dirsrv/slapd-YOUR_INSTANCE/changelogdb
To look at the replication changelog you need to use the cli tool "cl-dump.pl"
https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&a...
- How do I see the setting for the max life of a CSN?
There is no "max life" of a csn.
Replication purging and changelog trimming use the CSNs in the RUVs to determine what can be removed. The admin guide covers both in more detail.
- How do I view a particular CSN (i.e. its contents)?
csn:
59f9e547000200010000
Breaks down like this:
59f9e547 0002 0001 0000
The first 8 hex digits are the timestamp: 59f9e547 --> 1509549383 seconds since the epoch
the next 4 are the sequence number (0002)
the next 4 are the replica ID (0001)
and the last 4 are the subsequence number (0000)
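The breakdown above can be sketched in a few lines of Python. This is only an illustration of the field layout described in this message, not a tool shipped with 389 Directory Server; the names `CSN` and `parse_csn` are made up for the example.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class CSN(NamedTuple):
    timestamp: int   # seconds since the Unix epoch
    seqnum: int      # sequence number
    replica_id: int  # ID of the replica that originated the change
    subseqnum: int   # subsequence number


def parse_csn(csn: str) -> CSN:
    """Split a 20-hex-digit CSN string into its four fields (8 + 4 + 4 + 4)."""
    if len(csn) != 20:
        raise ValueError(f"expected 20 hex digits, got {len(csn)}")
    return CSN(
        timestamp=int(csn[0:8], 16),
        seqnum=int(csn[8:12], 16),
        replica_id=int(csn[12:16], 16),
        subseqnum=int(csn[16:20], 16),
    )


csn = parse_csn("59f9e547000200010000")
print(csn)
# Render the timestamp field as a UTC date for readability.
print(datetime.fromtimestamp(csn.timestamp, tz=timezone.utc).isoformat())
```

For the CSN above, the timestamp field decodes to 2017-11-01 15:16:23 UTC.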
Thanks, Sergei
389-users mailing list -- 389-users@lists.fedoraproject.org To unsubscribe send an email to 389-users-leave@lists.fedoraproject.org
To look at the replication changelog you need to use the cli tool "cl-dump.pl"
https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0ahUKEwit5tqk5qLXAhVK7iYKHaacB40QFggmMAA&url=https://access.redhat.com/documentation/en-US/Red_Hat_Directory_Server/8.0/html/Configuration_and_Command_Reference/Configuration_Command_File_Reference-cl_dump.pl_Dump_and_decode_changelog.html&usg=AOvVaw0EeBRb66mKeGlybKkp0z1O
Ok, thank you
- How do I see the setting for the max life of a CSN?
There is no "max life" of a csn.
Ok, what brought this up is that about every week, one of the machines in our environment breaks the replication with messages like this:
[01/Nov/2017:17:12:52.815891904 +0000] agmt="cn=meToXXXX" - Can't locate CSN 59f9d98a000000760000 in the changelog (DB rc=-30988). If replication stops, the consumer may need to be reinitialized.
[01/Nov/2017:17:12:52.820619690 +0000] NSMMReplicationPlugin - changelog program - agmt="cn=meXXXX": CSN 59f9d98a000000760000 not found, we aren't as up to date, or we purged
[01/Nov/2017:17:12:52.828626595 +0000] NSMMReplicationPlugin - agmt="cn=meToXXXX": Data required to update replica has been purged from the changelog. The replica must be reinitialized.
So it made me think that perhaps the CSN record is removed too early? The ’76’ in the CSN is the machine having the problem. What do you think could cause problems of this kind?
Replication purging and changelog trimming use the CSNs in the RUVs to determine what can be removed. The admin guide covers both in more detail.
- How do I view a particular CSN (i.e. its contents)?
csn:
59f9e547000200010000
Breaks down like this:
59f9e547 0002 0001 0000
Yep, found that info previously, but thank you still!
On 11/03/2017 12:28 PM, Sergei Gerasenko wrote:
Ok, what brought this up is that about every week
Ahh yes, this is the default replication purge interval (7 days)
https://access.redhat.com/documentation/en-US/Red_Hat_Directory_Server/8.1/h...
Look for nsDS5ReplicaPurgeDelay
It could also be changelog trimming:
http://www.port389.org/docs/389ds/FAQ/changelog-trimming.html
So what this is telling me is that one of your replication agreements was over a week behind from the other replicas (not good). Was that agreement disabled for a while, and then enabled, for some reason?
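As a rough cross-check, the age of the CSN named in the error can be compared with the 7-day purge interval. The sketch below assumes the first 8 hex digits are the creation time in epoch seconds, as described earlier in this thread, and takes the log time from the first error line with fractional seconds dropped. The helper name `csn_time` is made up for the example. The arithmetic alone does not identify the cause of the failure.

```python
from datetime import datetime, timedelta, timezone

PURGE_DELAY = timedelta(days=7)  # default purge interval quoted in this thread


def csn_time(csn: str) -> datetime:
    """Return the UTC time encoded in the first 8 hex digits of a CSN."""
    return datetime.fromtimestamp(int(csn[:8], 16), tz=timezone.utc)


# Values copied from the error log excerpt; fractional seconds dropped.
logged_at = datetime(2017, 11, 1, 17, 12, 52, tzinfo=timezone.utc)
missing = "59f9d98a000000760000"

age = logged_at - csn_time(missing)
print("CSN created:", csn_time(missing).isoformat())
print("age when the error was logged:", age)
print("older than purge delay:", age > PURGE_DELAY)
# The replica ID field is parsed as hex here, like the timestamp (an assumption).
print("replica ID (decimal):", int(missing[12:16], 16))
```

Under these assumptions the CSN was created at 14:26:18 UTC on the same day as the error, so it was under three hours old when the message was logged.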
Ok, what brought this up is that about every week
Ahh yes, this is the default replication purge interval (7 days)
https://access.redhat.com/documentation/en-US/Red_Hat_Directory_Server/8.1/html/Administration_Guide/Managing_Replication-Configuring-Replication-cmd.html
Look for nsDS5ReplicaPurgeDelay
It could also be changelog trimming:
http://www.port389.org/docs/389ds/FAQ/changelog-trimming.html
So what this is telling me is that one of your replication agreements was over a week behind from the other replicas (not good). Was that agreement disabled for a while, and then enabled, for some reason?
Not that I’m aware of. I’m using the repl-monitor script to monitor our replication, and everything is in line (no CSN mismatch) until all of a sudden that happens.
Since I’m not an expert on LDAP, do you mind posting the ldapsearch command to look up the value of nsDS5ReplicaPurgeDelay? I’m getting an empty value back. The subdirs of /var/lib/dirsrv/INSTANCE are:
bak cldb db ldif
Is cldb the changelog db?
On 11/03/2017 12:50 PM, Sergei Gerasenko wrote:
Since I’m not an expert on ldap, do you mind posting the ldapsearch command to look up the value of nsDS5ReplicaPurgeDelay. I’m getting an empty value back. The subdirs of /var/lib/dirsrv/INSTANCE are:
ldapsearch -D "cn=directory manager" -W -b cn=config objectClass=nsDS5Replica
bak cldb db ldif
Is cldb the changelog db?
Probably. You can name it whatever you want; the default is "changelogdbdir".
ldapsearch -D "cn=directory manager" -W -b cn=config objectClass=nsDS5Replica
nsDS5ReplicaPurgeDelay is not listed in the output :(. It must be at the default value of one week?
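One way to read the search result is to treat a missing nsDS5ReplicaPurgeDelay as "use the default". The sketch below assumes the default is the 7 days (604800 seconds) mentioned earlier in this thread. The sample entry is hypothetical and not output from any real server, and the helper name `purge_delay` is made up for the example.

```python
DEFAULT_PURGE_DELAY = 604800  # 7 days in seconds, the default cited in this thread


def purge_delay(ldif_entry: str) -> int:
    """Return nsDS5ReplicaPurgeDelay from one LDIF entry, or the default if absent."""
    for line in ldif_entry.splitlines():
        name, sep, value = line.partition(":")
        # LDAP attribute names are case-insensitive.
        if sep and name.strip().lower() == "nsds5replicapurgedelay":
            return int(value.strip())
    return DEFAULT_PURGE_DELAY


# Hypothetical ldapsearch output in which the attribute is not set.
sample = """\
dn: cn=replica,cn="dc=example,dc=com",cn=mapping tree,cn=config
objectClass: nsDS5Replica
nsDS5ReplicaRoot: dc=example,dc=com
nsDS5ReplicaId: 118
"""

print(purge_delay(sample))
print(purge_delay(sample + "nsDS5ReplicaPurgeDelay: 86400\n"))
```

The first call falls back to the default because the attribute is missing; the second returns the explicitly configured value.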
Also, you mentioned that the agreement might have been disabled. What field of the nsds5replicationagreement class shows that?
Given the error in the log, and the low likelihood of the agreement being disabled for a week, what else can cause a node not to find a CSN?
Thanks!!
On 11/03/2017 01:23 PM, Sergei Gerasenko wrote:
ldapsearch -D "cn=directory manager" -W -b cn=config objectClass=nsDS5Replica
nsDS5ReplicaPurgeDelay is not listed in the output :(. It must be at the default value of one week?
Also, you mentioned that the agreement might have been disabled. What field of the nsds5replicationagreement class shows that?
nsds5ReplicaEnabled
Given the error in the log, and the low likelihood of the agreement being disabled for a week, what else can cause a node not to find a CSN?
An agreement has to be disabled (and re-enabled) manually; it does not just happen.
Have you restored from a backup recently? A backup could contain an old database RUV, and when replication kicks in it can't find the updates it needs from the other replicas.
You need to look through all the logs to further troubleshoot this. For now I would get everyone in sync then monitor replication, and archive your logs for the next week. That way you have a full data set to investigate if something goes wrong.
What version of 389 are you on? rpm -qa | grep 389-ds-base
Also, you mentioned that the agreement might have been disabled. What field of the nsds5replicationagreement class shows that?
nsds5ReplicaEnabled
Thank you
Given the error in the log, and the low likelihood of the agreement being disabled for a week, what else can cause a node not to find a CSN?
Have you restored from a backup recently?
No
You need to look through all the logs to further troubleshoot this. For now I would get everyone in sync then monitor replication, and archive your logs for the next week. That way you have a full data set to investigate if something goes wrong.
Ok, I’ll try to plow through the logs. I might still have them.
What version of 389 are you on? rpm -qa | grep 389-ds-base
389-ds-base-libs-1.3.5.10-21.el7_3.x86_64
389-ds-base-1.3.5.10-21.el7_3.x86_64
What does this tell you:
[25/Oct/2017:18:16:43.389794105 +0000] connection - conn=167482 fd=121 Incoming BER Element was 3 bytes, max allowable is 2097152 bytes. Change the nsslapd-maxbersize attribute in cn=config to increase.
This is confusing: the element was 3 bytes, which is < 2097152, and the message was still logged.
On 11/03/2017 02:53 PM, Sergei Gerasenko wrote:
389-ds-base-libs-1.3.5.10-21.el7_3.x86_64
389-ds-base-1.3.5.10-21.el7_3.x86_64
Actually you might be running into a known bug which is fixed in 1.3.6 and up. Sorry 1.3.5/el7_3 is no longer supported or maintained.
What does this tell you:
[25/Oct/2017:18:16:43.389794105 +0000] connection - conn=167482 fd=121 Incoming BER Element was 3 bytes, max allowable is 2097152 bytes. Change the nsslapd-maxbersize attribute in cn=config to increase.
This is confusing: the element was 3 bytes, which is < 2097152, and the message was still logged.
This happens when you try to open an SSL connection on the non-secure port. We have a bug open to make that error message mean something useful (the message should be fixed in 1.3.7).
389-ds-base-libs-1.3.5.10-21.el7_3.x86_64
389-ds-base-1.3.5.10-21.el7_3.x86_64
Actually you might be running into a known bug which is fixed in 1.3.6 and up. Sorry 1.3.5/el7_3 is no longer supported or maintained.
Interesting! Can you link me to the bug?
What does this tell you:
[25/Oct/2017:18:16:43.389794105 +0000] connection - conn=167482 fd=121 Incoming BER Element was 3 bytes, max allowable is 2097152 bytes. Change the nsslapd-maxbersize attribute in cn=config to increase.
This is confusing: the element was 3 bytes, which is < 2097152, and the message was still logged.
This happens when you try to open an SSL connection on the non-secure port. We have a bug open to make that error message mean something useful (the message should be fixed in 1.3.7).
OK, so this is benign more or less?