I'm having another replication problem where changes made on a particular server are not being replicated outward at all. Right now, I'm trying to determine what's going on during the replication process.
(Caveat: I'm still running an old version of 389ds: v1.3.10. In particular, the dsconf utility does not exist.)
My understanding is that when a server receives a change from a client, it wraps it up as a CSN and starts a replication session with its peers, during which it sends a message that states the greatest CSN that it originated. First off, is that a correct understanding?
If so, how can I determine what CSN a particular server is telling its replication peers during those sessions? I have a feeling that this server is, for some reason, sending an inaccurate number.
In the cn=replica,cn=...,cn=mapping tree,cn=config tree, there are entries for each of the server's topology peers, and they contain nsds50ruv attributes that seem to be the RUVs that this server has received from those peers, right? But the nsds50ruv attribute also exists directly on the cn=replica entry if you explicitly ask for it. Is it possible that this is the server's own RUV?
Can I rely on the nsds50ruv attribute values in the cn=replica entries on this server's peers to be an accurate reflection of what this server is sending as its CSN in replication sessions?
Any other way to see what's going on in a replication session? (I'm even trying to decrypt a network capture, but I'm not having any luck with that yet.)
In particular, I see that the max CSN for this server in all of these RUVs is less than CSNs recorded in the server's own log files.
On 29 Feb 2024, at 05:20, William Faulk d4hgcdgdmj@liamekaens.com wrote:
I'm having another replication problem where changes made on a particular server are not being replicated outward at all. Right now, I'm trying to determine what's going on during the replication process.
(Caveat: I'm still running an old version of 389ds: v1.3.10. In particular, the dsconf utility does not exist.)
My understanding is that when a server receives a change from a client, it wraps it up as a CSN and starts a replication session with its peers, during which it sends a message that states the greatest CSN that it originated. First off, is that a correct understanding?
Might be worth re-reading https://lists.fedoraproject.org/archives/list/389-users@lists.fedoraproject....
It doesn't send a single CSN; the replication compares the RUVs and determines the range of CSNs that are missing from the consumer.
It's also not immediate. When the server accepts a change (add, mod, etc.), the change is associated with a CSN. But then there may be a delay before the two nodes actually communicate and exchange data.
If so, how can I determine what CSN a particular server is telling its replication peers during those sessions? I have a feeling that this server is, for some reason, sending an inaccurate number.
Generally you'd need replication logging (errorloglevel 8192). But it's very noisy and can be hard to read. What you need to see is the ranges that they agree to send.
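For reference, on 1.3 (no dsconf) that can be turned on with a plain ldapmodify against cn=config. A minimal sketch, with host and bind credentials as placeholders:

    ldapmodify -H ldap://localhost:389 -D "cn=Directory Manager" -W <<EOF
    dn: cn=config
    changetype: modify
    replace: nsslapd-errorlog-level
    nsslapd-errorlog-level: 8192
    EOF

Remember to set it back to its previous value afterwards, because at this level the error log grows very quickly.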
Also remember CSNs are a monotonic Lamport clock. This means they only ever advance and can never step backwards. So they have some different properties to what you may expect. If they ever go backwards I think the replication handler throws a pretty nasty error.
In the cn=replica,cn=...,cn=mapping tree,cn=config tree, there are entries for each of the server's topology peers, and they contain nsds50ruv attributes that seem to be the RUVs that this server has received from those peers, right? But the nsds50ruv attribute also exists directly on the cn=replica entry if you explicitly ask for it. Is it possible that this is the server's own RUV?
I *think* so. It's been a while since I had to look. The nsds50ruv shows the RUV of the server, and I think the other replica entries are "what the peer's RUV was last time". But I think Thierry or Pierre would know more about that than me. Some of the replication monitoring code in newer versions does this for you, so I'd probably advise you attempt to upgrade your environment. 1.3 is really old at this point (and I'm not sure if even RH or SUSE still support that version anymore).
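If you want to eyeball them without dsconf, something like this should work (host and bind credentials are placeholders; note nsds50ruv is only returned when you request it explicitly):

    # The server's own RUV, from the replica entry:
    ldapsearch -o ldif-wrap=no -H ldap://localhost -D "cn=Directory Manager" -W \
      -b "cn=mapping tree,cn=config" "(objectclass=nsds5replica)" nsds50ruv

    # The RUVs cached on the replication agreement entries:
    ldapsearch -o ldif-wrap=no -H ldap://localhost -D "cn=Directory Manager" -W \
      -b "cn=mapping tree,cn=config" "(objectclass=nsds5replicationagreement)" nsds50ruv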
Can I rely on the nsds50ruv attribute values in the cn=replica entries on this server's peers to be an accurate reflection of what this server is sending as its CSN in replication sessions?
Any other way to see what's going on in a replication session? (I'm even trying to decrypt a network capture, but I'm not having any luck with that yet.)
In particular, I see that the max CSN for this server in all of these RUVs is less than CSNs recorded in the server's own log files.
The problem here is that to read the RUVs and then compare them, you need to read each RUV from each server and then check that they are advancing (not that they are equal). See, it's okay if RUVs are not the same between two servers, because that can simply indicate that a server has accepted a write and not yet sent it to another node. In fact it's common in busy environments that every server has "slightly different state" because they have to continually replicate and converge.
For example, imagine some user A changes their password. Now that change has to propagate and converge between all the nodes in the topology. While that convergence is occurring, another user B could be changing their password. This can leave you with servers where:
* A and B passwords are original
* A password is changed, B original
* A password original, B changed
* A and B have been changed.
And all four of these states are valid!
If you want to assert that "Some change I made at CSN X is on all servers" then you would need to read and parse the ruv and ensure that all of them are at or past that CSN for that replica id.
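As a rough sketch of that check (hostnames, bind password, and the replica id are all placeholders here):

    BINDPW=changeme   # placeholder
    for host in ds1.example.com ds2.example.com ds3.example.com; do
      echo "== $host =="
      ldapsearch -o ldif-wrap=no -H "ldap://$host" -D "cn=Directory Manager" -w "$BINDPW" \
        -b "cn=mapping tree,cn=config" "(objectclass=nsds5replica)" nsds50ruv \
        | grep '{replica 4 '
    done

Each matching line should look like "{replica 4 ldap://...} <minCSN> <maxCSN>"; your change has converged once every server's maxCSN for that replica id is at or past the change's CSN.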
Either way - it's not trivial :)
-- Sincerely,
William Brown
Senior Software Engineer, Identity and Access Management SUSE Labs, Australia
Might be worth re-reading
Well, I still don't really know the details of the replication process.
I have deduced that changes originated on a replica seem to prompt that replica to start a replication process with its peers, but I don't really know what happens then. There's a comparison of the RUVs of the two replicas, but does the initiating system send its RUV to the receiver, or does it go the other way, or do both happen? Does the comparison prompt the comparing system to send the changes it thinks the other system needs, or does it cause the comparing system to request new changes from the other? Maybe none of this really makes much difference, but the lack of technical detail around this makes me just question everything.
It doesn't send a single CSN; the replication compares the RUVs and determines the range of CSNs that are missing from the consumer.
Sure, but notionally any changes that originated on that replica would be reflected in the max CSN for itself in the RUV that is used to compare. And at least one side is sending its RUV to the other during the replication process.
It's also not immediate. When the server accepts a change (add, mod, etc.), the change is associated with a CSN. But then there may be a delay before the two nodes actually communicate and exchange data.
Sure, but the changes originated on this replica haven't made it to other replicas in weeks. This isn't a mere delay in replication.
Generally you'd need replication logging (errorloglevel 8192). But it's very noisy and can be hard to read. What you need to see is the ranges that they agree to send.
Okay. I've done that and haven't had a chance to pore through them yet.
Also remember CSNs are a monotonic Lamport clock. This means they only ever advance and can never step backwards. So they have some different properties to what you may expect. If they ever go backwards I think the replication handler throws a pretty nasty error.
I don't think it's going backwards. What I'm trying to rule out is that the replica is failing to advance its max CSN in the RUV being used to compare.
I *think* so. It's been a while since I had to look. The nsds50ruv shows the RUV of the server, and I think the other replica entries are "what the peer's RUV was last time".
Well, it's at least nice to hear that my guess at least isn't asinine. :)
replication monitoring code in newer versions does this for you, so I'd probably advise you attempt to upgrade your environment. 1.3 is really old at this point
I've been trying to get the current environment stable enough that I feel comfortable going through the relatively lengthy upgrade process. I think I'm going to have to adjust my comfort level.
I'm not sure if even RH or SUSE still support that version anymore).
RedHat does, as it's what's in RHEL7.9, which is supported for another, uh, 4 months. They're working on this with me. I'm still just trying to understand the system better so that I can try to be productive while I'm waiting on them to come up with ideas.
The problem here is that to read the RUVs and then compare them, you need to read each RUV from each server and then check that they are advancing (not that they are equal).
The problem is that the changes in my environment are few enough that all the replicas' RUVs _are_ equal the majority of the time. I'm not in front of that system as I respond right now, so my details might be wrong, but I'm asking about all of this because every RUV I see in all of the replicas is the same, and it shows a max CSN for this one replica that's much older than the CSNs I see it reference in the logs about changes originating on the replica. The CSNs I see in the logs when a new change is made are referencing the current time in them, while the max CSN I see in the RUVs is from 4 months ago.
Maybe it *did* go backwards somehow and that's why it's not working. Not that that would really help me understand what actually went wrong any better than I do now.
If you want to assert that "Some change I made at CSN X is on all servers" then you would need to read and parse the ruv and ensure that all of them are at or past that CSN for that replica id.
Well, you'd think so. I've got that problem, too, where some CSNs just seem to get missed, but the max CSN in the RUV is well past that. But that's a different problem and not the one I'm working on now.
Thanks for the input.
On 2/29/24 05:12, William Faulk wrote:
Might be worth re-reading
Well, I still don't really know the details of the replication process.
I have deduced that changes originated on a replica seem to prompt that replica to start a replication process with its peers, but I don't really know what happens then.
Replication is done by a replication agreement that is woken up when new updates get into the changelog. The new updates can be received directly from an LDAP client or from replication itself.
There's a comparison of the RUVs of the two replicas, but does the initiating system send its RUV to the receiver, or does it go the other way, or do both happen?
IIRC only the remote replica sends its RUV. Then the RA receiving the RUV will compare it with its own RUV to detect the oldest update that the remote replica is missing.
Does the comparison prompt the comparing system to send the changes it thinks the other system needs, or does it cause the comparing system to request new changes from the other?
Yes, the RUV contains the latest received updates for all the replicas.
Maybe none of this really makes much difference, but the lack of technical detail around this makes me just question everything.
It makes perfect sense and shows you already have a deep understanding of the replication process.
It doesn't send a single CSN; the replication compares the RUVs and determines the range of CSNs that are missing from the consumer.
Sure, but notionally any changes that originated on that replica would be reflected in the max CSN for itself in the RUV that is used to compare. And at least one side is sending its RUV to the other during the replication process.
Yes, the remote replica (named the consumer, IIRC) sends its RUV back in response to the request sent by the RA.
It's also not immediate. When the server accepts a change (add, mod, etc.), the change is associated with a CSN. But then there may be a delay before the two nodes actually communicate and exchange data.
Sure, but the changes originated on this replica haven't made it to other replicas in weeks. This isn't a mere delay in replication.
Usually replication occurs within a few seconds. If changes are not replicated for weeks, then replication is broken and you need to identify the reason for the breakage in the replication debug logs on both sides (supplier/consumer).
Generally you'd need replication logging (errorloglevel 8192). But it's very noisy and can be hard to read. What you need to see is the ranges that they agree to send.
Okay. I've done that and haven't had a chance to pore through them yet.
Quite difficult to read, especially if there are multiple RAs in play. You may want to look at the code in parallel to understand the purpose of those messages.
Also remember CSNs are a monotonic Lamport clock. This means they only ever advance and can never step backwards. So they have some different properties to what you may expect. If they ever go backwards I think the replication handler throws a pretty nasty error.
I don't think it's going backwards. What I'm trying to rule out is that the replica is failing to advance its max CSN in the RUV being used to compare.
Comparison of RUVs: you need to dump the RUV on both servers (consumer/supplier), then compare the maxCSN per replica. The replication will start from the CSN that is the smallest of the maxCSNs. So a maxCSN may not move until all the others are in sync.
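A rough way to tabulate the per-replica maxCSNs from a dump, assuming the usual "{replica <id> <url>} <mincsn> <maxcsn>" element format (host and credentials are placeholders):

    ldapsearch -o ldif-wrap=no -H ldap://localhost -D "cn=Directory Manager" -W \
      -b "cn=mapping tree,cn=config" "(objectclass=nsds5replica)" nsds50ruv \
      | awk -F'[{}]' '/\{replica [0-9]/ { split($2, id, " "); n = split($3, csn, " ");
          print "rid " id[2] ": maxcsn " (n >= 2 ? csn[n] : "none") }'

Run it against both the supplier and the consumer and compare the maxCSN for each replica ID.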
I *think* so. It's been a while since I had to look. The nsds50ruv shows the RUV of the server, and I think the other replica entries are "what the peer's RUV was last time".
Well, it's at least nice to hear that my guess at least isn't asinine. :)
replication monitoring code in newer versions does this for you, so I'd probably advise you attempt to upgrade your environment. 1.3 is really old at this point
I've been trying to get the current environment stable enough that I feel comfortable going through the relatively lengthy upgrade process. I think I'm going to have to adjust my comfort level.
I'm not sure if even RH or SUSE still support that version anymore).
RedHat does, as it's what's in RHEL7.9, which is supported for another, uh, 4 months. They're working on this with me. I'm still just trying to understand the system better so that I can try to be productive while I'm waiting on them to come up with ideas.
The problem here is that to read the RUVs and then compare them, you need to read each RUV from each server and then check that they are advancing (not that they are equal).
The problem is that the changes in my environment are few enough that all the replicas' RUVs _are_ equal the majority of the time. I'm not in front of that system as I respond right now, so my details might be wrong, but I'm asking about all of this because every RUV I see in all of the replicas is the same, and it shows a max CSN for this one replica that's much older than the CSNs I see it reference in the logs about changes originating on the replica. The CSNs I see in the logs when a new change is made are referencing the current time in them, while the max CSN I see in the RUVs is from 4 months ago.
Maybe it *did* go backwards somehow and that's why it's not working. Not that that would really help me understand what actually went wrong any better than I do now.
Something important with the RUV is the 'replicageneration': it should be identical on both sides. For the problematic server, does the RUV evolve or not?
If you want to assert that "Some change I made at CSN X is on all servers" then you would need to read and parse the ruv and ensure that all of them are at or past that CSN for that replica id.
Well, you'd think so. I've got that problem, too, where some CSNs just seem to get missed, but the max CSN in the RUV is well past that. But that's a different problem and not the one I'm working on now.
Thanks for the input.
Hi William,
I don't think it's going backwards. What I'm trying to rule out is that
the replica is failing to advance its max CSN in the RUV being used to compare.
Since you see CSNs 4 months newer than the RUV, I think that your suspicion is right: the RUV is not updated any more. FYI: there is a list of pending operations to ensure that the RUV is not updated while an older operation is not yet completed, and I suspect that you hit a bug in that list. I remember that we fixed something in that area a few years ago. As the list is in memory, I think that simply restarting the server may fix the issue: the RUV should be updated after the next change and the old changes should then get replicated (unless the changelog gets discarded when restarting, in which case you will have to reinitialize the other replica).
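(On RHEL 7 that restart would be something like "systemctl restart dirsrv@YOURINSTANCE", with the instance name as a placeholder.)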
Regards Pierre
Thanks, Pierre and Thierry.
After quite some time of poring over these debug logs, I've found some anomalies and they seem like they're matching up with the idea that the affected replica isn't updating its own RUV correctly.
The logs show a change being made, and it lists the CSN of the change. The first anomalies are here, but they probably aren't terribly significant. The CSN includes a timestamp, and the timestamp on this CSN is 11 hours into the future from when the change was made and logged. Also, the next part of the CSN is supposed to be a serial number for when there are changes made during the same second of the timestamp. In the case I was looking at, that serial was 0xb231. I'm certain that this replica didn't record another 45000 changes in that second.
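For the record, my understanding is that a CSN string is 8 hex digits of timestamp, then 4 of sequence number, 4 of replica ID, and 4 of subsequence, so a quick shell decode (the CSN value here is a made-up example) looks like:

    csn=65e07e40b23100040000   # hypothetical example value
    date -u -d "@$((16#${csn:0:8}))"          # timestamp portion
    echo "seq=$((16#${csn:8:4})) rid=$((16#${csn:12:4})) subseq=$((16#${csn:16:4}))"

which is how I get 0xb231 = 45617 changes supposedly made in one second.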
Then it shows the server committing the change to the changelog. It shows it "processing data" for over 16000 other CSNs, and it takes about 25 seconds to complete.
It then starts a replication session with the peer and prints out the peer's (consumer's) RUV and then its own (supplier's) RUV. The RUV it prints out for itself shows the maxCSN for itself with a timestamp from almost 4 months ago. It is greater than the maxCSN for itself in the consumer's RUV, though, by a little. (The replicagenerations are equal, though.)
It then claims to send 7 changes, all of which are skipped because "empty". It then claims that there are "No more updates to send" and releases the consumer and eventually closes the connection.
I like the idea that there's a list of pending operations that's blocking RUV updates. Is there any way for me to examine this list? That said, I do think it updated its own maxCSN in its own RUV by a few hours. The peer I'm looking at does seem to reflect the increased maxCSN for the bad replica in the RUV I can see in the "mapping tree". I've tried to reproduce this small update, but haven't been able to yet.
I also have another replica that seems to be experiencing the same problem, and I've restarted it with no improvement in symptoms. It might be different, though. It doesn't look like it discarded its changelog.
I definitely don't relish reinitializing from this bad replica, though. I'd have to perform a rolling reinitialization throughout our whole environment, and it takes ages and a lot of effort.
On 2/29/24 21:31, William Faulk wrote:
Thanks, Pierre and Thierry.
After quite some time of poring over these debug logs, I've found some anomalies and they seem like they're matching up with the idea that the affected replica isn't updating its own RUV correctly.
The logs show a change being made, and it lists the CSN of the change. The first anomalies are here, but they probably aren't terribly significant. The CSN includes a timestamp, and the timestamp on this CSN is 11 hours into the future from when the change was made and logged. Also, the next part of the CSN is supposed to be a serial number for when there are changes made during the same second of the timestamp. In the case I was looking at, that serial was 0xb231. I'm certain that this replica didn't record another 45000 changes in that second.
Hi William,
Are you running DS on a VM, container, or HW? The fact that the CSN timestamp is some time in the future is not frequent but can happen. Generated CSNs should always be increasing, so CSN generation adjusts its timestamp based on received CSNs. What looks weird is how large the sequence number is. Do you have a full error log sample where we can see the sequence number moving to such a high number (0xb231)?
Then it shows the server committing the change to the changelog. It shows it "processing data" for over 16000 other CSNs, and it takes about 25 seconds to complete.
It then starts a replication session with the peer and prints out the peer's (consumer's) RUV and then its own (supplier's) RUV. The RUV it prints out for itself shows the maxCSN for itself with a timestamp from almost 4 months ago. It is greater than the maxCSN for itself in the consumer's RUV, though, by a little. (The replicagenerations are equal, though.)
IIUC the consumer is currently catching up. Is the RUV on the consumer evolving?
It then claims to send 7 changes, all of which are skipped because "empty". It then claims that there are "No more updates to send" and releases the consumer and eventually closes the connection.
Do you have fractional replication? (Some attributes are skipped from replication.)
I like the idea that there's a list of pending operations that's blocking RUV updates. Is there any way for me to examine this list? That said, I do think it updated its own maxCSN in its own RUV by a few hours. The peer I'm looking at does seem to reflect the increased maxCSN for the bad replica in the RUV I can see in the "mapping tree". I've tried to reproduce this small update, but haven't been able to yet.
Difficult to say. The pending list likely has a different meaning, in my understanding.
I also have another replica that seems to be experiencing the same problem, and I've restarted it with no improvement in symptoms. It might be different, though. It doesn't look like it discarded its changelog.
I definitely don't relish reinitializing from this bad replica, though. I'd have to perform a rolling reinitialization throughout our whole environment, and it takes ages and a lot of effort.
It's on a VM.
I don't have enough archived logs to show the progression of the serial number. However, I do have a text dump of the cldb, and I can filter it down to just the CSNs, and then to just the CSNs originated on this replica. The timestamp with the most CSNs has 752 of them; of the 3323 unique timestamps, only 13 have more than 100 CSNs, only 267 have 10 or more, and 1299 are just a single change.
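Roughly, the filtering was along these lines (the dump file name and replica ID are placeholders; it assumes CSNs appear in the dump as bare 20-hex-digit strings):

    # count originated CSNs per timestamp for replica id 0x0004
    grep -oE '\b[0-9a-f]{20}\b' cldb-dump.txt \
      | awk 'substr($0, 13, 4) == "0004" { print substr($0, 1, 8) }' \
      | sort | uniq -c | sort -rn | head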
Here's the list, if you really want to look: https://pastebin.com/muegmwzV
I can't come up with a rationale for the numbers, honestly. They should just start at zero for each unique timestamp, right?
IIUC the consumer is currently catching up. Is the RUV, on the consumer, evolving ?
Based on the one set of debug logs, yes, but I'm not sure if that's an anomaly or not. I haven't been able to see it move since then, but I'm keeping an eye on it.
Do you have fractional replication ?
Yes. This is actually part of an IdM/FreeIPA installation, so the regular things that are stripped out there:
nsds5ReplicaStripAttrs: modifiersName modifyTimestamp internalModifiersName internalModifyTimestamp
nsDS5ReplicatedAttributeList: (objectclass=*) $ EXCLUDE memberof idnssoaserial entryusn krblastsuccessfulauth krblastfailedauth krbloginfailedcount
nsDS5ReplicatedAttributeListTotal: (objectclass=*) $ EXCLUDE entryusn krblastsuccessfulauth krblastfailedauth krbloginfailedcount
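(I pulled those off the agreement entries with roughly the following; host and credentials are placeholders:)

    ldapsearch -o ldif-wrap=no -H ldap://localhost -D "cn=Directory Manager" -W \
      -b "cn=mapping tree,cn=config" "(objectclass=nsds5replicationagreement)" \
      nsDS5ReplicatedAttributeList nsDS5ReplicatedAttributeListTotal nsds5ReplicaStripAttrs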
FYI: There is a list of pending operations to ensure that the RUV is not updated while an older operation is not yet completed. And I suspect that you hit a bug about this list. I remember that we fixed something in that area a few years ago ...
I think I found it, or something closely related.
https://github.com/389ds/389-ds-base/pull/4553
There was an old RHEL-7.4 and RHEL-7.5 issue, fixed in 1.3.5.10-20: "replication halt - pending list first CSN not committed, pending list increasing" (https://bugzilla.redhat.com/1460070, https://github.com/389ds/389-ds-base/issues/2346). But you have a (somewhat) more recent version, 389-ds-base-1.3.10.2-10.el7_9.x86_64 (and one 389-ds-base-1.3.11.1-1.el7_9.x86_64), so this should be fixed. The issue could have been related to sub-operations.
The logs provided do show a replica with a large and growing pending list, taking several seconds to parse (in the double digits). And in this particular replication agreement, the consumer's max CSN is higher than the local supplier's, so the remote replica/consumer is flagged as "ignored"; then later in time some updates go through, as the RUV in the R.A. has been updated by a more knowledgeable replica. But this seems to repeat (strange).
I want to suggest deleting the changelog and re-initializing that replica, but maybe Thierry or Pierre or William B. have a better suggestion.
M.
On 1 Mar 2024, at 10:47, Marc Sauton msauton@redhat.com wrote:
I want to suggest deleting the changelog and re-initializing that replica, but maybe Thierry or Pierre or William B. have a better suggestion.
I'd want to see the layout of the topology and identify which is the erroneous server before we suggest a possible method of re-initialisation.
Remember, a re-init of the server db resets its changelog, so depending on the setup you may be able to reinit just the one failing server and proceed from there. The issue is if it has changes that other nodes don't have.
-- Sincerely,
William Brown
Senior Software Engineer, Identity and Access Management SUSE Labs, Australia
One problem with reinitializing that replica is that since it's successfully receiving changes from everywhere else and not sending its changes outward, it's the only one that has the most up-to-date data.
For what it's worth, the topology is that at each of my PoPs, I have a pair of replicas that are replicating with each other, and each of the pair is replicating with one of the pair at the neighbor PoPs. The PoP topology is basically a ring of 9 PoPs, call them A through I. Then there are another two PoPs that connect A and E. Then there are leaf PoPs that hang off of B, C, H, and I.
If that's not clear, let me know and I can draw a diagram.
I think Pierre may refer to http://www.port389.org/docs/389ds/design/csn-pending-lists-and-ruv-update.ht...
https://pagure.io/389-ds-base/issue/49287