[389-devel] fractional replication monitoring proposal

Mark Reynolds mareynol at redhat.com
Thu Oct 17 16:01:38 UTC 2013


Thanks everyone for your feedback!

Ok, I have written an initial fix; here is how it works and what I am 
seeing...

[1]  An update comes in, and we update the local RUV.
[2]  We check this update against the fractional/stripped attrs in each 
agmt.
[3]  If this update does replicate to at least one agmt, we write a new 
attribute to the local ruv (currently called "nsds50replruv" - we can 
improve the names later).  If it doesn't replicate to any replicas, then 
we don't update the new ruv attribute.  This all happens at the same 
time in write_changelog_and_ruv().  So there is no delay or copying of 
useless ruv info, and we write to the local RUV instead of a new RUV in 
cn=config (which I had originally proposed).  A sketch of this decision 
is below.
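
To make step [3] concrete, here is a minimal, self-contained C sketch of 
the decision (purely illustrative - the agreement names, attribute lists, 
and helper names are invented, and this is not the actual patch):

  /* Decide whether an update survives fractional stripping for at
   * least one agreement; only then advance the new nsds50replruv. */
  #include <stdio.h>
  #include <strings.h> /* strcasecmp */

  struct agmt {
      const char *name;
      const char *stripped[8]; /* attrs this agmt strips, NULL-terminated */
  };

  /* Return 1 if any modified attribute is NOT stripped by this agmt,
   * i.e. the update would actually be sent over this agreement. */
  static int replicates_to(const struct agmt *a, const char **mods)
  {
      for (int i = 0; mods[i]; i++) {
          int stripped = 0;
          for (int j = 0; a->stripped[j]; j++)
              if (strcasecmp(mods[i], a->stripped[j]) == 0)
                  stripped = 1;
          if (!stripped)
              return 1; /* at least one mod survives stripping */
      }
      return 0; /* every mod is stripped - nothing would be sent */
  }

  int main(void)
  {
      struct agmt agmts[] = {
          { "to-B", { "memberOf", "lastLoginTime", NULL } },
          { "to-C", { "memberOf", NULL } },
      };
      const char *mods[] = { "lastLoginTime", NULL }; /* one update */
      int advance = 0;
      for (int i = 0; i < 2; i++)
          if (replicates_to(&agmts[i], mods))
              advance = 1; /* the real fix does this in
                              write_changelog_and_ruv() */
      printf("advance nsds50replruv: %s\n", advance ? "yes" : "no");
      return 0;
  }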

[4]  Here we made an update that is stripped by fractional replication:

Master A:

  ldapsearch -h localhost -D cn=dm -w password -b "dc=example,dc=com" 
-xLLL 
'(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))' 
nsds50ruv nsds50replruv
dn: cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
nsds50ruv: {replica 1 ldap://localhost.localdomain:389} 
52583d80000000010000 52600339000000010000
nsds50replruv: {replica 1 ldap://localhost.localdomain:389} 
52583d80000000010000 5260030d000000010000
...

Master B

  ldapsearch -h localhost -D cn=dm -w password -b "dc=example,dc=com" 
-xLLL -p 22222 
'(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))' 
nsds50ruv nsds50replruv
dn: cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
nsds50ruv: {replica 1 ldap://localhost.localdomain:389} 
52583d80000000010000 5260030d000000010000
nsds50replruv: {replica 1 ldap://localhost.localdomain:389} 
52583d80000000010000 5260030d000000010000
...


[5]  If we look at the "fractional" ruv (nsds50replruv) on Master A, it 
does correctly line up with the ruv on master B (nsds50ruv).
[6]  Then we make an update that does replicate, and now all the RUVs 
line up (see the comparison sketch after the output below).

Master A

ldapsearch -h localhost -D cn=dm -w password -b "dc=example,dc=com" 
-xLLL 
'(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))' 
nsds50ruv nsds50replruv
dn: cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
nsds50ruv: {replica 1 ldap://localhost.localdomain:389} 
52583d80000000010000 52600790000000010000
nsds50replruv: {replica 1 ldap://localhost.localdomain:389} 
52583d80000000010000 52600790000000010000

Master B

ldapsearch -h localhost -D cn=dm -w password -b "dc=example,dc=com" 
-xLLL -p 22222 
'(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectclass=nstombstone))' 
nsds50ruv nsds50replruv
dn: cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
nsds50ruv: {replica 1 ldap://localhost.localdomain:389} 
52583d80000000010000 52600790000000010000
nsds50replruv: {replica 1 ldap://localhost.localdomain:389} 
52583d80000000010000 52600790000000010000
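
For reference, this is roughly the comparison an updated repl-monitor.pl 
would make for [5] and [6].  A CSN prints as 20 fixed-width hex digits 
(timestamp, seqnum, replica id, subseq - my reading of the output above), 
so for a given replica ID the maxCSN strings compare lexicographically.  
A sketch in C, with the maxCSN values taken from step [4]:

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      /* supplier's nsds50replruv maxCSN vs consumer's nsds50ruv maxCSN */
      const char *supplier_repl = "5260030d000000010000"; /* A */
      const char *consumer_ruv  = "5260030d000000010000"; /* B */
      int cmp = strcmp(supplier_repl, consumer_ruv);
      if (cmp == 0)
          printf("in sync\n");
      else if (cmp > 0)
          printf("consumer is behind\n");
      else
          printf("consumer is ahead (unexpected)\n");
      return 0;
  }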

There are still the same problems with this fix, as I mentioned before, 
except we're not updating the dse config.  Now, I am concerned about the 
performance hit of checking whether a mod gets "replicated".

As for the "sync" question, this fix doesn't change how that behaves, or 
how repl-monitor already works.  A replica is either behind (by a 
certain amount of time) or in sync.  I'm not trying to improve the 
current repl status model.
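
For what it's worth, the "behind by a certain amount of time" number 
falls straight out of the CSNs: if I read the format right, the leading 
8 hex digits are seconds since the epoch, so in step [4] Master A's 
nsds50replruv trails its nsds50ruv by 0x52600339 - 0x5260030d = 44 
seconds.  A tiny sketch:

  #include <stdio.h>
  #include <stdlib.h>

  /* Parse the leading 8-hex-digit timestamp of a CSN string. */
  static long csn_time(const char *csn)
  {
      char ts[9];
      snprintf(ts, sizeof(ts), "%.8s", csn);
      return strtol(ts, NULL, 16);
  }

  int main(void)
  {
      long lag = csn_time("52600339000000010000")  /* A: nsds50ruv     */
               - csn_time("5260030d000000010000"); /* A: nsds50replruv */
      printf("behind by %ld seconds\n", lag);      /* prints 44 */
      return 0;
  }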

Anyway, I just wanted to see if I could get this working.  Comments welcome.

Thanks again,
Mark

On 10/17/2013 05:44 AM, thierry bordaz wrote:
> On 10/17/2013 11:06 AM, Ludwig Krispenz wrote:
>>
>> On 10/17/2013 10:56 AM, thierry bordaz wrote:
>>> On 10/17/2013 10:49 AM, Ludwig Krispenz wrote:
>>>>
>>>> On 10/17/2013 10:15 AM, thierry bordaz wrote:
>>>>> On 10/16/2013 05:41 PM, Ludwig Krispenz wrote:
>>>>>>
>>>>>> On 10/16/2013 05:28 PM, Mark Reynolds wrote:
>>>>>>>
>>>>>>> On 10/16/2013 11:05 AM, Ludwig Krispenz wrote:
>>>>>>>>
>>>>>>>> On 10/15/2013 10:41 PM, Mark Reynolds wrote:
>>>>>>>>> https://fedorahosted.org/389/ticket/47368
>>>>>>>>>
>>>>>>>>> So we run into issues when trying to figure out if replicas 
>>>>>>>>> are in sync (if those replicas use fractional replication and 
>>>>>>>>> "strip mods").  What happens is that an update is made on 
>>>>>>>>> master A, but due to fractional replication there is no update 
>>>>>>>>> made to any replicas.  So if you look at the ruv in the 
>>>>>>>>> tombstone entry on each server, it would appear they are out 
>>>>>>>>> of sync.  So using the ruv in the db tombstone is no longer 
>>>>>>>>> accurate when using fractional replication.
>>>>>>>>>
>>>>>>>>> I'm proposing a new ruv to be stored in the backend replica 
>>>>>>>>> entry: e.g. cn=replica,cn="dc=example,dc=com",cn=mapping 
>>>>>>>>> tree,cn=config. I'm calling this the "replicated ruv".  So 
>>>>>>>>> whenever we actually send an update to a replica, this ruv 
>>>>>>>>> will get updated.
>>>>>>>> I don't see how this will help.  You have additional info on 
>>>>>>>> what has been replicated (which is available on the consumer as 
>>>>>>>> well) and you have a max csn, but you don't know if there are 
>>>>>>>> outstanding fractional changes to be sent.
>>>>>>> Well you will know on master A what operations get 
>>>>>>> replicated (this updates the new ruv before sending any changes), 
>>>>>>> and you can use this ruv to compare against the other master B's 
>>>>>>> ruv (in its replication agreement).  Maybe I am missing your point?
>>>>>> My point is that the question is: what is NOT yet replicated? 
>>>>>> Without fractional replication you have states of the ruv on all 
>>>>>> servers, and if ruv(A) > ruv(B) you know there are updates 
>>>>>> missing on B.  With fractional, if ruv(A) > ruv(B) this might be 
>>>>>> ok or not.  If you keep an additional ruv on A when sending 
>>>>>> updates to B, you can only record what was sent or attempted to 
>>>>>> send, but not what still has to be sent.
>>>>>
>>>>> I agree with you Ludwig, but unless I missed something, would it 
>>>>> not be enough to know that replica B is late or in sync?
>>>>>
>>>>> For example, we have updates U1, U2, U3 and U4.  U3 should be 
>>>>> skipped by fractional replication.
>>>>>
>>>>> The replica RUV (tombstone) on master_A contains U4, and master_B's 
>>>>> replica RUV contains U1.
>>>>> Let's assume that the initial value of the "replicated ruv" on 
>>>>> master_A is U1.
>>>>> Starting a replication session, master_A should send U2 and update 
>>>>> the "replicated ruv" to U2.
>>>>> If the update is successfully applied on master_B, master_B's 
>>>>> replica ruv is U2, and monitoring the two RUVs should show they 
>>>>> are in sync.
>>>> They are not, since U4 is not yet replicated.  On master_A you see 
>>>> the "normal" ruv as U4 and the "replicated" ruv as U2, but you 
>>>> don't know how many changes are between U2 and U4 and if any of 
>>>> them should be replicated.  The replicated ruv is more or less a 
>>>> local copy of the remote ruv.
>>>
>>> Yes, I agree they are not; this is a transient status.  Transient 
>>> because the RA will continue going through the changelog until it 
>>> hits U4.  At this point it will write U4 in the "replicated RUV", 
>>> and until master_B applies U4 both servers will appear out of sync.
>>> My understanding is that this "replicated RUV" only says whether it 
>>> is in sync or not, but does not address how far a server is out of 
>>> sync from the other (how many updates are missing).  When you say it 
>>> is more or less a copy, that is exactly what it is.  If it is a copy 
>>> => in sync; if it is different => out of sync.
>> Maybe we need to define what "in sync" means.  For me, in sync means 
>> both servers have the same set of updates applied.
>>
>> Forget fractional for a moment: if we have standard replication and 
>> master A is at U4 and master B is at U2, we say they are not in sync 
>> - or not?  You could keep a replicated ruv for those as well, but 
>> this wouldn't change things.
>
> I agree we need to agree of what "in sync" means :-)
>
> I would prefer to speak of the 'fractional ruv' (in place of 
> 'replicated ruv') for the new ruv proposed by Mark, with 'replica 
> ruv' being the traditional ruv (tombstone) used in standard 
> replication.
>
> With the 'replica ruv', we are in sync when the 'replica ruv' has the 
> same value on both sides.
> With the 'fractional ruv', we are in sync when the 'fractional ruv' on 
> the supplier and the 'replica ruv' have the same value.
>
> In fractional replication, we have updates U1, U2, U3 and U4.  Let's 
> say U3 and U4 are skipped by fractional replication.
> Say master_A's 'replica ruv' is U4 and master_B's 'replica ruv' is U2, 
> and there are no new updates.
> From a standard replication point of view they are out of sync, but 
> for fractional they are in sync.
>
> For fractional, how do we know that both masters are in sync?  With 
> Mark's solution, the 'fractional ruv' shows U2.
>
> Now a new update arrives, U5, that is not skipped by fractional.
> master_A's 'replica ruv' is U5 and master_B's 'replica ruv' is U2.
> Until the replica agreement starts a new replication session, the 
> 'fractional ruv' shows U2.
> The servers are shown 'in sync', because the RA has not yet started.
> From my understanding, the solution proposed by Mark has a drawback 
> where, for a transient period (the time for the RA to start its job, 
> evaluate and send U5, and store it in the 'fractional ruv'), the 
> servers will appear 'in sync' although they are not.  It could be an 
> issue with scheduled replication, but should be a transient wrong 
> status under normal conditions.
>
>>>
>>>>> If the update is not applied, master_B's replica ruv stays at U1, 
>>>>> and the two RUVs will show out of sync.
>>>>>
>>>>> In the first case, we have a transient status of 'in sync' because 
>>>>> the replica agreement will evaluate U3, then U4, then send U4 and 
>>>>> store it in the "replicated ruv".  At this point master_A and 
>>>>> master_B will appear out of sync until master_B applies U4.
>>>>> If U4 were to be skipped by fractional, we would have master_B's 
>>>>> ruv and master_A's replicated ruv both showing U2, and that is 
>>>>> correct: both servers are in sync.
>>>>>
>>>>> Mark, instead of storing the replicated ruv in the replica, would 
>>>>> it not be possible to store it in the replica agreement (one 
>>>>> replicated ruv per RA)?  That way it could solve the problem of 
>>>>> different fractional replication policies.
>>>>>
>>>>>>> Do you mean changes that have not been read from the changelog 
>>>>>>> yet?  My plan was to update the new ruv in perform_operation() - 
>>>>>>> right after all the "stripping" has been done and there is 
>>>>>>> something to replicate.  We need to have a ruv for replicated 
>>>>>>> operations.
>>>>>>>
>>>>>>> I guess there are other scenarios I didn't think of, like if 
>>>>>>> replication is in a backoff state and valid changes are coming 
>>>>>>> in.  Maybe we could test "stripping" earlier in the replication 
>>>>>>> process (when writing to the changelog?), and then update the 
>>>>>>> new ruv there instead of waiting until we try to send the 
>>>>>>> changes.
>>>>>>>>> Since we cannot compare this "replicated ruv" to the replica's 
>>>>>>>>> tombstone ruv, we can instead compare the "replicated ruv" to 
>>>>>>>>> the ruv in the replica's repl agreement (unless it is a 
>>>>>>>>> dedicated consumer - here we might be able to still look at 
>>>>>>>>> the db tombstone ruv to determine the status).
>>>>>>>>>
>>>>>>>>> Problems with this approach:
>>>>>>>>>
>>>>>>>>> -  All the servers need to have the same replication 
>>>>>>>>> configuration (the same fractional replication policy and 
>>>>>>>>> attribute stripping) to give accurate results.
>>>>>>>>>
>>>>>>>>> -  If one replica has an agreement that does NOT filter the 
>>>>>>>>> updates, but has agreements that do filter updates, then we 
>>>>>>>>> cannot correctly determine its synchronization state with the 
>>>>>>>>> fractional replicas.
>>>>>>>>>
>>>>>>>>> -  Performance hit from updating another ruv (in cn=config)?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Fractional replication simply breaks our monitoring process.  
>>>>>>>>> I'm not sure we can cover all deployment scenarios (mixed 
>>>>>>>>> fractional repl agmts, etc.) without updating the repl 
>>>>>>>>> protocol.  However, I "think" this approach would work for 
>>>>>>>>> most deployments (compared to none at the moment).  For IPA, 
>>>>>>>>> since they don't use consumers, this approach would work for 
>>>>>>>>> them.  And finally, all of this would have to be handled by an 
>>>>>>>>> updated version of repl-monitor.pl.
>>>>>>>>>
>>>>>>>>> This is just my preliminary idea on how to handle this.  
>>>>>>>>> Feedback is welcome!!
>>>>>>>>>
>>>>>>>>> Thanks in advance,
>>>>>>>>> Mark

-- 
Mark Reynolds
389 Development Team
Red Hat, Inc
mreynolds at redhat.com
