Hi,
I agree that it is a complex task to master such a FreeIPA deployment.
FreeIPA is built from many components, 389ds being just one of them, and
several of them can contribute when a problem occurs. My main concern here
is that you express a need to monitor how well the FreeIPA deployment works
in general, rather than pointing at a clear misbehavior of the topology
that we could focus on.
Replication is an important piece of 389ds/FreeIPA functionality, and
monitoring replication is a common demand. One frequent request is to
monitor replication lag (how much time it takes for the topology to
converge, i.e. for an update to be replicated to all replicas). There are
several ways to monitor that, but I think an easy way is to rely on the
dirsrv access logs. Each replicated update is uniquely identified by its
CSN, and you will find values like 'csn=57eb7dbc000000600000' in the logs.
Grepping that value across the access logs of all instances can give you
an indication of the lag. Lag is typically in the range of seconds to 1-2
minutes; if it spikes to many minutes, or an update never hits some
replica, then you can start investigating why it is slow.
A difficulty with that procedure is that some updates are not replicated
(fractional replication).
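If it helps, here is a rough, untested sketch of that idea in Python. It
assumes you have already copied the access logs from all instances into one
local directory (one file per replica); the log paths, the way you collect
them and the timestamp parsing are assumptions you will likely need to adapt:

#!/usr/bin/env python3
# Rough sketch: estimate replication lag for one CSN by comparing the first
# time it appears in each instance's access log. Assumes the logs were
# collected locally into one directory, one file per replica (adjust to
# your own collection method and log format).
import re
import sys
from datetime import datetime
from pathlib import Path

# 389ds access log lines start with a timestamp like
# [20/Apr/2023:16:10:23.123456789 +0200] ... ; the sub-second part is ignored.
TS_RE = re.compile(r'^\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2})')

def first_seen(logfile: Path, csn: str):
    """Return the first timestamp at which `csn` shows up in this log, or None."""
    with logfile.open(errors="replace") as fh:
        for line in fh:
            if csn in line:
                m = TS_RE.match(line)
                if m:
                    return datetime.strptime(m.group(1), "%d/%b/%Y:%H:%M:%S")
    return None

def main():
    if len(sys.argv) != 3:
        sys.exit("usage: csn_lag.py <csn> <dir-with-access-logs>")
    csn, logdir = sys.argv[1], Path(sys.argv[2])
    seen = {}
    for logfile in sorted(logdir.glob("*")):
        if not logfile.is_file():
            continue
        ts = first_seen(logfile, csn)
        if ts is None:
            print(f"{logfile.name}: CSN not found (fractional update, or not converged yet?)")
        else:
            seen[logfile.name] = ts
    if seen:
        origin = min(seen.values())
        for name, ts in sorted(seen.items(), key=lambda kv: kv[1]):
            print(f"{name}: {ts}  (+{(ts - origin).total_seconds():.0f}s)")

if __name__ == "__main__":
    main()

Running something like this for a handful of recent CSNs, a few times a day,
already gives a rough lag trend per replica.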
Investigations into replication issues are quite complex and difficult to
explain in general. This forum is a good place to get answers to specific
questions.
best regards
Thierry
On 4/20/23 16:10, dweller dweller wrote:
Yes, I'll try to explain my needs more clearly. As happens a lot, I
recently inherited a FreeIPA installation and am now responsible for
managing the service. As someone who was not previously familiar with
FreeIPA, I am in the process of building my expertise in managing it.
When I started, the monitoring setup consisted of node_exporter and
process_exporter for the host, and 389ds_exporter
(https://github.com/terrycain/389ds_exporter) for the LDAP data. However,
as the FreeIPA installation grew in size, we started encountering issues
and realized that we lacked critical information to pinpoint the root
causes of these problems. To address this, I have taken steps to improve
the monitoring setup: I have started monitoring FreeIPA's bind service with
a separate exporter and exporting DNS queries to OpenSearch. Additionally,
I have rewritten the 389ds_exporter to include cn=monitor metrics, to
provide more visibility into the 389 Directory Server.
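For illustration, the data involved is just a couple of base searches against
the monitoring entries (cn=monitor, plus the 'cn=ldbm database' one mentioned
just below); roughly like this, where the host, credentials and the use of
the ldap3 module are only placeholders for what the exporter actually does:

#!/usr/bin/env python3
# Minimal sketch of pulling cn=monitor counters out of 389ds with ldap3.
# Host name and bind credentials are placeholders; the real exporter does
# more than this (parsing values, exposing them as Prometheus metrics, ...).
from ldap3 import Server, Connection, BASE, ALL_ATTRIBUTES

server = Server("ldaps://ipa-replica-01.example.com")   # placeholder host
conn = Connection(server, user="cn=Directory Manager",
                  password="secret", auto_bind=True)    # placeholder creds

# Server-wide counters (connections, operations, threads, ...)
conn.search("cn=monitor", "(objectClass=*)", search_scope=BASE,
            attributes=ALL_ATTRIBUTES)
print(conn.entries[0])

# Database-level (ldbm) counters live under cn=ldbm database, e.g. cache
# hit ratios and page sizes.
conn.search("cn=monitor,cn=ldbm database,cn=plugins,cn=config",
            "(objectClass=*)", search_scope=BASE, attributes=ALL_ATTRIBUTES)
print(conn.entries[0])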
I recently realized that I could also include 'cn=ldbm database' metrics, which
are low-level but could be useful in troubleshooting the issues we are facing. The
problems we are encountering are related to disk IO, and having these metrics could
provide valuable insights into the following:
1) Excessive paging out and increased swap usage without spikes in load. For
example, after restarting a replica, swap usage increases to 30% (of 3GB of
swap space) over 1-2 days, even though at least 4GB of available RAM is
present on the host, and the main swap consumer is the ns-slapd service. So
far I have only tried setting the swappiness parameter to zero, which did
not help, so I guess there are other factors involved.
2) Spikes in IO latency observed during modify and add operations, which were
not present when the cluster was smaller (up to 10 replicas). I need to
determine whether the issue lies with service tuning or with the cloud
provider and its SAN, as we recently migrated to SSD disks without
improvement. As I said about "replication lag", those problems started
appearing more often as new replicas were added, but for now we mostly
observe them as outages of the services that rely on LDAP. The "waves" refer
to the way the problems appear: different clients' VDCs run into problems
one after the other, which looks like replication propagation.
3) Master-master replication just seems to me like a big "black cloud" that I
have no control over or insight into. When you have a couple of hosts it may
be fine to rely on the documented way of looking up the replicationStatus
attribute, but when you have a couple of dozen, I suspect things get much
less straightforward, or at least intuition suggests so. When I talk about
replication observability, what I mean and what I'd like to see is the
following:
Graph representation...
- ...of the time it took to replay a change (or, I guess, the time of a full
replication session)
- ...of the number of simultaneous connections that suppliers are trying to
establish with a consumer
- ...of the time spent waiting to acquire replica access
I just listed a few off the top of my head. I don't know for sure (and my
first post was about this) whether it is really worth trying to get those
kinds of metrics, or whether I just don't know what I'm talking about and it
would be a waste of time and hard to implement. I mentioned bpf because I
see it as the only way I could get them; the other option is to parse logs
at DEBUG level, which is not really an option.
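Though, thinking about it, part of this might be approximated by just polling
the replication agreement entries under cn=config. A rough, untested sketch
(placeholder host/credentials, standard nsds5replicationagreement attributes):

#!/usr/bin/env python3
# Rough sketch: poll per-agreement replication status attributes instead of
# parsing DEBUG logs. Host and credentials are placeholders; the attributes
# are the usual ones found on nsds5replicationagreement entries.
from ldap3 import Server, Connection, SUBTREE

AGMT_ATTRS = [
    "nsDS5ReplicaHost",
    "nsds5replicaLastUpdateStart",
    "nsds5replicaLastUpdateEnd",
    "nsds5replicaLastUpdateStatus",
    "nsds5replicaUpdateInProgress",
    "nsds5replicaChangesSentSinceStartup",
]

conn = Connection(Server("ldaps://ipa-replica-01.example.com"),
                  user="cn=Directory Manager", password="secret",
                  auto_bind=True)                        # placeholder creds

conn.search("cn=config", "(objectClass=nsds5replicationagreement)",
            search_scope=SUBTREE, attributes=AGMT_ATTRS)

for entry in conn.entries:
    # Each entry describes one outgoing agreement on this supplier; exporting
    # these values per host would already give a per-edge view of the topology.
    print(entry.entry_dn)
    for attr, values in entry.entry_attributes_as_dict.items():
        print(f"  {attr}: {values}")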
With replication metrics, besides being able to see their impact on the
problems above, I'm also trying to solve a more administrative task - I need
to convince the architecture department to change the model for adding new
replicas. Right now we basically add two replicas for every new client:
+------------------------------+
| client#1 VDC                 |
|                              |
|  +--------------+            |    +---------------------+      +---------------------+
|  |  replica-01  +<--------------->+  common-replica-01  +<---->+  common-replica-02  |  ...
|  +--------------+            |    +---------------------+      +---------------------+
|        ^                     |              ^                            ^
|        |                     |              |                            |
|        v                     |              v                            v
|  +--------------+            |    +---------------------+      +---------------------+
|  |  replica-02  |            |    |  common-replica-03  +<---->+  common-replica-04  |
|  +--------------+            |    +---------------------+      +---------------------+
|                              |
+------------------------------+
This is not ideal at all (and, as I said, we have started to face problems).
From their side, the answer is that they are following the documented
restrictions of no more than 4 replication agreements per replica and no
more than 60 replicas in master-master replication. For now these limits are
indeed being respected, so I need to come up with a deeper analysis or show
that the problem lies in fine-tuning the service.
So it's kind of a mishmash of everything at the same time; I hope I answered
your question.
best regards,
v.zh