Re: Chain on Update problem
by Grant Byers
Confirmed. I made the following simple change and it allows cb_ping_farm to work with anonymous binds enabled only for the rootDSE:
diff -urN a/ldap/servers/plugins/chainingdb/cb_conn_stateless.c b/ldap/servers/plugins/chainingdb/cb_conn_stateless.c
--- a/ldap/servers/plugins/chainingdb/cb_conn_stateless.c 2020-03-17 04:52:57.000000000 +1000
+++ b/ldap/servers/plugins/chainingdb/cb_conn_stateless.c 2021-03-08 14:04:48.413647052 +1000
@@ -883,7 +883,7 @@
/* NOTE: This will fail if we implement the ability to disable
anonymous bind */
- rc = ldap_search_ext_s(ld, NULL, LDAP_SCOPE_BASE, "objectclass=*", attrs, 1, NULL,
+ rc = ldap_search_ext_s(ld, "", LDAP_SCOPE_BASE, "objectclass=*", attrs, 1, NULL,
NULL, &timeout, 1, &result);
if (LDAP_SUCCESS != rc) {
slapi_ldap_unbind(ld);
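For anyone who wants to test the same probe outside the server, here's a minimal standalone sketch of it using OpenLDAP's libldap (the URL and timeout below are placeholders, not values from the plugin):

/* rootdse_ping.c - minimal sketch of cb_ping_farm's health check,
 * searching base "" (the rootDSE) instead of NULL so it still works
 * when anonymous access is restricted to the rootDSE. */
#include <stdio.h>
#include <ldap.h>

int main(void)
{
    LDAP *ld = NULL;
    LDAPMessage *result = NULL;
    struct timeval timeout = {5, 0};   /* hypothetical 5s ping timeout */
    char *attrs[] = {"1.1", NULL};     /* "1.1" = request no attributes */
    int version = LDAP_VERSION3;
    int rc;

    if (ldap_initialize(&ld, "ldaps://master1.example.net:636") != LDAP_SUCCESS)
        return 1;
    ldap_set_option(ld, LDAP_OPT_PROTOCOL_VERSION, &version);

    /* Same call shape as the patched line in cb_conn_stateless.c. */
    rc = ldap_search_ext_s(ld, "", LDAP_SCOPE_BASE, "objectclass=*", attrs, 1,
                           NULL, NULL, &timeout, 1, &result);
    if (rc != LDAP_SUCCESS) {
        fprintf(stderr, "farm server down: %s\n", ldap_err2string(rc));
    } else {
        printf("farm server up\n");
        ldap_msgfree(result);
    }
    ldap_unbind_ext_s(ld, NULL, NULL);
    return rc == LDAP_SUCCESS ? 0 : 1;
}

Compile with something like: cc rootdse_ping.c -o rootdse_ping -lldap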
I don't believe this will break any functionality, but since we're running RHEL7, I'll raise this with Red Hat directly and they can review and/or push upstream.
Regards,
Grant
________________________________
From: Grant Byers <Grant.Byers(a)aarnet.edu.au>
Sent: Monday, March 8, 2021 12:27 PM
To: 389-users(a)lists.fedoraproject.org <389-users(a)lists.fedoraproject.org>
Subject: Re: [389-users] Re: Chain on Update problem
Thanks.
I have tested various combinations of the tuning params without success. I've done further debugging and confirmed that it always starts after a bind operation timeout. Looking into the chaining plugin code, I see that an operation timeout results in a call to cb_ping_farm, which checks whether another server in the pool is available. However, it performs this check as follows (the comment is telling):
/* NOTE: This will fail if we implement the ability to disable
anonymous bind */
rc = ldap_search_ext_s(ld, NULL, LDAP_SCOPE_BASE, "objectclass=*", attrs, 1, NULL,
NULL, &timeout, 1, &result);
if (LDAP_SUCCESS != rc) {
slapi_ldap_unbind(ld);
cb_update_failed_conn_cpt(cb);
return LDAP_SERVER_DOWN;
}
So basically, because we've disallowed anonymous bind for anything but the rootDSE, it will always fail to find another available server. I confirmed this by allowing anonymous bind on our masters while the issue was present, after which binds on the consumers started working again.
I would think it more appropriate for that code to search the rootDSE instead. Is there any good reason why it shouldn't? If not, I might test modifying it.
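For what it's worth, the difference is easy to demonstrate from the command line (hostname and suffix are placeholders). An anonymous base search of the rootDSE should succeed, while the same anonymous search against the suffix is rejected when anonymous access is limited to the rootDSE:

ldapsearch -x -H ldaps://master1:636 -s base -b "" "(objectclass=*)" namingContexts
ldapsearch -x -H ldaps://master1:636 -s base -b "<suffix>" "(objectclass=*)" 1.1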
Thanks,
Grant
________________________________
From: William Brown <wbrown(a)suse.de>
Sent: Friday, March 5, 2021 3:52 PM
To: 389-users(a)lists.fedoraproject.org <389-users(a)lists.fedoraproject.org>
Subject: [389-users] Re: Chain on Update problem
> On 5 Mar 2021, at 12:03, Grant Byers <Grant.Byers(a)aarnet.edu.au> wrote:
>
> Hi All,
>
> Version: 1.3.10
>
> In our environment, we'd like to use a chaining backend to push BIND operations up to masters by way of the consumer (rather than client referral). We'd like to do this to ensure password lockout attributes are propagated to all consumers equally via our standard replication agreements. This is described here - https://directory.fedoraproject.org/docs/389ds/howto/howto-chainonupdate.....
>
> NOTE, we do not have hubs in our topology. Just masters and consumers, so no intermediate chaining.
>
> We tested this process in our environment and it worked beautifully until we took it to production. Currently, we have just 2 masters and they are both sitting on some over-subscribed hardware that suffers from I/O starvation at certain times of the day. The plan is to scale out our masters eventually, but we're a little hamstrung with other projects and priorities. It worked extremely well until that time of day when masters suffered from I/O starvation, and hence, very long I/O wait times. This is generally short-lived and happens at alternate times of the day for each of the masters. However, it seems that once both nsfarmservers have "failed", there is never any attempt by the consumer to retry them. This leads to bind errors as follows:
>
> ldapwhoami -x -D "<binddn>" -W
> Enter LDAP Password:
> ldap_bind: Operations error (1)
> additional info: FARM SERVER TEMPORARY UNAVAILABLE
>
> Except it is not temporary. It never recovers, even though all members of nsfarmservers are now healthy again (and are never unhealthy at the same time). We can confirm this by performing binds from the consumers directly against the masters. I thought that setting nsConnectionLife to something larger than 0 (indefinite) would help this, but it has not.
The chain on update appears to use the chaining plugin timeouts, so you could look at adjusting these parameters, which may help (a sketch follows the list):
nsBindTimeout
nsOperationTimeout
nsBindRetryLimit
nsMaxResponseDelay
nsMaxTestResponseDelay
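For example, something like this against the chaining backend entry (the attribute names are from the chaining config; the values are illustrative assumptions, not tuned recommendations):

dn: cn=chainbe1,cn=chaining database,cn=plugins,cn=config
changetype: modify
replace: nsBindTimeout
nsBindTimeout: 15
-
replace: nsBindRetryLimit
nsBindRetryLimit: 3
-
replace: nsMaxTestResponseDelay
nsMaxTestResponseDelay: 30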
>
> Is this by design, a bug, or an implementation fault on my behalf? Configuration below:
>
> Thanks,
> Grant
>
>
>
> ## On masters, create a dedicated user for chaining backend
> dn: cn=proxy,cn=config
> objectClass: person
> objectClass: top
> cn: proxy
> sn: admin
>
> ## On all consumers, create chaining backend;
> dn: cn=chainbe1,cn=chaining database,cn=plugins,cn=config
> objectclass: top
> objectclass: extensibleObject
> objectclass: nsBackendInstance
> nsslapd-suffix: <suffix>
> nsfarmserverurl: ldaps://<master1>:636 <master2>:636/
> nsMultiplexorBindDN: <binddn>
> nsMultiplexorCredentials: <bindpw>
> nsCheckLocalACI: on
> nsConnectionLife: 30
> cn: chainbe1
>
> ## On all consumers, add the backend and repl_chain_on_update function
> dn: cn="<suffix>",cn=mapping tree,cn=config
> changetype: modify
> add: nsslapd-backend
> nsslapd-backend: chainbe1
> -
> add: nsslapd-distribution-plugin
> nsslapd-distribution-plugin: libreplication-plugin
> -
> add: nsslapd-distribution-funct
> nsslapd-distribution-funct: repl_chain_on_update
>
> ## On all servers, enable global password policy
> dn: cn=config
> changetype: modify
> replace: passwordIsGlobalPolicy
> passwordIsGlobalPolicy: on
>
—
Sincerely,
William Brown
Senior Software Engineer, 389 Directory Server
SUSE Labs, Australia
3 years, 1 month
Re: [EXTERNAL] Re: Replication delay, connection blocking ending in closed - B1
by Pierre Rogier
Hi Colin,
The important point in what you describe is that master2 keeps a replication session open (i.e. no EXT op 2.16.840.1.113730.3.5.5) without sending any updates. So the problem is either on master2 or on the network. My guess is that master2 has trouble determining the next change to send. A possible scenario is that there is a huge changelog and it is walked from the start (as described in github issue 4644 ...).
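If that scenario applies, one common mitigation is to cap changelog growth with trimming. A sketch for the 1.3.x changelog entry (the 7d value here is an arbitrary assumption, not a recommendation):

dn: cn=changelog5,cn=config
changetype: modify
replace: nsslapd-changelogmaxage
nsslapd-changelogmaxage: 7d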
Regards,
Pierre
On Thu, Mar 4, 2021 at 8:36 PM Colin Tulloch <Colin.Tulloch(a)entrust.com>
wrote:
> I tried it just now but it was way too verbose - we filled up 500mb of
> error logs in 15 minutes.
>
> We have a lot more space, but it could take hours before we see another
> failure (these delays intermittently cause an application to fail).
> Unfortunately we can't re-produce on demand.
>
>
> -----Original Message-----
> From: William Brown [mailto:wbrown@suse.de]
> Sent: Wednesday, March 03, 2021 8:38 PM
> To: 389-users(a)lists.fedoraproject.org
> Subject: [EXTERNAL] [389-users] Re: Replication delay, connection blocking
> ending in closed - B1
>
> Can you turn on replication logging? I think it's level 8192 in the
> errorlog-level.
>
> > On 4 Mar 2021, at 11:12, Colin Tulloch <Colin.Tulloch(a)entrust.com>
> wrote:
> >
> > Hello –
> >
> > We are seeing an issue where changes can be very slow to replicate to
> one of our consumers (up to 15m+). We have a large topology, but in this
> case the issue is isolated between 2 masters that replicate to 1 consumer.
> >
> > In one example entry addition I found, it appears that we see;
> >
> > - connection from master1 to consumer1, bunch of changes pushed
> > - then master2 connects to consumer1
> > - master1 stops doing changes but stays connected
> > - master2 does literally 1 change (success, no issues), stays
> connected
> > - 16 minutes goes by, no additional changes or replication EXT
> ops for that connection are done (master1->consumer1 EXT ops continue
> normally…)
> > - after that long pause, master2 disconnects - B1 bad BER tag
> code
> > - and then Master1 resumes making tons of changes
> >
> > Searches in this DB and others continue to take place, and changes in
> other DBs. So it wasn’t as if the server was unresponsive/hung. It is
> almost as if that DB went “read-only” for a time – I’m unable to tell if
> something else besides replication was attempting but unable to make writes
> though.
> >
> >
> > Anyone see something like this before? We see lots of B1 codes
> randomly, I’ve never understood what may cause that – the description of
> corruption/physical network problems does not make much sense. Maybe if
> > our directories were replicating to each other over the internet, or using
> Wifi….
> >
> > The time it takes for that connection to end in the B1 doesn’t seem to
> > line up with any dirsrv OR system/TCP timeouts either
> >
> >
> > Nothing illuminating in the error logs really.
> >
> > Log snips of this happening;
> >
> > [03/Mar/2021:13:50:34.395939343 -0600] conn=270076 fd=280 slot=280 connection from master1 to consumer1
> > ...
> > [03/Mar/2021:13:50:34.608149165 -0600] conn=270076 op=13 MOD dn="cn=CRLblahblah,c=US"
> > [03/Mar/2021:13:50:34.609565373 -0600] conn=270076 op=13 RESULT err=0 tag=103 nentries=0 etime=0.001451229 csn=603febdd00035aae0000
> > [03/Mar/2021:13:50:34.818645762 -0600] conn=270076 op=14 EXT oid="2.16.840.1.113730.3.5.5" name="replication-multimaster-extop"
> > [03/Mar/2021:13:50:34.820544342 -0600] conn=270076 op=14 RESULT err=0 tag=120 nentries=0 etime=0.002004143
> >
> > [03/Mar/2021:13:50:34.460210520 -0600] conn=270077 fd=307 slot=307 connection from master2 to consumer1
> > [03/Mar/2021:13:50:34.460676562 -0600] conn=270077 op=0 BIND dn="cn=replication manager,cn=config" method=128 version=3
> > [03/Mar/2021:13:50:34.460992487 -0600] conn=270077 op=0 RESULT err=0 tag=97 nentries=0 etime=0.000376400 dn="cn=replication manager,cn=config"
> > <snipped replication startup jargon>
> > [03/Mar/2021:13:50:35.715331361 -0600] conn=270077 op=5 MOD dn="cn=CRLblahblah,c=US"
> > [03/Mar/2021:13:50:35.717330540 -0600] conn=270077 op=5 RESULT err=0 tag=103 nentries=0 etime=0.002066162 csn=603febdd00045aae0000
> > ...
> > [03/Mar/2021:13:50:36.828647054 -0600] conn=270076 op=15 EXT oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
> > [03/Mar/2021:13:50:36.828888493 -0600] conn=270076 op=15 RESULT err=0 tag=120 nentries=0 etime=0.000368293
> > [03/Mar/2021:13:50:37.112334122 -0600] conn=270076 op=16 EXT oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
> > [03/Mar/2021:13:50:37.112782309 -0600] conn=270076 op=16 RESULT err=0 tag=120 nentries=0 etime=0.000624719
> > [03/Mar/2021:13:50:38.113792312 -0600] conn=270076 op=17 EXT oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
> > [03/Mar/2021:13:50:38.114092751 -0600] conn=270076 op=17 RESULT err=0 tag=120 nentries=0 etime=0.000476209
> > ...
> > continued EXT ops on conn=270076, from master1
> > ...
> > [03/Mar/2021:14:06:07.872623403 -0600] conn=270077 op=-1 fd=307 closed - B1
> > [03/Mar/2021:14:06:07.916640931 -0600] conn=270076 op=1106 EXT oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
> > [03/Mar/2021:14:06:07.917191372 -0600] conn=270076 op=1106 RESULT err=0 tag=120 nentries=0 etime=0.000662946
> > [03/Mar/2021:14:06:07.921837193 -0600] conn=270076 op=1107 MOD dn="cn=CRLblahblah,c=US"
> > [03/Mar/2021:14:06:07.923754219 -0600] conn=270076 op=1107 RESULT err=0 tag=103 nentries=0 etime=0.002036469 csn=603febdd00065aae0000
> > and a flood of changes now
> > ...
> >
> >
> > Colin Tulloch
> > Architect, USmPKI
> > colin.tulloch(a)entrust.com
> >
>
> —
> Sincerely,
>
> William Brown
>
> Senior Software Engineer, 389 Directory Server
> SUSE Labs, Australia
>
--
389 Directory Server Development Team
3 years, 1 month
Unindexed search even on indexed database
by Jan Tomasek
Hello,
I'm worried about log lines like this:
[04/Mar/2021:10:08:47.982170561 +0100] - NOTICE - ldbm_back_search - Unindexed search: search base="o=tcs2,o=apps,dc=cesnet,dc=cz" scope=2 filter="(entryStatus=issued)" conn=115 op=1
Index is defined:
# dsconf -D "cn=Directory Manager" -w "$pswd" ldap://localhost backend index get TCS2_apps_cesnet_cz --attr entryStatus
dn: cn=entryStatus,cn=index,cn=TCS2_apps_cesnet_cz,cn=ldbm database,cn=plugins,cn=config
cn: entryStatus
nsIndexType: eq
nsIndexType: pres
nsSystemIndex: False
objectClass: top
objectClass: nsIndex
Database is freshly reindexed:
# dsconf -D "cn=Directory Manager" -w "$pswd" ldap://localhost backend index reindex TCS2_apps_cesnet_cz --attr entryStatus
Index task index_attrs_03042021_100813 completed successfully
Successfully reindexed database
# tail /var/log/dirsrv/slapd-cml3/errors
[04/Mar/2021:10:08:19.181006893 +0100] - INFO - bdb_db2index - TCS2_apps_cesnet_cz: Indexed 43000 entries (83%).
[04/Mar/2021:10:08:19.304566154 +0100] - INFO - bdb_db2index - TCS2_apps_cesnet_cz: Indexed 44000 entries (85%).
[04/Mar/2021:10:08:19.430861272 +0100] - INFO - bdb_db2index - TCS2_apps_cesnet_cz: Indexed 45000 entries (86%).
[04/Mar/2021:10:08:19.554529568 +0100] - INFO - bdb_db2index - TCS2_apps_cesnet_cz: Indexed 46000 entries (88%).
[04/Mar/2021:10:08:19.671814136 +0100] - INFO - bdb_db2index - TCS2_apps_cesnet_cz: Indexed 47000 entries (90%).
[04/Mar/2021:10:08:19.791473662 +0100] - INFO - bdb_db2index - TCS2_apps_cesnet_cz: Indexed 48000 entries (92%).
[04/Mar/2021:10:08:19.911157930 +0100] - INFO - bdb_db2index - TCS2_apps_cesnet_cz: Indexed 49000 entries (94%).
[04/Mar/2021:10:08:20.032595700 +0100] - INFO - bdb_db2index - TCS2_apps_cesnet_cz: Indexed 50000 entries (96%).
[04/Mar/2021:10:08:20.153813121 +0100] - INFO - bdb_db2index - TCS2_apps_cesnet_cz: Indexed 51000 entries (98%).
[04/Mar/2021:10:08:20.244942556 +0100] - INFO - bdb_db2index - TCS2_apps_cesnet_cz: Finished indexing.
But the server is still complaining:
# time ldapsearch -H ldap://localhost -x -b o=TCS2,o=apps,dc=cesnet,dc=cz '(entryStatus=issued)'
# extended LDIF
#
# LDAPv3
# base <o=TCS2,o=apps,dc=cesnet,dc=cz> with scope subtree
# filter: (entryStatus=issued)
# requesting: ALL
#
# search result
search: 2
result: 0 Success
# numResponses: 1
real 0m0.920s
user 0m0.014s
sys 0m0.001s
# tail /var/log/dirsrv/slapd-cml3/errors
...
[04/Mar/2021:10:08:20.153813121 +0100] - INFO - bdb_db2index - TCS2_apps_cesnet_cz: Indexed 51000 entries (98%).
[04/Mar/2021:10:08:20.244942556 +0100] - INFO - bdb_db2index - TCS2_apps_cesnet_cz: Finished indexing.
[04/Mar/2021:10:08:47.982170561 +0100] - NOTICE - ldbm_back_search - Unindexed search: search base="o=tcs2,o=apps,dc=cesnet,dc=cz" scope=2 filter="(entryStatus=issued)" conn=115 op=1
Some DB files were created during the reindex:
# ls -l /var/lib/dirsrv/slapd-cml3/db/TCS2_apps_cesnet_cz/
total 358212
-rw------- 1 dirsrv dirsrv 16384 Feb 22 10:08 aci.db
-rw------- 1 dirsrv dirsrv 2760704 Feb 22 10:09 ancestorid.db
-rw------- 1 dirsrv dirsrv 19668992 Mar 3 16:53 cn.db
-rw------- 1 dirsrv dirsrv 51 Mar 4 09:54 DBVERSION
-rw------- 1 dirsrv dirsrv 24576 Mar 3 16:53 dc.db
-rw------- 1 dirsrv dirsrv 13254656 Feb 22 10:09 entryrdn.db
-rw------- 1 dirsrv dirsrv 1114112 Mar 4 10:08 entryStatus.db
-rw------- 1 dirsrv dirsrv 16384 Feb 22 10:07 entryusn.db
-rw------- 1 dirsrv dirsrv 3063808 Mar 3 16:54 givenName.db
-rw------- 1 dirsrv dirsrv 285548544 Mar 4 10:09 id2entry.db
-rw------- 1 dirsrv dirsrv 16056320 Mar 3 16:54 mail.db
-rw------- 1 dirsrv dirsrv 16384 Feb 22 10:08 nscpEntryDN.db
-rw------- 1 dirsrv dirsrv 3891200 Feb 22 10:09 nsuniqueid.db
-rw------- 1 dirsrv dirsrv 24576 Feb 22 10:09 numsubordinates.db
-rw------- 1 dirsrv dirsrv 1466368 Feb 22 10:09 objectclass.db
-rw------- 1 dirsrv dirsrv 811008 Feb 22 10:09 parentid.db
-rw------- 1 dirsrv dirsrv 258048 Mar 4 10:00 replication_changelog.db
-rw------- 1 dirsrv dirsrv 3735552 Mar 3 16:55 sn.db
-rw------- 1 dirsrv dirsrv 335872 Mar 4 09:26 tcs2certificate.db
-rw------- 1 dirsrv dirsrv 32768 Mar 3 16:55 tcs2cesnetorgdn.db
-rw------- 1 dirsrv dirsrv 2826240 Mar 3 16:55 tcs2crtserialnumber.db
-rw------- 1 dirsrv dirsrv 3809280 Mar 3 16:55 tcs2crtsubject.db
-rw------- 1 dirsrv dirsrv 16384 Mar 3 16:55 tcs2idpentityid.db
-rw------- 1 dirsrv dirsrv 3276800 Mar 3 16:56 tcs2requesterdn.db
-rw------- 1 dirsrv dirsrv 393216 Mar 3 16:56 tcs2role.db
-rw------- 1 dirsrv dirsrv 3219456 Mar 3 16:56 telephoneNumber.db
-rw------- 1 dirsrv dirsrv 516096 Mar 3 16:56 uid.db
-rw------- 1 dirsrv dirsrv 647168 Mar 3 16:57 unstructuredname.db
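One way I can sanity-check that the index really contains keys is to dump it with dbscan (a sketch; the path comes from the listing above):

# dbscan -f /var/lib/dirsrv/slapd-cml3/db/TCS2_apps_cesnet_cz/entryStatus.db | head

If the =issued key shows up there, another thing to check is nsslapd-idlistscanlimit, since an ID list that exceeds that limit also makes a search count as unindexed.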
Any ideas how to fix the problem?
--
-----------------------
Jan Tomasek aka Semik
http://www.tomasek.cz/
3 years, 1 month
replication is failing
by Chris Patterson
We are using 389 DS, and directory server replication is failing. I am getting:
NSMMReplicationPlugin Unable to acquire replica for total update error 49 retrying
NSMMReplicationPlugin bind_and_check_pwp Replication bind with SIMPLE auth failed LDAP error 19 (constraint violation) (Exceed password retry limit)
This used to work until the 180-day password time frame elapsed on this new-ish server. I suspect it is the server-wide password policy that has caused this, but I am unsure how to fix it.
I checked the error log and it yielded:
NSMMReplicationPlugin bind_and_check_pwp successfully bound cn=replication manager,cn=config to consumer, but password is expiring on consumer in 100 seconds.
And once it expires:
NSMMReplicationPlugin bind_and_check_pwp replication bind with SIMPLE auth failed: LDAP error 49 (Invalid credentials) (password expired)
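From what I've read, a commonly suggested workaround (a sketch, assuming the default replication manager entry under cn=config) is to reset the account's password on the consumer and push its expiration far into the future:

dn: cn=replication manager,cn=config
changetype: modify
replace: userPassword
userPassword: <newpassword>
-
replace: passwordExpirationTime
passwordExpirationTime: 20380119031407Z

The replication agreement on the supplier would then need the same new credentials.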
Thanks in advance, Chris
3 years, 1 month