Hi Thierry,
On 11/04/2019 11:31, thierry bordaz via FreeIPA-users wrote:
Hi Giulio,
During the new IPA server installation (idc01), the server idc02 sends all its entries (total update), one after the other. The entries are sent idc02->idc01 over a SASL-encrypted connection. I suspect that one of the entries sent by idc02 is large (a static group?) and that its encrypted size exceeds the default limit set on idc01 (2 MB, i.e. 2097152 bytes). I think your solution is the right one.
If you have big static groups, do you know how large the biggest ones are?
Yes, I do have some big groups. In particular, one of them contains 17000+ users.
According to the logged error, it looks to me that the most important one to tune was nsslapd-maxsasliosize. Possibly the IPA installer could increase this value to handle large groups.
I agree. Although it's a 389-ds error, I think it should be fixed by the IPA developers at the "ipa-replica-install" level (maybe by predicting the maximum size of a SASL packet).
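A rough, back-of-the-envelope Python sketch of that prediction follows. It assumes member DNs shaped like uid=userNNNNN,cn=users,cn=accounts,dc=my,dc=dom,dc=ain, and the per_value_overhead and base_overhead numbers are guesses standing in for encoding and replication metadata, not measured 389-ds figures. The helpers estimate_group_entry_bytes and suggest_sasl_limit are hypothetical, not part of ipa-replica-install.

```python
# Rough size estimate for a static group entry (illustrative assumptions only).
DEFAULT_MAXSASLIOSIZE = 2097152  # 2 MiB: the original value reported in this thread


def estimate_group_entry_bytes(member_dns, per_value_overhead=32, base_overhead=4096):
    """Return an approximate size, in bytes, of a group entry.

    per_value_overhead: assumed bytes added per 'member' value (encoding,
    replication state, ...). base_overhead: assumed bytes for the rest of
    the entry. Both are guesses, not measured 389-ds figures.
    """
    return base_overhead + sum(
        len(dn.encode("utf-8")) + per_value_overhead for dn in member_dns
    )


def suggest_sasl_limit(estimated_bytes, headroom=2.0, floor=DEFAULT_MAXSASLIOSIZE):
    """Smallest floor * 2**k that covers estimated_bytes * headroom."""
    target = estimated_bytes * headroom
    limit = floor
    while limit < target:
        limit *= 2
    return limit


# Hypothetical group with 17000 members under the thread's example domain.
members = [
    f"uid=user{i:05d},cn=users,cn=accounts,dc=my,dc=dom,dc=ain" for i in range(17000)
]
size = estimate_group_entry_bytes(members)
print(size, suggest_sasl_limit(size))  # prints: 1466096 4194304
```

With these assumed overheads, a 17000-member group lands around 1.47 MB, below the 2097152-byte default on its own; the 2x headroom is what pushes the suggestion to 4194304. Real entries may be larger or smaller, so treat the result as a starting point rather than a guarantee.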
Thank you again, Giulio
best regards, thierry
On 4/11/19 10:36 AM, Giulio Casella wrote:
Hi Thierry, Rob, Flo,
Unfortunately I no longer have the failure logs (they got lost after a couple of reinstallations). Anyway, I'll try to reconstruct some information to help you investigate further. The behaviour was:
- The IPA replication started, quickly reaching "[28/41]: setting up initial replication".
- Near the end of replication, after about 20 seconds, the process aborted with the message: [ldap://idc02.my.dom.ain:389] reports: Update failed! Status: [Error (-11) connection error: Unknown connection error (-11) - Total update aborted]
idc02 is the working IPA/389-ds server.
On idc01 (the wannabe replica) I found (in the dirsrv error log):
(idc01:389): Received error -1 (Can't contact LDAP server): for total update operation
and somewhere else in the same file on idc01 a message similar to:
SASL encrypted packet length exceeds maximum allowed limit
- At the time of the crash I noticed (via a tcpdump session) some "TCP zero window" messages in the capture, sent by idc01 to idc02.
- After that, the 389-ds server on idc01 was up, but many other IPA parts were not (that's why I say the IPA replica setup crashed; no rollback was attempted). The working server was also up, but somehow "dirty", with some replica update vectors (RUVs) still pointing to idc01.
- The solution was to pass "--dirsrv-config-file=custom.ldif" to ipa-replica-install, with custom.ldif containing:
dn: cn=config
changetype: modify
replace: nsslapd-maxsasliosize
nsslapd-maxsasliosize: 4194304
-
replace: nsslapd-sasl-max-buffer-size
nsslapd-sasl-max-buffer-size: 4194304
(the original value was 2097152 for both configuration attributes).
This makes me think that the "TCP zero window" was only a consequence, not the cause. After this tweak everything worked like a charm.
A couple of considerations:
- I think you can reproduce the faulty behaviour by doing exactly the opposite of what I did, i.e. decreasing those two values. I don't know exactly by how much.
- Maybe ipa-replica-install should try to catch this situation, output something more explanatory, and possibly try to roll back.
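For context, the SASL security layer (RFC 4422) sends each protected buffer prefixed with a four-octet length field in network byte order, so a receiver can refuse an oversized frame from its header alone. The Python sketch below shows that check; max_size plays the role that nsslapd-maxsasliosize plays on the receiving 389-ds instance, but read_sasl_frame and SaslFrameTooLarge are hypothetical names, not 389-ds code, and no decryption is performed.

```python
import struct


class SaslFrameTooLarge(Exception):
    """Raised when a frame's declared length is above the configured maximum."""


def read_sasl_frame(buf, max_size):
    """Split one length-prefixed SASL frame off the front of `buf`.

    Returns (payload, rest), or None when `buf` does not yet contain a
    complete frame. The payload is still the wrapped (encrypted) data.
    """
    if len(buf) < 4:
        return None  # length header not complete yet
    (length,) = struct.unpack("!I", buf[:4])  # four octets, network byte order
    if length > max_size:
        # Rejected from the header alone, before the body is read.
        raise SaslFrameTooLarge(
            f"frame length {length} exceeds maximum allowed limit {max_size}"
        )
    if len(buf) < 4 + length:
        return None  # body not fully received yet
    return buf[4 : 4 + length], buf[4 + length :]
```

For example, read_sasl_frame(struct.pack("!I", 5) + b"hello" + b"tail", 2097152) returns (b"hello", b"tail"), while a header announcing 3000000 bytes is rejected under a 2097152-byte limit.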
I'm sorry I have no real logs to post, but I hope this helps anyway.
Thank you and regards, Giulio
On 10/04/2019 17:44, thierry bordaz wrote:
On 4/10/19 4:59 PM, Rob Crittenden wrote:
Giulio Casella via FreeIPA-users wrote:
Hi, I managed to fix it! The solution was to increase a couple of parameters in the LDAP config. I passed "--dirsrv-config-file=custom.ldif" to ipa-replica-install, with custom.ldif containing:
dn: cn=config
changetype: modify
replace: nsslapd-maxsasliosize
nsslapd-maxsasliosize: 4194304
-
replace: nsslapd-sasl-max-buffer-size
nsslapd-sasl-max-buffer-size: 4194304
In brief, I doubled the SASL buffer sizes because I noticed a log message saying "SASL encrypted packet length exceeds maximum allowed limit".
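The same modification can be generated for any size. A minimal Python sketch, assuming the two cn=config attributes used above; sasl_limits_ldif is a hypothetical helper name. The "-" line is the LDIF separator between the two replace operations of a single modify record.

```python
def sasl_limits_ldif(size):
    """Build an LDIF modify record setting both SASL size limits on cn=config."""
    if not isinstance(size, int) or size <= 0:
        raise ValueError("size must be a positive integer")
    return "\n".join([
        "dn: cn=config",
        "changetype: modify",
        "replace: nsslapd-maxsasliosize",
        f"nsslapd-maxsasliosize: {size}",
        "-",  # ends the first replace before starting the second
        "replace: nsslapd-sasl-max-buffer-size",
        f"nsslapd-sasl-max-buffer-size: {size}",
        "",  # trailing newline
    ])
```

For example, writing sasl_limits_ldif(2 * 2097152) to custom.ldif reproduces the doubled values passed via --dirsrv-config-file in this thread.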
But the behaviour of ipa-replica-install was quite strange: it crashed, and in a packet capture session I noticed some "TCP zero window" packets sent from the wannabe replica to the existing IPA server. Maybe the developers want to try to catch that error and revert the operation, just as is done with other kinds of errors.
Maybe one of the 389-ds devs has an idea. They're probably going to want to see logs and what your definition of "crash" is.
rob
A TCP zero window makes me think of a client not reading fast enough. Is it transient/recoverable or not?
Rob is right: if a problem is detected at the 389-ds level, the access/errors logs would be appreciated, as well as the ipa-replica-install backtrace from when it crashed.
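To illustrate what a zero window means, the Python sketch below opens a loopback TCP connection whose receiving end never calls recv(). The receiver's buffer fills, its advertised window closes, and the sender eventually cannot queue more data. This only shows the flow-control mechanism; it says nothing about why idc01 stopped reading, bytes_queued_before_block is a hypothetical name, and the exact byte count depends on the operating system's socket buffer settings.

```python
import socket


def bytes_queued_before_block(chunk=65536, limit=64 * 1024 * 1024):
    """Return how many bytes a sender can push to a peer that never reads.

    The receiving socket is accepted but never read from, so its receive
    buffer fills and its advertised window shrinks toward zero; once the
    sender's own buffer is full too, a blocking send() would stall.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    sender = socket.create_connection(server.getsockname())
    receiver, _ = server.accept()  # the "slow reader": recv() is never called
    sender.setblocking(False)
    payload = b"x" * chunk
    sent = 0
    try:
        while sent < limit:
            try:
                sent += sender.send(payload)
            except BlockingIOError:
                break  # a blocking sender would now be stuck waiting
    finally:
        sender.close()
        receiver.close()
        server.close()
    return sent


if __name__ == "__main__":
    print(bytes_queued_before_block(), "bytes queued before the sender stalled")
```

If the reader later drains its socket, the window reopens and the transfer resumes, which is why a zero window by itself is usually transient.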
regards thierry
Ciao, g
On 01/04/2019 15:28, Giulio Casella via FreeIPA-users wrote:
Hi, I'm still stuck on this. I tried to delete every reference to the old server, both with ipa commands ("ipa-replica-manage clean-ruv") and directly in LDAP (as described in https://access.redhat.com/solutions/136993).
If I try to "ipa-replica-manage list-ruv" on idc02 I get:
Replica Update Vectors:
    idc02.my.dom.ain:389: 5
Certificate Server Replica Update Vectors:
    idc02.my.dom.ain:389: 91
(same result looking directly into ldap)
Is this correct? Does a server have a replica reference to itself?
I also tried to set up a new server, idc03.my.dom.ain, never known before (fresh CentOS install, ipa-client-install, ipa-replica-install). Surprisingly to me, the setup failed (details below).
At this point I suspect the problem is on idc02 (the only working server), unrelated to the previous server idc01.
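Regarding the question above: a server's RUV normally contains an element for its own replica ID, so idc02 listing itself (ID 5) is expected; the elements to worry about are those pointing at hosts that no longer exist. The raw values live in the nsds50ruv attribute, in the form "{replica <id> <url>} <min CSN> <max CSN>". The Python sketch below parses such values and lists replica IDs whose host is not in a given set of live servers, i.e. candidates for ipa-replica-manage clean-ruv. The sample values and CSNs are made up, and parse_ruv and stale_replica_ids are hypothetical helpers.

```python
import re
from urllib.parse import urlsplit

# One nsds50ruv element: "{replica <rid> <url>}" optionally followed by min/max CSNs.
RUV_ELEMENT = re.compile(r"^\{replica (\d+) (ldaps?://[^}\s]+)\}(?:\s+(\S+)\s+(\S+))?\s*$")


def parse_ruv(values):
    """Map replica ID -> replica URL, skipping the {replicageneration} value."""
    replicas = {}
    for value in values:
        match = RUV_ELEMENT.match(value.strip())
        if match:
            replicas[int(match.group(1))] = match.group(2)
    return replicas


def stale_replica_ids(values, live_hosts):
    """Replica IDs whose host is not in `live_hosts` (clean-ruv candidates)."""
    live = {host.lower() for host in live_hosts}
    return sorted(
        rid for rid, url in parse_ruv(values).items()
        if urlsplit(url).hostname not in live
    )


# Hypothetical nsds50ruv values; the CSNs are invented for the example.
ruv = [
    "{replicageneration} 5c8f6a11000000040000",
    "{replica 5 ldap://idc02.my.dom.ain:389} 5c8f6a12000000050000 5c9c9a7e000000050000",
    "{replica 4 ldap://idc01.my.dom.ain:389} 5c8f6a11000000040000 5c91e0b2000200040000",
]
print(stale_replica_ids(ruv, ["idc02.my.dom.ain"]))  # prints [4]
```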
For completeness this is what I did:
- Fresh install of a CentOS 7 box, updated, ipa software installed (name idc03.my.dom.ain)
- ipa-client-install --principal admin --domain=my.dom.ain --realm=MY.DOM.AIN --force-join
- ipa-replica-install --setup-dns --no-forwarders --setup-ca
The last command failed (in "[28/41]: setting up initial replication"), and in /var/log/ipareplica-install.log on idc03 I read:
[...]
2019-03-28T09:30:48Z DEBUG [28/41]: setting up initial replication
2019-03-28T09:30:48Z DEBUG retrieving schema for SchemaCache url=ldapi://%2fvar%2frun%2fslapd-MY-DOM-AIN.socket conn=<ldap.ldapobject.SimpleLDAPObject instance at 0x7fb72af73050>
2019-03-28T09:30:48Z DEBUG Destroyed connection context.ldap2_140424739228880
2019-03-28T09:30:48Z DEBUG Starting external process
2019-03-28T09:30:48Z DEBUG args=/bin/systemctl --system daemon-reload
2019-03-28T09:30:48Z DEBUG Process finished, return code=0
2019-03-28T09:30:48Z DEBUG stdout=
2019-03-28T09:30:48Z DEBUG stderr=
2019-03-28T09:30:48Z DEBUG Starting external process
2019-03-28T09:30:48Z DEBUG args=/bin/systemctl restart dirsrv@MY-DOM-AIN.service
2019-03-28T09:30:54Z DEBUG Process finished, return code=0
2019-03-28T09:30:54Z DEBUG stdout=
2019-03-28T09:30:54Z DEBUG stderr=
2019-03-28T09:30:54Z DEBUG Restart of dirsrv@MY-DOM-AIN.service complete
2019-03-28T09:30:54Z DEBUG Created connection context.ldap2_140424739228880
2019-03-28T09:30:55Z DEBUG Fetching nsDS5ReplicaId from master [attempt 1/5]
2019-03-28T09:30:55Z DEBUG retrieving schema for SchemaCache url=ldap://idc02.my.dom.ain:389 conn=<ldap.ldapobject.SimpleLDAPObject instance at 0x7fb72bf8e128>
2019-03-28T09:30:55Z DEBUG Successfully updated nsDS5ReplicaId.
2019-03-28T09:30:55Z DEBUG Add or update replica config cn=replica,cn=dc=my,dc=dom,dc=ain,cn=mapping tree,cn=config
2019-03-28T09:30:55Z DEBUG Added replica config cn=replica,cn=dc=my,dc=dom,dc=ain,cn=mapping tree,cn=config
2019-03-28T09:30:55Z DEBUG Add or update replica config cn=replica,cn=dc=my,dc=dom,dc=ain,cn=mapping tree,cn=config
2019-03-28T09:30:55Z DEBUG No update to cn=replica,cn=dc=my,dc=dom,dc=ain,cn=mapping tree,cn=config necessary
2019-03-28T09:30:55Z DEBUG Waiting for replication (ldap://idc02.my.dom.ain:389) cn=meToidc03.my.dom.ain,cn=replica,cn=dc=my,dc=dom,dc=ain,cn=mapping tree,cn=config (objectclass=*)
2019-03-28T09:30:55Z DEBUG Entry found [LDAPEntry(ipapython.dn.DN('cn=meToidc03.my.dom.ain,cn=replica,cn=dc=my,dc=dom,dc=ain,cn=mapping tree,cn=config'), {u'nsds5replicaLastInitStart': ['19700101000000Z'], u'nsds5replicaUpdateInProgress': ['FALSE'], u'cn': ['meToidc03.my.dom.ain'], u'objectClass': ['nsds5replicationagreement', 'top'], u'nsds5replicaLastUpdateEnd': ['19700101000000Z'], u'nsDS5ReplicaRoot': ['dc=my,dc=dom,dc=ain'], u'nsDS5ReplicaHost': ['idc03.my.dom.ain'], u'nsds5replicaLastUpdateStatus': ['Error (0) No replication sessions started since server startup'], u'nsDS5ReplicaBindMethod': ['SASL/GSSAPI'], u'nsds5ReplicaStripAttrs': ['modifiersName modifyTimestamp internalModifiersName internalModifyTimestamp'], u'nsds5replicaLastUpdateStart': ['19700101000000Z'], u'nsDS5ReplicaPort': ['389'], u'nsDS5ReplicaTransportInfo': ['LDAP'], u'description': ['me to idc03.my.dom.ain'], u'nsds5replicareapactive': ['0'], u'nsds5replicaChangesSentSinceStartup': [''], u'nsds5replicaTimeout': ['120'], u'nsDS5ReplicatedAttributeList': ['(objectclass=*) $ EXCLUDE memberof idnssoaserial entryusn krblastsuccessfulauth krblastfailedauth krbloginfailedcount'], u'nsds5replicaLastInitEnd': ['19700101000000Z'], u'nsDS5ReplicatedAttributeListTotal': ['(objectclass=*) $ EXCLUDE entryusn krblastsuccessfulauth krblastfailedauth krbloginfailedcount']})]
2019-03-28T09:30:55Z DEBUG Entry found [LDAPEntry(ipapython.dn.DN('cn=meToidc02.my.dom.ain,cn=replica,cn=dc=my,dc=dom,dc=ain,cn=mapping tree,cn=config'), {u'nsds5replicaLastInitStart': ['19700101000000Z'], u'nsds5replicaUpdateInProgress': ['FALSE'], u'cn': ['meToidc02.my.dom.ain'], u'objectClass': ['nsds5replicationagreement', 'top'], u'nsds5replicaLastUpdateEnd': ['19700101000000Z'], u'nsDS5ReplicaRoot': ['dc=my,dc=dom,dc=ain'], u'nsDS5ReplicaHost': ['idc02.my.dom.ain'], u'nsds5replicaLastUpdateStatus': ['Error (0) No replication sessions started since server startup'], u'nsDS5ReplicaBindMethod': ['SASL/GSSAPI'], u'nsds5ReplicaStripAttrs': ['modifiersName modifyTimestamp internalModifiersName internalModifyTimestamp'], u'nsds5replicaLastUpdateStart': ['19700101000000Z'], u'nsDS5ReplicaPort': ['389'], u'nsDS5ReplicaTransportInfo': ['LDAP'], u'description': ['me to idc02.my.dom.ain'], u'nsds5replicareapactive': ['0'], u'nsds5replicaChangesSentSinceStartup': [''], u'nsds5replicaTimeout': ['120'], u'nsDS5ReplicatedAttributeList': ['(objectclass=*) $ EXCLUDE memberof idnssoaserial entryusn krblastsuccessfulauth krblastfailedauth krbloginfailedcount'], u'nsds5replicaLastInitEnd': ['19700101000000Z'], u'nsDS5ReplicatedAttributeListTotal': ['(objectclass=*) $ EXCLUDE entryusn krblastsuccessfulauth krblastfailedauth krbloginfailedcount']})]
2019-03-28T09:31:15Z DEBUG Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ipaserver/install/service.py", line 570, in start_creation
    run_step(full_msg, method)
  File "/usr/lib/python2.7/site-packages/ipaserver/install/service.py", line 560, in run_step
    method()
  File "/usr/lib/python2.7/site-packages/ipaserver/install/dsinstance.py", line 456, in __setup_replica
    cacert=self.ca_file
  File "/usr/lib/python2.7/site-packages/ipaserver/install/replication.py", line 1817, in setup_promote_replication
    raise RuntimeError("Failed to start replication")
RuntimeError: Failed to start replication
[...]
while in /var/log/dirsrv/slapd-MY-DOM-AIN/errors on idc02 I found:
[...]
[28/Mar/2019:10:30:56.602197981 +0100] - INFO - NSMMReplicationPlugin - repl5_tot_run - Beginning total update of replica "agmt="cn=meToidc03.my.dom.ain" (idc03:389)".
[28/Mar/2019:10:31:15.787867217 +0100] - ERR - NSMMReplicationPlugin - repl5_tot_log_operation_failure - agmt="cn=meToidc03.my.dom.ain" (idc03:389): Received error -1 (Can't contact LDAP server): for total update operation
[28/Mar/2019:10:31:15.789885458 +0100] - ERR - NSMMReplicationPlugin - release_replica - agmt="cn=meToidc03.my.dom.ain" (idc03:389): Unable to send endReplication extended operation (Can't contact LDAP server)
[28/Mar/2019:10:31:15.791374133 +0100] - ERR - NSMMReplicationPlugin - repl5_tot_run - Total update failed for replica "agmt="cn=meToidc03.my.dom.ain" (idc03:389)", error (-11)
[28/Mar/2019:10:31:15.823809612 +0100] - INFO - NSMMReplicationPlugin - bind_and_check_pwp - agmt="cn=meToidc03.my.dom.ain" (idc03:389): Replication bind with GSSAPI auth resumed
[28/Mar/2019:10:31:16.221049084 +0100] - WARN - NSMMReplicationPlugin - repl5_inc_run - agmt="cn=meToidc03.my.dom.ain" (idc03:389): The remote replica has a different database generation ID than the local database. You may have to reinitialize the remote replica, or the local replica.
[28/Mar/2019:10:31:19.234198978 +0100] - WARN - NSMMReplicationPlugin - repl5_inc_run - agmt="cn=meToidc03.my.dom.ain" (idc03:389): The remote replica has a different database generation ID than the local database. You may have to reinitialize the remote replica, or the local replica.
[28/Mar/2019:10:31:22.247206811 +0100] - WARN - NSMMReplicationPlugin - repl5_inc_run - agmt="cn=meToidc03.my.dom.ain" (idc03:389): The remote replica has a different database generation ID than the local database. You may have to reinitialize the remote replica, or the local replica.
The last message keeps repeating until I uninstall the replica on idc03.
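When repeated messages like these flood the errors log, a small filter can surface the first occurrence of each known symptom. A minimal Python sketch: the patterns are copied from the log lines quoted in this thread, the interpretations follow the discussion here, and triage is a hypothetical helper, not a 389-ds tool.

```python
import re

# Hypothetical triage table: each needle is copied from a log line quoted in
# this thread, paired with the interpretation discussed in the thread.
HINTS = [
    ("SASL encrypted packet length exceeds maximum allowed limit",
     "incoming SASL packet above nsslapd-maxsasliosize; consider raising it"),
    ("Total update failed for replica",
     "total update (replica initialization) aborted"),
    ("different database generation ID",
     "consumer was not (re)initialized from this supplier"),
]
TIMESTAMP = re.compile(r"^\[([^\]]+)\]")


def triage(lines):
    """Return (timestamp, hint) pairs, reporting each hint once at first occurrence."""
    seen = set()
    findings = []
    for line in lines:
        for needle, hint in HINTS:
            if needle in line and hint not in seen:
                seen.add(hint)
                stamp = TIMESTAMP.match(line)
                findings.append((stamp.group(1) if stamp else None, hint))
    return findings
```

It accepts any iterable of lines, so an open file such as /var/log/dirsrv/slapd-MY-DOM-AIN/errors can be passed directly.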
How can I restore a scenario with a redundant setup (more than one ipa server)?
Thanks in advance, Giulio Casella
On 26/03/2019 11:08, Giulio Casella via FreeIPA-users wrote:
> Hi Flo,
>
> On 26/03/2019 09:45, Florence Blanc-Renaud via FreeIPA-users wrote:
>> On 3/20/19 9:32 AM, Giulio Casella via FreeIPA-users wrote:
>>> Hi everyone,
>>> I'm stuck with a broken replica. I had a setup with two IPA servers in
>>> replica (ipa-server-4.6.4 on CentOS 7.6), let's say "idc01" and "idc02".
>>>
>>> Due to heavy load idc01 crashed many times, and stopped working.
>>>
>>> So I tried to redo the replica. At first I tried
>>> "ipa-replica-manage re-initialize", with no success.
>>>
>>> Now I'm trying to redo the replica setup from scratch: on idc02 I
>>> removed the segments (ipa topologysegment-del, for both the ca and domain
>>> suffixes), on idc01 I removed everything (ipa-server-install --uninstall),
>>> then I joined the domain (ipa-client-install), and everything has been
>>> working so far.
>>>
>>> When running "ipa-replica-install" on idc01 I get:
>>>
>>> [...]
>>> [28/41]: setting up initial replication
>>> Starting replication, please wait until this has completed.
>>> Update in progress, 22 seconds elapsed
>>> [ldap://idc02.my.dom.ain:389] reports: Update failed! Status: [Error (-11) connection error: Unknown connection error (-11) - Total update aborted]
>>>
>>>
>>> And on idc02 (the working server), in
>>> /var/log/dirsrv/slapd-MY-DOM-AIN/errors I find lines stating:
>>>
>>> [20/Mar/2019:09:28:06.545187923 +0100] - INFO - NSMMReplicationPlugin - repl5_tot_run - Beginning total update of replica "agmt="cn=meToidc01.my.dom.ain" (idc01:389)".
>>> [20/Mar/2019:09:28:26.528046160 +0100] - ERR - NSMMReplicationPlugin - perform_operation - agmt="cn=meToidc01.my.dom.ain" (idc01:389): Failed to send extended operation: LDAP error -1 (Can't contact LDAP server)
>>> [20/Mar/2019:09:28:26.530763939 +0100] - ERR - NSMMReplicationPlugin - repl5_tot_log_operation_failure - agmt="cn=meToidc01.my.dom.ain" (idc01:389): Received error -1 (Can't contact LDAP server): for total update operation
>>> [20/Mar/2019:09:28:26.532678072 +0100] - ERR - NSMMReplicationPlugin - release_replica - agmt="cn=meToidc01.my.dom.ain" (idc01:389): Unable to send endReplication extended operation (Can't contact LDAP server)
>>> [20/Mar/2019:09:28:26.534307539 +0100] - ERR - NSMMReplicationPlugin - repl5_tot_run - Total update failed for replica "agmt="cn=meToidc01.my.dom.ain" (idc01:389)", error (-11)
>>> [20/Mar/2019:09:28:26.561763168 +0100] - INFO - NSMMReplicationPlugin - bind_and_check_pwp - agmt="cn=meToidc01.my.dom.ain" (idc01:389): Replication bind with GSSAPI auth resumed
>>> [20/Mar/2019:09:28:26.582389258 +0100] - WARN - NSMMReplicationPlugin - repl5_inc_run - agmt="cn=meToidc01.my.dom.ain" (idc01:389): The remote replica has a different database generation ID than the local database. You may have to reinitialize the remote replica, or the local replica.
>>>
>>>
>>> It seems that idc02 remembers something about the old replica.
>>>
>>> Any hint?
>>>
>> Hi,
>>
>> In order to clean every reference to the old replica:
>> (on idc01)
>> $ ipa-server-install --uninstall -U
>> $ kdestroy -A
>>
>> (on idc02)
>> $ ipa-replica-manage del idc01.my.dom.ain --clean --force
>>
>> Then you should be able to reinstall idc01 as a replica.
> No way, same result: it hangs in "[28/41]: setting up initial
> replication", after about 20 secs.
> I also tried, on idc02, to clean all RUVs referring to idc01, with no
> luck.
> _______________________________________________
> FreeIPA-users mailing list -- freeipa-users@lists.fedorahosted.org
> To unsubscribe send an email to freeipa-users-leave@lists.fedorahosted.org
> Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html
> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
> List Archives: https://lists.fedorahosted.org/archives/list/freeipa-users@lists.fedorahoste...