Dear flo,
Thanks for the helpful links.
To check whether replication works between the three FreeIPA servers, I
have successfully created three new users via the web interface, one on
each server:
+ On server 1 create a new user 1 and check they appear on servers 2 & 3 - yes
+ On server 2 create a new user 2 and check they appear on servers 1 & 3 - yes
+ On server 3 create a new user 3 and check they appear on servers 1 & 2 - yes
I then removed each user from a different server than the one where they
were created. This worked as expected.
However, I do appear to have one user whose information failed to
replicate correctly some time ago, causing an inconsistency that I am
unsure how best to fix.
On server 1, attempting to view this specific user via the web interface gives:
Operations Error
Some operations failed.
XXX: user not found
This user should be displayed as a preserved user entry, but no details
are shown other than the "user login".
On server 2 this user appears as an active user, with details, status
disabled.
On server 3 this user appears as a preserved user, with full details, and
with a note I added to the Class field a few days ago to see whether the
entry would then simply replicate. It did not.
Searching the logs on server 1 for the user in question:
./httpd/error_log:[Sun Sep 20 10:34:14.256333 2020] [wsgi:error] [pid 29637] [remote IP_ADDRESS:56930] ipa: INFO: [jsonserver_session] sm@OUR_DOMAIN: user_find(u'USER_XXX', sizelimit=0, version=u'2.215', pkey_only=True): SUCCESS
./httpd/error_log:[Sun Sep 20 10:34:28.721951 2020] [wsgi:error] [pid 29637] [remote IP_ADDRESS:56932] ipa: INFO: [jsonserver_session] sm@OUR_DOMAIN: user_find(u'USER_XXX', preserved=True, sizelimit=0, version=u'2.215', pkey_only=True): SUCCESS
./httpd/error_log:[Sun Sep 20 10:34:28.738772 2020] [wsgi:error] [pid 29636] [remote IP_ADDRESS:56932] ipa: INFO: sm@OUR_DOMAIN: batch: user_show(u'USER_XXX', no_members=True): NotFound
./httpd/error_log:[Sun Sep 20 10:34:28.738952 2020] [wsgi:error] [pid 29636] [remote IP_ADDRESS:56932] ipa: INFO: [jsonserver_session] sm@OUR_DOMAIN: batch(({u'params': ((u'USER_XXX',), {u'no_members': True}), u'method': u'user_show'},), version=u'2.215'): SUCCESS
This just confirms what was seen via the web interface.
Next I looked in /var/log/dirsrv/slapd-OUR-DOMAIN/errors and found lots
of these messages:
[20/Sep/2020:11:18:12.449932996 +0100] - ERR - NSMMReplicationPlugin - changelog program - repl_plugin_name_cl - agmt="cn=caToserver2" (server2:389): CSN 5bb473ef000000070000 not found, we aren't as up to date, or we purged
[20/Sep/2020:11:18:12.452051936 +0100] - ERR - NSMMReplicationPlugin - send_updates - agmt="cn=caToserver2" (server2:389): Data required to update replica has been purged from the changelog. If the error persists the replica must be reinitialized.
[20/Sep/2020:11:18:15.479226915 +0100] - ERR - agmt="cn=caToserver2" (server2:389) - clcache_load_buffer - Can't locate CSN 5bb473ef000000070000 in the changelog (DB rc=-30988). If replication stops, the consumer may need to be reinitialized.
[20/Sep/2020:11:18:15.481677959 +0100] - ERR - NSMMReplicationPlugin - changelog program - repl_plugin_name_cl - agmt="cn=caToserver2" (server2:389): CSN 5bb473ef000000070000 not found, we aren't as up to date, or we purged
[20/Sep/2020:11:18:15.483458741 +0100] - ERR - NSMMReplicationPlugin - send_updates - agmt="cn=caToserver2" (server2:389): Data required to update replica has been purged from the changelog. If the error persists the replica must be reinitialized.
Is that the only CSN reported as missing? It seems so, based on this
search, which filters out those messages and leaves only the following:
grep -v 5bb473ef000000070000 ./dirsrv/slapd-OUR-DOMAIN/errors | grep -v "If the error persists"
389-Directory/1.3.6.6 B2017.145.2031
server1.OUR-DOMAIN:636 (/etc/dirsrv/slapd-OUR-DOMAIN)
[18/Sep/2020:05:57:26.038958649 +0100] - ERR - NSMMReplicationPlugin - release_replica - agmt="cn=meToserver3" (server3:389): Unable to parse the response to the endReplication extended operation.
[18/Sep/2020:17:45:55.236862951 +0100] - ERR - NSMMReplicationPlugin - release_replica - agmt="cn=caToserver2" (server2:389): Unable to parse the response to the endReplication extended operation.
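The same check can be done the other way round, by counting which CSNs are reported missing for each agreement. The following is only a rough sketch: the regular expression is inferred from the sample log lines above and not from the 389 source, and the function name missing_csns is made up for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the two message shapes quoted above:
#   ... agmt="<name>" ...: CSN <20 hex digits> not found, ...
#   ... agmt="<name>" ... Can't locate CSN <20 hex digits> in the changelog ...
# This is an assumption about the log format, not an official specification.
CSN_RE = re.compile(
    r'agmt="(?P<agmt>[^"]+)".*?'
    r'(?:CSN (?P<a>[0-9a-f]{20}) not found|locate CSN (?P<b>[0-9a-f]{20}))'
)


def missing_csns(lines):
    """Count (agreement, CSN) pairs reported missing in the given log lines."""
    counts = Counter()
    for line in lines:
        m = CSN_RE.search(line)
        if m:
            counts[(m.group("agmt"), m.group("a") or m.group("b"))] += 1
    return counts
```

Run against the errors file, for example with `missing_csns(open("errors", errors="replace"))`, a result with a single key would support the conclusion that only one CSN is affected.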
Turning next to server 2, the dirsrv logs there contain lots of messages of the form:
[17/Sep/2020:19:33:24.075845464 +0100] - ERR - DSRetroclPlugin - delete_changerecord: could not delete change record 6773158 (rc: 32)
[17/Sep/2020:19:33:24.076296207 +0100] - ERR - DSRetroclPlugin - delete_changerecord: could not delete change record 6773159 (rc: 32)
[17/Sep/2020:19:33:24.077311010 +0100] - ERR - DSRetroclPlugin - delete_changerecord: could not delete change record 6773160 (rc: 32)
[17/Sep/2020:19:33:24.077830624 +0100] - ERR - DSRetroclPlugin - delete_changerecord: could not delete change record 6773161 (rc: 32)
[19/Sep/2020:19:59:28.091654453 +0100] - ERR - NSMMReplicationPlugin - release_replica - agmt="cn=meToserver1" (server1:389): Unable to parse the response to the endReplication extended operation.
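Since there are many of these retro changelog messages, a summary of how many there are and which range of change record numbers they cover may be easier to read than the raw log. The following is a minimal sketch, assuming every such line has the same shape as the samples above; the function name summarize_retrocl is hypothetical. As far as I know, rc 32 is the standard LDAP "no such object" result code, which would suggest the records were already gone when the plugin tried to trim them.

```python
import re

# Pattern inferred from the DSRetroclPlugin lines quoted above (an assumption).
RETROCL_RE = re.compile(r"could not delete change record (\d+) \(rc: (\d+)\)")


def summarize_retrocl(lines):
    """Map each rc to (count, lowest change number, highest change number)."""
    summary = {}
    for line in lines:
        m = RETROCL_RE.search(line)
        if not m:
            continue
        num, rc = int(m.group(1)), int(m.group(2))
        count, low, high = summary.get(rc, (0, num, num))
        summary[rc] = (count + 1, min(low, num), max(high, num))
    return summary
```

A single contiguous range with one rc would point to one trimming pass failing repeatedly, which is a different problem from the missing CSN seen on server 1.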
Finally, for server 3, the dirsrv logs appear to be more about getting
started after the certificate problems the other day than about the issues
seen on the other servers. This server does appear to display the user
fully, as would be desired everywhere.
Our old version of 389 appears to lack the dsconf command, so I have not
been able to use that to help.
Summary:
+ Replication appears to work for e.g. new users
+ I guess one entry has failed to be replicated correctly causing the
inconsistent state on the other two servers
+ This should ideally be corrected to avoid further issues?
+ I am uncertain about the best way to do this. Suggestions are very
welcome, as this is my first time fixing such problems and I don't want to
make it any worse; after all, replication seems to be working for
everything else.
Thanks
Best wishes
Stuart