On 1/15/24 13:26, Rob Crittenden wrote:

Harry G Coin via FreeIPA-users wrote:

Hi!   This is meant for the good future of freeipa, a package I've
appreciated for some years, so across the user cultures and languages
please understand it as supportive and not a complaint!

 For all freeipa's 'master-master' replica technology, there remain
'some instances more primary than others' even if the topology diagrams
claim equivalence.  Lose 'that one that's even more primary' and (absent
high-learning-curve, on-site capability, and intervention that calls for
high-bar mastery of seldom used subsystems) -- you're on a track to
breakage.  Why?  Because it's when, not if, that 'primary' system will
need a major OS release upgrade (8 to 9 in the present situation).  In
that case, there is as yet no 'just works' upgrade path.  With the
master replicas that are not the super special 'even more master than
the others' one, it's easy and 'it just works'... but 'for that one...'
freeipa is not ready for 'prime time'.

For example, should site admins 'just know' whether there is a current
kasp.db maintained in more than one place?  How many know about
ipa-crlgen-manage, or whether /etc/pki/pki-tomcat/ca/CS.cfg should or
shouldn't have ca.certStatusUpdateInterval=0, or have the command ipa
config-mod --ca-renewal-master-server at the top of their mind?  SID
range assignments?
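
To make the CS.cfg question concrete, here is a minimal Python sketch that reads one setting out of CS.cfg-style "key=value" text. The function name and the sample content are hypothetical, and the sketch only reports the value; it does not decide whether 0 is the right value for a given server.

```python
# Sketch: look up one setting in CS.cfg-style "key=value" text.
def read_cs_cfg_value(text, key):
    """Return the value stored for `key`, or None if the key is absent."""
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            return value.strip()
    return None


# Hypothetical sample content, not copied from a real server.
sample = "example.other.setting=1\nca.certStatusUpdateInterval=0\n"
interval = read_cs_cfg_value(sample, "ca.certStatusUpdateInterval")
print("ca.certStatusUpdateInterval =", interval)
```

On a real server the text would come from /etc/pki/pki-tomcat/ca/CS.cfg, read with sufficient privileges, and the result would have to be compared across all replicas by hand.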

Fundamentally, the fair question is:  Which freeipa subsystems that I
don't happen to have studied in dev-level detail have similar 'deep
gotchas that are obvious to the one who specializes in that, but opaque
to everyone else'?  Not even the freeipa devs who write the docs collect
all the steps in one place.  While there are 'characterizations of
worries', those come without steps; the advice doesn't say what steps
will work, just what won't ('don't leapp upgrade').

The way forward I think is fairly doable.  First is to have each 'dev
that's an expert in their thing' (dns, kra, etc. etc.) make sure all
'master' level replicas hold up-to-date copies of whatever 'special
files' might be necessary, even if they aren't 'the extra special
primary replica' and those files may never get used.

Second is an 'orchestration' command, to be run on a master-replica that
is 'the latest os', that will, 'all in one', do all the magic to become
'the extra special primary master' and take those options off 'the old
primary', even if it means installing trust/dns/etc subsystems extant on
the 'old master' but missing from the 'soon to be new primary master'.
An orchestration command that manages everything from moving which fqdn
is authoritative in SOA records, to magic tiny entries in CS.cfg files.
When that command is done, the 'old primary' becomes 'just another
master replica that happens to be using an older os'.  Then the 'old
primary' can be discarded and replaced with the latest os and a fresh
install as a master replica.  At that point, it's optional whether to
move the 'special primary' status 'back' to the 'now new OS master system'.
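
As a rough illustration of the smallest possible version of such a command, the Python sketch below only builds a dry-run plan and executes nothing. It covers just the two roles named earlier that have dedicated commands (the CA renewal master and CRL generation). Everything else mentioned above (DNSSEC keys, SID ranges, SOA records, CS.cfg entries, missing subsystems) is deliberately left out. The function name and the host name are hypothetical.

```python
# Sketch: build (but do not run) the commands for moving two "special" roles.
def plan_role_move(new_primary_fqdn):
    """Return (where_to_run, argv) pairs for a dry-run role move."""
    return [
        # Point CA certificate renewal at the new server.
        ("any server, as admin",
         ["ipa", "config-mod", "--ca-renewal-master-server", new_primary_fqdn]),
        # Stop CRL generation on the old primary before starting it elsewhere.
        ("old primary", ["ipa-crlgen-manage", "disable"]),
        ("new primary", ["ipa-crlgen-manage", "enable"]),
    ]


# Hypothetical host name used only for illustration.
for where, argv in plan_role_move("newprimary.example.test"):
    print(f"[{where}] {' '.join(argv)}")
```

A real orchestration command would also have to verify each step and handle the subsystems this sketch ignores, which is exactly the knowledge that is scattered today.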

The admin pain involved at present 'for that one system that's the extra
special primary' at os major release upgrade time sets too high an
education bar, obviously higher than even one freeipa-dev has, as the
docs prove, and as such needs a team approach to address, before OS 9
to 10 please!

There is a whole guide on considerations when doing a major RHEL release
migration. Have you seen it? Is it insufficient?

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9/html/migrating_to_identity_management_on_rhel_9

I suppose an Ansible role could be created to do this type of transition
and it would be a lot less error-prone.

Of all the things you mention the only really catastrophic one to miss
is setting the renewal master. This can lead to all the certs expiring
and general pain overall. The rest are either highly visible (no DNSSEC
keys) or more often not used at all (CRL).

ipa-healthcheck tries to verify a few of these but it tends to focus
only on a single system and not the entire cluster. So far anyway.
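
To show what a cluster-wide view could look like, as opposed to a per-system one, here is a small Python sketch. It assumes the set of 'special' roles held by each server has already been gathered somehow, and that step is not shown. The role labels and host names are hypothetical and are not ipa-healthcheck check names. The sketch flags every role that is not held by exactly one server.

```python
# Sketch: flag "special" roles that are not held by exactly one server.
def find_singleton_role_problems(roles_by_host, singleton_roles):
    """Map each problem role to the sorted list of hosts that hold it."""
    problems = {}
    for role in singleton_roles:
        holders = sorted(
            host for host, roles in roles_by_host.items() if role in roles
        )
        # Zero holders and multiple holders are both reported.
        if len(holders) != 1:
            problems[role] = holders
    return problems


# Hypothetical inventory; role labels are illustrative only.
roles_by_host = {
    "ipa1.example.test": {"ca_renewal_master", "crl_generator"},
    "ipa2.example.test": {"crl_generator"},
    "ipa3.example.test": set(),
}
singleton_roles = ["ca_renewal_master", "crl_generator", "dnssec_key_master"]
print(find_singleton_role_problems(roles_by_host, singleton_roles))
```

A role with no holder is only a problem if the feature is actually in use (DNSSEC, for instance), so an empty list here is a prompt to look, not a verdict.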

rob

Hi Rob

Yes, I did read it. You'll notice it has sections labelled 'optional' without guidance about whether the 'option' is appropriate (subsystem-specific deep knowledge required). There is a certain 'ethos' or 'outlook' in your suggestion that 'highly visible' errors and mis-steps are 'ok' because they are 'highly visible' (presuming you have the wit and skills to look for them, as the official guide gives no hint). Well, you might ask yourself the question: where does it give steps to test whether the 'highly visible' problems exist before the day comes when the underlying capability expires and suddenly it all 'just doesn't work'?

Ultimately, the fact that the Freeipa team had to write an extensive and, even at this time, incomplete guide to getting something advertised as 'master' to 'just work' across a major os release... shouldn't that suggest capturing that knowledge in an orchestration command is appropriate?

Notice the 'ceph' community also took a long while to reach the point where they decided ansible and other approaches were not going to 'cut it' and created their own 'orchestrator', which vaulted the effort from 'useful for those who don't need to know other things as well' to 'broadly deployable'.

I think what we are seeing here in Freeipa is an emerging reality common to all systems that aim to treat a collection in partnership as 'one thing'.