Hi

I have what seems to be a slightly odd problem. I think I've worked out what the solution might be, but not the root cause, so I wanted to run it by you all and see whether you agree or have any insight into it.

The background

I'm running 6 directory servers (version 4.5.0-21) on CentOS 7.4.1708, 3 of which have the CA role. The directory has been running blissfully uneventfully for 7ish months now. We have experimented a little with the CA features, but nothing that can't be done trivially with the web interface (on reflection, I'm sure it is probably trivial to revoke your primary certificate authority with the web interface, but you know what I mean).

The problem

Over the past few days I've tried several times to create a new replica, but on each attempt the process fails at around this point:

[4/4]: configuring ipa-custodia to start on boot
Done configuring ipa-custodia.
The ipa-replica-install command failed, exception: HTTPError: 404 Client Error: Not Found
404 Client Error: Not Found
The ipa-replica-install command failed. See /var/log/ipareplica-install.log for more information

Now, I've learned a fair amount over the past few days digging into this, like what ipa-custodia is, and how to poke it.

It seems that at this point the installer is still actively doing things: it appears to be generating some kind of NSS certificate/key store. That step fails because it apparently can't find the key for the entry "auditSigningCert cert-pki-ca". Specifically, in custodiainstance.__get_keys the call to cli.fetch_key fails for this nickname (but no others).

So, more digging, and I find that yes indeed, the private key appears to be missing from the cert database on one of the directory servers (specifically the "first" directory server).

I haven't quite joined the dots on how custodia is working here, but using the following command:
sudo certutil -L -d /etc/pki/pki-tomcat/alias
I can determine that on the first directory server the trust attributes for this cert are ",,P", whereas on the other two CA directory servers they are "u,u,uP", which confirms that the key is missing from this database on the first directory server.
I also note that the cert databases seem to diverge in other ways between the CA servers, which I find interesting.
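
For anyone who wants to repeat the comparison, here is a minimal sketch of how the trust attributes for one nickname can be pulled out of the certutil listing. The helper names trust_flags and has_key_flag are my own invention, and the assumption that a "u" in the attributes means the private key is present is my reading of the NSS behaviour rather than something I have verified in the source.

```shell
#!/usr/bin/env bash

# trust_flags NICKNAME
# Reads `certutil -L` output on stdin and prints the trust attributes
# (the last column) for the given nickname. Nicknames may contain spaces.
trust_flags() {
    awk -v nick="$1" '
        NF >= 2 {
            flags = $NF
            name = $0
            sub(/[ \t]+[^ \t]+[ \t]*$/, "", name)   # drop the last column
            sub(/^[ \t]+/, "", name)                # drop leading whitespace
            if (name == nick) print flags
        }'
}

# has_key_flag FLAGS
# Succeeds if the trust attributes contain "u" (assumed here to mean
# that the database holds the matching private key).
has_key_flag() {
    case "$1" in
        *u*) return 0 ;;
        *)   return 1 ;;
    esac
}

# Intended use on each CA server (same path as the command above):
#   sudo certutil -L -d /etc/pki/pki-tomcat/alias \
#       | trust_flags "auditSigningCert cert-pki-ca"
# The private keys can also be listed directly (prompts for the db password):
#   sudo certutil -K -d /etc/pki/pki-tomcat/alias
```

Running this on each of the three CA servers and comparing the output side by side would show the ",,P" versus "u,u,uP" difference without reading the full listings by eye.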

Anyway, my next action is to copy the cert databases to another machine and try to import the cert/key from a "good" CA db into the "bad" CA db using pk12util.

This gives me a segmentation fault.

So, I try with a new DB. I export all the cert/key pairs from the "bad" CA individually and import them into a new DB, replicating the trust attributes. So far so good. I also export the missing cert/key from a "good" CA and import that into the same new DB. Also apparently good.
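
Roughly, the rebuild looks like the sketch below. The directory names (./bad-alias, ./good-alias, ./new-alias), the audit.p12 file name and the helpers run and copy_pair are placeholders for illustration, not the exact commands I typed. It defaults to a dry run that only prints each command; the real pk12util and certutil calls prompt for the database and PKCS#12 passwords.

```shell
#!/usr/bin/env bash

# Print the command when DRY_RUN=1 (the default here), otherwise execute it.
# The printed form is a plain log line and is not shell-quoted.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# copy_pair SRC_DB DST_DB NICKNAME TRUST P12FILE
# Export one cert/key pair from SRC_DB, import it into DST_DB, then set
# the trust attributes on the imported cert to match the source.
copy_pair() {
    local src=$1 dst=$2 nick=$3 trust=$4 p12=$5
    run pk12util -o "$p12" -n "$nick" -d "$src"        # export cert + key
    run pk12util -i "$p12" -d "$dst"                   # import into the new db
    run certutil -M -n "$nick" -t "$trust" -d "$dst"   # replicate trust attributes
}

# Create the empty database (hypothetical directory name).
run certutil -N -d ./new-alias

# The pair that is missing on the "bad" server, taken from a "good" one.
copy_pair ./good-alias ./new-alias "auditSigningCert cert-pki-ca" "u,u,uP" audit.p12

# Every other nickname is copied from ./bad-alias in the same way, using the
# trust attributes that `certutil -L -d ./bad-alias` reports for it.
```

Setting DRY_RUN=0 makes the same script execute the commands instead of printing them.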

The solution?

So, at this point, I feel relatively confident that I have constructed a good DB and I should be able to perform some surgery to remove the old "bad" DB and replace it with this "good" DB.

My questions are:

1. Does this approach seem reasonable or am I oversimplifying?
2. If this is a reasonable approach, what's my best method for performing the surgery? ipactl stop, move the "bad" db directory out of the way, move the "good" db in, fix up the SELinux contexts, then ipactl start again?
3. How could this even happen in the first place? Is it a known issue?
4. Shouldn't the CA databases basically all look the same between servers created at the same time? Why might they diverge?
5. Do you have any other comments or questions which you feel might be pertinent?
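
To make question 2 concrete, this is roughly the sequence I have in mind. It is only a sketch and I have not run it: /root/new-alias and /root/alias.bad are hypothetical locations, run and swap_db are names I made up, and copying the owner and group from the old directory is my assumption about what the new one needs. It defaults to a dry run that just prints the steps.

```shell
#!/usr/bin/env bash

# Print the command when DRY_RUN=1 (the default here), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# swap_db LIVE_DIR NEW_DIR BACKUP_DIR
# Stop the services, park the old database, put the rebuilt one in its
# place, fix ownership and SELinux labels, then start everything again.
swap_db() {
    local live=$1 new=$2 backup=$3
    run ipactl stop
    run mv "$live" "$backup"                      # keep the old db for rollback
    run cp -a "$new" "$live"                      # install the rebuilt db
    run chown -R --reference="$backup" "$live"    # owner/group as on the old directory
    run restorecon -R -v "$live"                  # reset SELinux contexts from policy
    run ipactl start
}

swap_db /etc/pki/pki-tomcat/alias /root/new-alias /root/alias.bad
```

Rolling back would be the same sequence with the new and backup directories swapped.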

Thanks in advance for any input or insights shared.

Best Regards

Andy




--

Andrew Stubbs, PhD 
Head of Technical Operations 

treatwell.co.uk