On Wed, Oct 16, 2019 at 09:41:03AM -0700, Kevin Fenzi wrote:
On Wed, Oct 16, 2019 at 10:47:00AM +0200, Pierre-Yves Chibon wrote:
> Good Morning Everyone,
>
> This morning I found out that https://pagure.io/fedora-infrastructure was not
> available, it was throwing a 500 error on every page/call.
>
> I checked the logs and found:
> GitError: Error performing curl request: (60): Peer certificate cannot be
> authenticated with given CA certificates
>
> The combination of "GitError" and an SSL-related error led me to repoSpanner.
> So with the help of Patrick, we confirmed that the SSL cert for pagure01 was
> expiring on Oct 15th 2019.
> We then regenerated that SSL cert.
>
> We thought the repospanner playbook was going to redeploy that cert so I ran it,
> but it did not change anything (both in its run as well as in the symptoms
> observed).
>
> We then found out that this piece is actually part of the pagure.yml playbook,
> so I ran it with `-t repospanner/server` to limit its effect.
> Then I restarted httpd, stunnel and repospanner@ansible.service on pagure01.
> The first two were likely not necessary, the last one was to get the new cert in
> use.
>
> So I would like retroactive approval for my actions since the systems I've
> touched are frozen.
So a few things:
1) +1 to the actions... thanks for fixing that!
Thanks for the +1!
2) we need nagios monitoring those certs, or we need to just tear
down that cluster if we aren't going to use it (which we are currently
not).
3) We could also 'unrepospanner' that repo since we aren't using it
and put the old one back.
This may be wise, especially considering that I may not have fixed everything
(see the end of this email).
4) pagure perhaps should gracefully print 'sorry, the repo is not
available right now due to a repospanner problem' but otherwise work?
+1 for this, I'm not sure of the size of the work in there but worth looking
into.
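To make point 4 a bit more concrete, here is a minimal sketch of the idea. It is not pagure's actual code: `RepoBackendError`, `degrade_on_backend_error` and `view_repo` are all hypothetical names, and the exception class only stands in for the GitError seen in the logs.

```python
import functools
import logging

log = logging.getLogger(__name__)


class RepoBackendError(Exception):
    """Hypothetical stand-in for the GitError raised when repoSpanner is unreachable."""


def degrade_on_backend_error(view):
    """Wrap a view so a backend failure returns a notice instead of a 500."""

    @functools.wraps(view)
    def wrapper(*args, **kwargs):
        try:
            return view(*args, **kwargs)
        except RepoBackendError as err:
            # Keep the real cause in the logs, show the user a friendly message.
            log.warning("repo backend unavailable: %s", err)
            body = (
                "Sorry, the repo is not available right now "
                "due to a repospanner problem."
            )
            # 503 signals a temporary failure rather than a server bug.
            return body, 503

    return wrapper


@degrade_on_backend_error
def view_repo(name, fail=False):
    """Hypothetical view; `fail` simulates the certificate error."""
    if fail:
        raise RepoBackendError("Peer certificate cannot be authenticated")
    return f"contents of {name}", 200
```

Only the views that actually touch the git backend would need the wrapper, so the rest of the site would keep working while the repo is unavailable.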
Also: Patrick said that the cert needs to be updated in other places (nodes) as
well. I do not know whether running the repospanner playbook took care of that,
so we may still have something broken.
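Related to point 2 above, a Nagios-style expiry check for a certificate file could look roughly like the sketch below. The assumptions: the cert is a PEM file on disk, the `openssl` command-line tool is installed, and the 30/7 day thresholds are arbitrary. The function names are made up for illustration, and the cert path is whatever the node actually uses.

```python
import ssl
import subprocess
from datetime import datetime, timezone

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def parse_enddate(openssl_line):
    """Parse a line such as 'notAfter=Oct 15 12:00:00 2019 GMT' into a UTC datetime."""
    _, _, value = openssl_line.strip().partition("=")
    return datetime.fromtimestamp(ssl.cert_time_to_seconds(value), tz=timezone.utc)


def read_enddate(cert_path):
    """Ask openssl for the expiry date of the PEM certificate at cert_path."""
    result = subprocess.run(
        ["openssl", "x509", "-noout", "-enddate", "-in", cert_path],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_enddate(result.stdout)


def classify(not_after, now, warn_days=30, crit_days=7):
    """Map the remaining lifetime to a Nagios status code and message."""
    days_left = (not_after - now).total_seconds() / 86400
    if days_left < 0:
        return CRITICAL, f"CRITICAL: cert expired {-days_left:.0f} days ago"
    if days_left <= crit_days:
        return CRITICAL, f"CRITICAL: cert expires in {days_left:.0f} days"
    if days_left <= warn_days:
        return WARNING, f"WARNING: cert expires in {days_left:.0f} days"
    return OK, f"OK: cert valid for another {days_left:.0f} days"


def main(argv):
    """Entry point: argv[1] is the certificate path; returns the exit code."""
    if len(argv) != 2:
        print("UNKNOWN: usage: check_cert_expiry CERT_PATH")
        return UNKNOWN
    try:
        not_after = read_enddate(argv[1])
    except (OSError, subprocess.CalledProcessError, ValueError) as err:
        print(f"UNKNOWN: could not read certificate: {err}")
        return UNKNOWN
    code, message = classify(not_after, datetime.now(timezone.utc))
    print(message)
    return code
```

As a plugin, the script would end with `sys.exit(main(sys.argv))` (after `import sys`) and would have to run on every node that holds a copy of the cert.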
I received emails from pagure yesterday with:
"""
...
PagurePushDenied: Remote hook declined the push: Performing pre-check...
...
ERR Error syncing object out to enough nodes
"""
Which makes me think we are still missing some fix, but I don't know which one :(
Thanks,
Pierre