Hey folks,
I thought I'd summarize where I'm at. Here are the issues I found
and what I did about them:
- We ran into an Ansible bug that is fixed by the PR
https://github.com/ansible/ansible/pull/50381. I've asked pingou to
patch batcave since the fix is basically a one-liner that will keep
working with the older prod version.
- When starting a RabbitMQ cluster from scratch, there is a race condition
that is documented here:
https://www.rabbitmq.com/cluster-formation.html#initial-formation-race-co...
On nodes 02 and 03, I just destroyed the database and let each node
auto-detect the cluster again:
    # systemctl stop rabbitmq-server && rm -rf /var/lib/rabbitmq/mnesia/ && systemctl start rabbitmq-server
It worked fine. I checked with "rabbitmqctl list_users" that all nodes
had the same users declared.
- I've also fixed a couple of things in the playbooks that assumed the
cluster was already up and set up.
- I've rebuilt collectd-rabbitmq for EPEL8, but apparently we currently
only install it on production (not sure why, I think it could be useful
in staging).
- The nagios-plugins-rabbitmq RPM still fails to install because of a
dependency bug in perl-Monitoring-Plugin. I've opened a ticket about it:
https://bugzilla.redhat.com/show_bug.cgi?id=1803121
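
In case it helps, the "rabbitmqctl list_users" check from the RabbitMQ item
above could be scripted roughly like this. It's only a sketch: the function
names (normalize_users, same_users) and the host names in the usage comment
are made up for illustration, and it assumes the output of each node has
already been saved to a file.

```shell
# Sketch: check that several saved `rabbitmqctl list_users` outputs agree.
# Hypothetical usage (host names are placeholders):
#   for h in node01 node02 node03; do
#       ssh "$h" rabbitmqctl list_users > "/tmp/users.$h"
#   done
#   same_users /tmp/users.node01 /tmp/users.node02 /tmp/users.node03

# Drop the "Listing users" banner and the "user tags" header if present,
# then sort so that ordering differences between nodes don't matter.
normalize_users() {
    grep -v -e '^Listing users' -e '^user[[:space:]]*tags' | sort
}

# Return 0 if every file holds the same normalized user list as the first.
same_users() {
    su_ref_file="$1"
    shift
    su_ref=$(normalize_users < "$su_ref_file")
    for su_file in "$@"; do
        su_cur=$(normalize_users < "$su_file")
        if [ "$su_ref" != "$su_cur" ]; then
            echo "user list differs: $su_ref_file vs $su_file" >&2
            return 1
        fi
    done
    return 0
}
```

A non-zero exit status would mean at least one node disagrees with the first
one, which is the case worth looking at by hand.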
Now we need to recreate the queues, users and bindings, and I don't have
the permissions to run all the playbooks. If someone could run the master
playbook limited to staging and to the rabbitmq_cluster tag, I think it
should recreate all users and queues and we should be all set.
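
For reference, the run I'm asking for would look roughly like the sketch
below. The playbook path is a placeholder (I'm assuming master.yml in the
current directory), and "staging" as the limit pattern is my assumption
about the inventory group name. Only the rabbitmq_cluster tag is taken
as-is from the playbooks. The snippet prints the command and only runs it
when RUN=1 is set.

```shell
# Sketch of the requested run. PLAYBOOK is a placeholder path and LIMIT is
# an assumed inventory pattern; adjust both to the real setup.
PLAYBOOK="${PLAYBOOK:-master.yml}"
LIMIT="staging"
TAGS="rabbitmq_cluster"

# Print the command that would be run.
rabbitmq_cluster_cmd() {
    echo ansible-playbook "$PLAYBOOK" --limit "$LIMIT" --tags "$TAGS"
}

rabbitmq_cluster_cmd

# Only execute when explicitly asked to, e.g. RUN=1 sh this_script.sh
if [ "${RUN:-0}" = "1" ]; then
    ansible-playbook "$PLAYBOOK" --limit "$LIMIT" --tags "$TAGS"
fi
```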
I'm around and on IRC if you need me.
Aurélien