As you may know, database backups on db-koji01 are currently causing
very heavy load and disrupting our users' builds, so they are currently
disabled.
However, not having current backups is not a good thing, IMHO.
So, I am considering adding a db-koji02 vm (also rhel7, running the
same postgres version as db-koji01), enabling streaming replication
from db-koji01 -> db-koji02, and then, once that's working, running the
database backups on db-koji02.
It turns out this doesn't require that many changes on db-koji01:
* Adding a replication user (see the sketch after this list)
* Setting 3 new lines of postgresql config and restarting:
wal_level = 'hot_standby'
max_wal_senders = 10
wal_keep_segments = 100
(May need to adjust senders and keep segments)
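A rough sketch of what those db-koji01 changes could look like (the role
name, password and network below are made up for illustration):

  # create a user that is allowed to stream WAL
  $ sudo -u postgres psql -c "CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'changeme';"

  # pg_hba.conf: allow that user to connect for replication from db-koji02
  # (the CIDR is a placeholder for the real network)
  host  replication  replicator  192.0.2.0/24  md5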
All the other changes are on db-koji02 (rough sketch after this list):
* create/set up the vm
* run pg_basebackup to pull all the current data from 01
* set up the postgresql.conf and recovery.conf files
* start the server and confirm it keeps up with 01
* run pg_dump there and confirm it still keeps up with 01
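And a rough sketch of the db-koji02 side (hostnames, the data directory and
the password are illustrative only):

  # pull a copy of the current data from db-koji01
  $ sudo -u postgres pg_basebackup -h db-koji01 -U replicator -D /var/lib/pgsql/data -X stream -P

  # recovery.conf on db-koji02, so it starts as a streaming standby
  standby_mode = 'on'
  primary_conninfo = 'host=db-koji01 user=replicator password=changeme'

  # then, on db-koji01, confirm the standby is connected and keeping up
  $ sudo -u postgres psql -c "SELECT client_addr, state FROM pg_stat_replication;"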
This is, of course, a really big change to a critical service during a
freeze, so I'd like to get thoughts from others about it.
Should we wait until after the freeze and do without backups until then?
(Note that we have never had to restore this db from backups in the
past, although we have dumped/restored it to move to newer postgres
versions.)
Is there something else easier we can do to mitigate the issues?
Thoughts? Ideas? Rotten fruit?
Good Morning Everyone,
This morning I found out that https://pagure.io/fedora-infrastructure was not
available; it was throwing a 500 error on every page/call.
I checked the logs and found:
GitError: Error performing curl request: (60): Peer certificate cannot be
authenticated with given CA certificates
The combination of "GitError" and an SSL-related error led me to repoSpanner.
So with the help of Patrick, we confirmed that the SSL cert for pagure01 was
expiring on Oct 15th 2019.
We then regenerated that SSL cert.
We thought the repospanner playbook was going to redeploy that cert, so I ran
it, but it did not change anything (neither in its run nor in the symptoms).
We then found out that this piece is actually part of the pagure.yml playbook,
so I ran it with `-t repospanner/server` to limit its effect.
Then I restarted httpd, stunnel and repospanner@ansible.service on pagure01.
The first two were likely not necessary; the last one was to get the new cert
in use.
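To sketch the kind of commands involved (the cert path and the exact
playbook invocation below are illustrative assumptions, not copied verbatim
from the run):

  # check when the deployed cert expires
  $ openssl x509 -noout -enddate -in /etc/pki/tls/certs/pagure01.pem

  # re-run only the repoSpanner server parts of the pagure playbook
  $ ansible-playbook playbooks/groups/pagure.yml -t repospanner/server

  # restart the affected services on pagure01
  $ systemctl restart httpd stunnel repospanner@ansible.service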
So I would like retro-active approval for my actions since the systems I've
touched are frozen.
This is a freeze break request to enable the new mirrorlist server on
proxy14 as discussed on the mailing list.
I hope my conditionals are correct for the Ansible and Jinja2 files.
If this freeze break request gets accepted, someone needs to run the
playbook against proxy14.
Before running the playbook, proxy14 should be removed from DNS to make
sure that the old mirrorlist containers are correctly stopped and
deleted and that the new mirrorlist containers are correctly running.
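If it helps, a run limited to that one host could look roughly like this
(the playbook name is an assumption; use whichever playbook actually covers
the proxies):

  $ ansible-playbook playbooks/groups/proxies.yml -l 'proxy14*'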
Adrian Reber (1):
Enable new mirrorlist server on proxy14
roles/mirrormanager/backend/files/backend.cron | 8 +++++---
.../backend/templates/sync_pkl_to_mirrorlists.sh | 2 +-
roles/mirrormanager/mirrorlist_proxy/tasks/main.yml | 2 +-
.../mirrorlist_proxy/templates/mirrorlist.service.j2 | 4 ++--
4 files changed, 9 insertions(+), 7 deletions(-)
You are kindly invited to the meeting:
Fedora Infrastructure on 2019-10-17 from 15:00:00 to 16:00:00 UTC
The meeting will be about:
Weekly Fedora Infrastructure meeting. See infrastructure list for agenda a day before.
Hey, folks. Requesting a freeze break for this PR (as it applies to F31).
In F31 'dnf-yum' is no more and 'yum' obsoletes it, but this was not
changed in comps. As a result, clean installs of F31 (and Rawhide) have
no 'yum' command, even though we clearly intend that they should.
This isn't likely to make any images go oversize as all the 'yum'
package contains is a symlink linking /usr/bin/yum to /usr/bin/dnf-3
and a manpage; /usr/bin/dnf-3 is part of python3-dnf which would
already be in all the images.
Fedora QA Community Monkey
IRC: adamw | Twitter: AdamW_Fedora | XMPP: adamw AT happyassassin . net
I spent the last few weeks studying repoSpanner with the goal of
developing a plan to improve its performance. I started by testing its
performance on a few common git operations with a couple of repos (our
Infrastructure Ansible repository, since it is on the large side, and
Bodhi, since I had it cloned already and it is perhaps a "typical"
medium-sized project). I wrote an initial report about those tests here.
Since the time of that report, I have done some performance profiling
on the git push for the Bodhi repository, since that was by far the
slowest operation that I tested.
I found that the most significant amount of time was spent interacting
with sqlite, which is used today by repoSpanner as a task queue. There are
two different workflows. The first is that it creates a table per
repoSpanner node, and each row of the table represents a git object ID
that needs to be pushed to that node. The second is that there is
another table that tracks each object ID along with how many nodes that
particular object ID has been successfully pushed to.
Early on in my sprint, I was able to find an easy way to gain a speed
boost: I found that the query to retrieve a node table's object IDs was
being called once per node per object ID, resulting in very large
numbers of read queries (as an example, the Bodhi repo has 40k objects,
so with a 3 node cluster this results in 80k SELECT statements, since
there are tables to sync those objects to the other two nodes). It was
relatively easy to refactor the code to retrieve a group of object IDs
per query and get a quick win. I posted a pull request with a patch
that does this, which achieved a 51% boost on pushing Bodhi into
repoSpanner.
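To illustrate the shape of that change (the database file, table and column
names below are purely hypothetical, not repoSpanner's actual schema), it is
essentially the difference between one read per object ID and one read per
batch:

  # before: fetch the next pending object ID, once per object, per node table
  $ sqlite3 state.db "SELECT objectid FROM node2_queue LIMIT 1;"

  # after: fetch a whole batch of pending object IDs in a single query
  $ sqlite3 state.db "SELECT objectid FROM node2_queue LIMIT 500;"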
After achieving that gain, I attempted to continue down a similar path,
as the next significant block seemed to be the code that wrote the data
into that table. However, it quickly became clear that altering the
writing code to batch insert would be a more significant refactor than
altering the reading code to batch select had been. If I was going to
have to do a larger refactor anyway, it seemed worth exploring designs
that avoid or reduce the use of sqlite. I had reached a "local
minimum", so to speak.
I had a few calls with Patrick Uiterwijk, and it turned out that he had
also been thinking about ways to solve this problem, and he was in
favor of removing sqlite from the project. He gave me the background on
why sqlite had been used in the first place, and suggested that we
could create a file-backed Go chan to achieve similar goals with higher
performance.
Last week I put together a prototype of the "file backed chan" that he
and I designed together and I also refactored the repoSpanner code to
use the new chan. This is very much prototype code and not at all
pull-request worthy (at the time of writing, it contains a git commit
with the message "Test", if that tells you anything), so please be
forgiving of its messy state, but for those who are curious, you can
see what I've been experimenting with at .
I've found that I am able to push the Bodhi repository into repoSpanner
in about 25 minutes with that patch, where it took about 58 minutes
before. This is approximately a 57% speed improvement, which is a
little bit better than the 51% speed improvement of the other patch.
There is still one remaining use of sqlite - the table that records how
many nodes each object has been synced to. This is now the largest
bottleneck in repoSpanner push performance and is the next obvious
thing to eliminate. I've talked to Patrick about some ideas around
this, and we are considering eliminating the feature of tracking each
object individually and instead tracking the entire operation - i.e.,
consider a push successful only if all objects made it together to the
same majority of nodes. This is in contrast to today's feature, where
each object is considered individually successfully pushed if it made
it to a majority of nodes - i.e., it allows the objects not to have to
make it to the *same* majority of nodes. If we eliminate that feature,
we no longer have to perform individual tracking of which git objects
made it to which nodes and we can eliminate sqlite entirely. I expect
this will make the most significant difference to the performance of
git push, though it is difficult to estimate how much of a difference
it will make without prototyping it.
Another area that is known to be problematic is the speed of a git
pull. Today repoSpanner builds git pack files for the repo every time it
is pulled. I haven't done very much profiling here, but Patrick has
suggested caching git pack files to help in this area. I think it's an
area we should focus on improving in the future.
As for the immediate future, I plan to clean up my patches for the
sqlite changes I have been experimenting with this week so I can
propose them in a pull request. They will supersede my existing pull
request, so I plan to close that one. Then I think it will be sensible
to do another prototype/sprint where we explore eliminating sqlite
entirely.
 As written in , I tested a git push to a new repository, a git
clone, a git push of a new commit, and a git pull of a new commit.
 He wanted to avoid keeping large numbers of objects in memory,
while also allowing users to push objects faster than nodes could
write them. sqlite was an easy way to achieve this, since it
records the data to disk in an easily addressable and well known
format.
Fedora's complete MirrorManager setup is still running on Python2. The
code was ported to Python3 probably over two years ago, but we have
not switched yet. One of the reasons is that the backend is running on
RHEL7, which means we are not in a hurry to deploy the Python3 version.
The mirrorlist server, which is answering the actual dnf/yum queries for
a mirrorlist/metalink, is, however, running in a Fedora 29 container.
This container also still uses Python2 and it actually cannot use the
Python3 code yet.
One of MirrorManager's design points is that the mirrorlist servers,
which are answering around 27 000 000 requests per day, are not directly
accessing the database. The backend creates a snapshot of the relevant
data (113MB) and the mirrorlist servers are using this snapshot to
answer client requests.
This data exchange is based on Python's pickle format and that does not
seem to work with Python3 if it is generated using Python2.
Having used protobuf before, I added code to also export the data for the
mirrorlist servers based on protobuf.
The good news with protobuf is that the resulting file is only 66MB
instead of 113MB. The bad news is that loading it from Python requires
3.5 times the amount of memory during runtime (3.5GB instead of 1GB).
In addition to the data exchange problems between backend and
mirrorlist servers, the architecture of the mirrorlist server does not
really make sense today. 12 years ago it made a lot of sense, as it could
be easily integrated into httpd and it could be easily reloaded without
stopping the service. Today the mirrorlist server and httpd are all part
of a container which is then behind haproxy. So there is a lot of
infrastructure in the container which is not really useful.
To get rid of the pickle format and to have a simpler architecture, I
reimplemented the mirrorlist-server in Rust. This was brought up some
time ago in a ticket, and with the protobuf problems I was seeing in
Python it made sense to try it out.
My code currently can be found at https://github.com/adrianreber/mirrorlist-server
and so far the results from the new mirrorlist server are the same as
from the Python-based mirrorlist server.
With production data it requires less than 700MB instead of the 1GB for
the Python version, and it seems really fast.
I have set up a test instance with the mirror data from Sunday at:
The instance is based on the container I pushed to quay.io:
$ podman run quay.io/adrianreber/mirrorlist-server:latest -h
With this change the mirrorlist server would also finally switch to
geoip2. The currently running mirrorlist server still uses the legacy
GeoIP databases.
After the Fedora 31 freeze I would like to introduce this new mirrorlist
server implementation on the proxies. I already verified that I can run
this mirrorlist container rootless. This new container can be a drop-in
replacement for the current container and no infrastructure around it
needs to be changed.
The main changes to get it into production are to change mirrorlist1.service
and mirrorlist2.service to include a line "User=mirrormanager" and to
replace the current container name with the new one.
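A minimal sketch of what the relevant part of such a unit could look like
(only the User= line and the container image name come from the text above;
the ExecStart details are assumptions and the real mirrorlist.service.j2
template will differ):

  [Service]
  User=mirrormanager
  # run the new Rust-based mirrorlist container; ports, volumes with the
  # protobuf data file and other options are omitted from this sketch
  ExecStart=/usr/bin/podman run --rm --name mirrorlist1 \
      quay.io/adrianreber/mirrorlist-server:latest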