repoSpanner sprint report

Monday, 14 October 2019

Greetings ya'll,

I spent the last few weeks studying repoSpanner with the goal of
developing a plan to improve its performance. I started by testing its
performance with a few common git operations with a couple repos (our
Infrastructure Ansible repository since it is on the large side, and
Bodhi since I had it cloned already and is perhaps a "typical" medium
sized project). I wrote an initial report about those tests here[0].

Since the time of that report, I have done some performance profiling
on the git push for the Bodhi repository, since that by far was the
slowest operation that I tested[1].

I found that the most significant time was spent interacting with
sqlite. sqlite is used today by repoSpanner as a task queue. There are
two different workflows. The first is that it creates a table per
repoSpanner node, and each row of the table represents a git object ID
that needs to be pushed to that node. The second is that there is
another table that tracks each object ID along with how many nodes that
particular object ID has been successfully pushed to.

Early on in my sprint, I was able to find an easy way to gain a speed
boost - I found that the query to retrieve a node table's object ID was
being called once per node per object ID, resulting in very large
numbers of read queries (as an example, the Bodhi repo has 40k objects,
so if I had a 3 node cluster, this would result in 80k SELECT
statements, since there will be tables to sync those objects to the
other two nodes). It was relatively easy to refactor the code to
retrieve a group of object IDs per query and get a quick win. I posted
up a pull request with a patch that does this that achieved a 51% boost
on pushing Bodhi into repoSpanner[2].

After achieving that gain, I attempted to continue down a similar path
as the next significant block seemed to be the code that wrote the data
into that table. However, it quickly became clear that it was a more
significant refactor to alter the writing code to batch insert than it
had been to alter the reading code to batch select. If I was going to
have to do a larger refactor, it became clear that it would be worth
exploring designs that avoid or reduce the use of sqlite. I had reached
a "local minima", so to speak.

I had a few calls with Patrick Uiterwijk, and it turned out that he had
also been thinking about ways to solve this problem, and he was in
favor of removing sqlite from the project. He gave me the background on
why sqlite had been used in the first place[3], and suggested that we
could create a file backed go chan to achieve similar goals with higher
performance.

Last week I put together a prototype of the "file backed chan" that he
and I designed together and I also refactored the repoSpanner code to
use the new chan. This is very much prototype and not at all pull
request worthy code (at the time of writing, it contains a git commit
with the message "Test", if that tells you anything), so please be
forgiving of its messy state, but for those who are curious, you can
see what I've been experimenting with at [4].

I've found that I am able to push the Bodhi repository into repoSpanner
in about 25 minutes with that patch, where it took about 58 minutes
before. This is approximately a 57% speed improvement, which is a
little bit better than the 51% speed improvement of the other patch.

There is still one remaining use of sqlite - the table that records how
many nodes that each object has been synced to. This is now the largest
bottleneck in repoSpanner push performance and is the next obvious
thing to eliminate. I've talked to Patrick about some ideas around
this, and we are considering eliminating the feature of tracking each
object individually and instead tracking the entire operation - i.e.,
consider a push successful only if all objects made it together to the
same majority of nodes. This is in contrast to today's feature, where
each object is considered individually successfully pushed if it made
it to a majority of nodes - i.e., it allows the objects not to have to
make it to the *same* majority of nodes. If we eliminate that feature,
we no longer have to perform individual tracking of which git objects
made it to which nodes and we can eliminate sqlite entirely. I expect
this will make the most significant difference to the performance of
git push, though it is difficult to estimate how much of a difference
it will make without prototyping it.

Another area that is known to be problematic is the speed of a git
pull. Today repoSpanner builds gitpack files for the repo every time it
is pulled. I haven't done very much profiling here, but Patrick has
suggested caching git pack files to help in this area. I think it's an
area we should focus on improving in the future.

As for the immediate future, I plan to clean up my patches for the
sqlite changes I have been experimenting with this week so I can
propose them in a pull request. They will supercede my existing pull
request, so I plan to close that one. Then I think it will be sensible
to do another prototype/sprint where we explore eliminating sqlite
entirely.

Thanks for reading, and let me know if you have any ideas or questions!

[0]
https://lists.fedoraproject.org/archives/list/infrastructure@lists.fedora...
[1] As written in [0], I tested a git push to a new repository, a git
    clone, a git push of a new commit, and a git pull of a new commit.
[2] https://github.com/repoSpanner/repoSpanner/pull/91
[3] He wanted to avoid keeping large numbers of objects in memory,
    while also allowing users to push objects faster than nodes could
    write them. sqlite was an easy way to achieve this, since it
    records the data to disk with an easily addressable and well known
    API.
[4]
https://github.com/repoSpanner/repoSpanner/compare/master...bowlofeggs:fi...

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

repoSpanner sprint report