On Wed, Sep 18, 2019 at 9:58 AM Stephen John Smoogen <smooge(a)gmail.com> wrote:
On Wed, 18 Sep 2019 at 09:44, Randy Barlow <bowlofeggs(a)fedoraproject.org> wrote:
>
> On Tue, 2019-09-17 at 19:01 -0400, Neal Gompa wrote:
> > Out of curiosity, do we know where the bottlenecks are in
> > repoSpanner?
> > In theory, the architecture of repoSpanner isn't supposed to be too
> > different from gitaly, so I'm curious where we're falling down.
>
> I believe it needs a more efficient way to store the git objects. As I
> understand it, it currently stores each one in its own file, resulting
> in a large number of small files.
So my "hot-take probably wrong" look at things seems to indicate that
the reason it stores everything as a separate file is to make certain
git actions faster. When you pack the files, searches, diffs and other
checks become slower or memory intensive because you have to calculate
new deltas and other things 'lost' in the packing.
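
For reference, a minimal sketch of what "one file per object" looks like in stock git, assuming the SHA-1 object format. Nothing here is taken from repoSpanner's code, and the function name `loose_object` is made up for illustration. It shows that every blob, however small, gets its own zlib-compressed file under a two-character fan-out directory:

```python
import hashlib
import zlib


def loose_object(data: bytes, kind: str = "blob"):
    """Return (object id, relative path, compressed bytes) for a loose object."""
    # git hashes a header "<type> <size>\0" followed by the raw content
    store = f"{kind} {len(data)}".encode() + b"\x00" + data
    oid = hashlib.sha1(store).hexdigest()
    # the first two hex digits become a directory, the rest the file name
    path = f"objects/{oid[:2]}/{oid[2:]}"
    # the file on disk is that same header + content, zlib-deflated
    return oid, path, zlib.compress(store)


if __name__ == "__main__":
    for content in (b"", b"hello\n", b"hello world\n"):
        oid, path, packed = loose_object(content)
        print(path, len(packed), "bytes on disk")
```

Packing trades those many small files for a few large ones with delta compression, which is where the extra work on reads mentioned above comes from.
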
Looking at the gitaly documents, I think that is the reason they have
multiple different types of in-memory caches at different layers. That
allows for faster access, but it probably blows up the amount of
hardware needed. We have to be careful here because we don't have a
hardware reserve to dive into for more memory/cpu.
I think that for gitlab.com (versus running a local gitlab) they also
use a lot of backend 'eventual' consistency caching. You push and it
begins to spread that out through the multiple regions it is housed
in. The 'user' doesn't see this because the front end level just
directs you to the known hot caches for that particular pull/push
request, but if you somehow were hardcoded to a region you might not
see the update/change for a while because it hasn't mirrored out
completely. That also would speed up push/pull/changes greatly and is
not something we could 'duplicate'.
That definitely explains the performance consistency between
repoSpanner and gitaly for my local deployment. So it's most likely
related to how they simulate better performance as the backend catches
up.
That said, the most recent change to gitaly is that it now does hashed
storage of git objects and does "fast forking" using alternates
instead of storing as bare git repos and duplicating repos on disk.
None of that changes the initial push for a unique repo.
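
To make the alternates part concrete, here is a toy sketch of how an object lookup can fall through to a shared object directory. It assumes stock git's `objects/info/alternates` file (one object directory per line) and only handles loose objects; it ignores packs and nested alternates. `find_loose_object` is a name made up for illustration, not gitaly's code:

```python
import os
import tempfile


def find_loose_object(objects_dir, oid):
    """Return the path of a loose object, checking alternates if needed."""
    rel = os.path.join(oid[:2], oid[2:])
    candidate = os.path.join(objects_dir, rel)
    if os.path.exists(candidate):
        return candidate
    alternates = os.path.join(objects_dir, "info", "alternates")
    if not os.path.exists(alternates):
        return None
    with open(alternates) as fh:
        for line in fh:
            other = line.strip()
            if not other:
                continue
            # relative entries are resolved against this objects directory
            if not os.path.isabs(other):
                other = os.path.join(objects_dir, other)
            candidate = os.path.join(other, rel)
            if os.path.exists(candidate):
                return candidate
    return None


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    pool = os.path.join(root, "pool", "objects")
    fork = os.path.join(root, "fork", "objects")
    oid = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
    os.makedirs(os.path.join(pool, oid[:2]))
    os.makedirs(os.path.join(fork, "info"))
    # the object exists only in the shared pool
    open(os.path.join(pool, oid[:2], oid[2:]), "wb").close()
    # the fork stores no copy; it just points at the pool
    with open(os.path.join(fork, "info", "alternates"), "w") as fh:
        fh.write(pool + "\n")
    print(find_loose_object(fork, oid))
```

The fork only pays for objects that are not already in the pool, which helps forks of existing repos but does nothing for the first push of a repo whose objects nobody has yet.
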
--
真実はいつも一つ!/ Always, there's only one truth!