Re: repospanner and our Ansible repo

Neal Gompa <ngompa13@xxxxxxxxx> · Wed, 18 Sep 2019 10:04:03 -0400

On Wed, Sep 18, 2019 at 9:58 AM Stephen John Smoogen <smooge@xxxxxxxxx> wrote:
>
> On Wed, 18 Sep 2019 at 09:44, Randy Barlow <bowlofeggs@xxxxxxxxxxxxxxxxx> wrote:
> >
> > On Tue, 2019-09-17 at 19:01 -0400, Neal Gompa wrote:
> > > Out of curiosity, do we know where the bottlenecks are in
> > > repoSpanner?
> > > In theory, the architecture of repoSpanner isn't supposed to be too
> > > different from gitaly, so I'm curious where we're falling down.
> >
> > I believe it needs a more efficient way to store the git objects. As I
> > understand it, it currently stores each one in its own file, resulting
> > in a large number of small files.
>
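
For anyone unfamiliar with the one-file-per-object layout, here is a rough
sketch of how plain git stores "loose" objects. This is only an analogy:
repoSpanner's actual on-disk format is not described in this thread, and
the helper names (loose_object_id, write_loose_object, count_files) are
made up for illustration.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def loose_object_id(data: bytes, obj_type: str = "blob") -> str:
    """Return the SHA-1 object id git assigns to this content."""
    header = f"{obj_type} {len(data)}\0".encode()
    return hashlib.sha1(header + data).hexdigest()


def write_loose_object(objects_dir: Path, data: bytes, obj_type: str = "blob") -> Path:
    """Write one zlib-compressed object file, the way plain git lays out loose objects."""
    header = f"{obj_type} {len(data)}\0".encode()
    oid = hashlib.sha1(header + data).hexdigest()
    # The first two hex digits become a directory, the remaining 38 the file name.
    path = objects_dir / oid[:2] / oid[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(header + data))
    return path


def count_files(objects_dir: Path) -> int:
    """Count object files under objects_dir (one file per stored object)."""
    return sum(1 for p in objects_dir.rglob("*") if p.is_file())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        objects = Path(tmp) / "objects"
        for i in range(1000):
            write_loose_object(objects, f"file contents {i}\n".encode())
        # Every distinct object costs one small file on disk.
        print(count_files(objects))
```

Every distinct object becomes its own small file, so a large repo ends up
with a very large number of tiny files unless something packs them.
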
> So my "hot-take probably wrong" look at things seems to indicate that
> the reason it stores everything as a separate file is to make certain
> git actions faster. When you pack the files, searches, diffs and other
> checks become slower or memory intensive because you have to calculate
> new deltas and other things 'lost' in the packing.
>
> Looking at the gitaly documents, I think that is the reason they have
> multiple different types of in-memory caches at different layers. That
> allows for faster access but probably blows up the amount of hardware
> needed. We have to be careful here because we don't have a hardware
> reserve to dive into for more memory/cpu.
>
> I think that for gitlab.com (versus running a local gitlab) they also
> use a lot of backend 'eventual' consistency caching. You push and it
> begins to spread that out through the multiple regions it is housed in.
> The 'user' doesn't see this because the front end level just directs
> you to the known hot caches for that particular pull/push request,
> but if you somehow were hardcoded to a region you might not see the
> update/change for a while because it hasn't mirrored out completely.
> That also would speed up push/pull/changes greatly and is not something
> we could 'duplicate'.
>

That definitely explains why repoSpanner and gitaly perform about the
same on my local deployment. So the difference on the hosted service is
most likely down to how they simulate better performance while the
backend catches up.

That said, the most recent change to gitaly is that it now does hashed
storage of git objects and does "fast forking" using alternates,
instead of storing each fork as a full bare git repo duplicated on disk.
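
A minimal sketch of those two ideas, under stated assumptions. The
alternates part relies on git's standard objects/info/alternates file,
which lists extra object directories to read from. The hashed path layout
(@hashed/<first two hex>/<next two hex>/<sha256 of project id>.git) is how
I understand GitLab's documented hashed storage, so treat it as an
assumption. The function names hashed_repo_path and link_to_pool are
hypothetical and are not gitaly APIs.

```python
import hashlib
import tempfile
from pathlib import Path


def hashed_repo_path(project_id: int) -> str:
    """Assumed hashed-storage layout: path derived from SHA-256 of the project id."""
    digest = hashlib.sha256(str(project_id).encode()).hexdigest()
    return f"@hashed/{digest[:2]}/{digest[2:4]}/{digest}.git"


def link_to_pool(fork_git_dir: Path, pool_git_dir: Path) -> Path:
    """Point a fork at a shared object pool via git's alternates mechanism.

    Objects already present in the pool are read from there, so the fork
    does not need its own copy of them on disk.
    """
    alternates = fork_git_dir / "objects" / "info" / "alternates"
    alternates.parent.mkdir(parents=True, exist_ok=True)
    pool_objects = str((pool_git_dir / "objects").resolve())
    lines = alternates.read_text().splitlines() if alternates.exists() else []
    if pool_objects not in lines:  # avoid duplicate entries on repeated calls
        lines.append(pool_objects)
    alternates.write_text("\n".join(lines) + "\n")
    return alternates


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        fork = root / hashed_repo_path(42)
        pool = root / "pools" / "upstream.git"  # hypothetical pool location
        print(link_to_pool(fork, pool).read_text(), end="")
```

The fork only needs to hold the objects that differ from the pool, which
is where the disk savings come from.
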

None of that changes the initial push for a unique repo.

--
真実はいつも一つ！/ Always, there's only one truth!