Re: [QUESTION] Performance comparison: full clone + sparse-checkout vs partial clone + sparse-checkout

Elijah Newren <newren@xxxxxxxxx> · Fri, 8 Nov 2024 09:24:00 -0800

On Wed, Nov 6, 2024 at 8:52 PM Manoraj K <mkenchugonde@xxxxxxxxxxxxx> wrote:
>
> Bump
>
> On Mon, Oct 28, 2024 at 4:00 PM Manoraj K <mkenchugonde@xxxxxxxxxxxxx> wrote:
> >
> > Hi,
> >
> > We've conducted benchmarks comparing Git operations between a fully
> > cloned and partially cloned repository (both using sparse-checkout).
> > We'd like to understand the technical reasons behind the consistent
> > performance gains we're seeing in the partial clone setup.
> >
> > Benchmark Results:
> >
> > Full Clone + Sparse-checkout:
> > - .git size: 8.5G
> > - Git index size: 20MB
> > - Pack objects: 18,761,646
> > - Operations (mean ± std dev):
> >   * git status: 0.634s ± 0.004s
> >   * git commit: 2.677s ± 0.019s
> >   * git checkout branch: 0.615s ± 0.005s
> >   * git pull (no changes): 5.983s ± 0.391s
> >
> > Partial Clone + Sparse-checkout:
> > - .git size: 2.0G
> > - Git index size: 20MB
> > - Pack objects: 13,560,436
> > - Operations (mean ± std dev):
> >   * git status: 0.575s ± 0.012s (9.3% faster)
> >   * git commit: 2.164s ± 0.032s (19.2% faster)
> >   * git checkout branch: 0.724s ± 0.154s
> >   * git pull (no changes): 1.866s ± 0.018s (68.8% faster)
> >
> > Key Questions:
> > 1. What are the technical factors causing these performance
> > improvements in the partial clone setup?
> > 2. To be able to get these benefits, is there a way to convert our
> > existing fully cloned repository to behave like a partial clone
> > without re-cloning from scratch?
> >
> > Appreciate any insights here.
> >
> > Best regards,
> > Manoraj K

Taking some wild guesses:

`git pull` fetches updates for _all_ branches and then merges (or
rebases) the updates for the current branch.  Your "no changes"
probably means there's no merge/rebase needed, but that doesn't mean
there was nothing to fetch.  A partial clone isn't going to download
all the blobs, so it has much less to download and is thus
significantly faster.
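
One rough way to check this guess is to time the fetch and the merge
separately in each clone.  A minimal sketch in Python, assuming `git`
is on your PATH and the current branch has an upstream configured;
`time_command` and `time_pull_phases` are names made up for
illustration, and note that the merge step really does update your
branch, just as a pull would:

```python
import subprocess
import time


def time_command(cmd, cwd=None):
    """Run cmd once; return (elapsed_seconds, exit_code)."""
    start = time.perf_counter()
    result = subprocess.run(
        cmd,
        cwd=cwd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return time.perf_counter() - start, result.returncode


def time_pull_phases(repo_path):
    """Time the two halves of a pull separately (hypothetical helper).

    Assumes a merge-style pull; --ff-only refuses anything that is not
    a fast-forward, so a real merge shows up as a non-zero exit code.
    """
    fetch = time_command(["git", "fetch"], cwd=repo_path)
    merge = time_command(
        ["git", "merge", "--ff-only", "@{upstream}"], cwd=repo_path
    )
    return {"fetch": fetch, "merge": merge}
```

Calling `time_pull_phases("/path/to/clone")` in both the full and the
partial clone should show whether the fetch half accounts for most of
the gap.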

`git checkout branch` would likely be slower in a partial clone
because sometimes objects are missing and need to be downloaded.  And
indeed, it shows as being a little slower for you.
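
To put a number on "a little slower", here is the arithmetic on the
figures quoted above, as a small Python sketch (the variable and
function names are mine; the numbers are from your benchmark):

```python
# Mean and std dev for `git checkout branch`, from the quoted results.
full_mean, full_std = 0.615, 0.005
partial_mean, partial_std = 0.724, 0.154


def relative_change(baseline, value):
    """Percentage change of value relative to baseline."""
    return (value - baseline) / baseline * 100.0


slowdown_pct = relative_change(full_mean, partial_mean)
gap = partial_mean - full_mean

print(f"{slowdown_pct:.1f}% slower; gap {gap:.3f}s, std dev {partial_std:.3f}s")
```

The gap between the two means is smaller than the partial clone's own
standard deviation, so I wouldn't read too much into the exact size of
the slowdown.  The much larger spread might itself come from some runs
needing a download and others not, but that's another guess.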

`git status` is harder to guess at.  The only guess I can come up with
for this case is that fewer objects means faster lookup (I'm not
familiar with the packfile code, but I think object lookups use a
bisect to find the objects in question, and fewer objects to bisect
would make things faster if so); not sure if this could account for a
9% difference, though.  Maybe someone who understands packfiles,
object lookup, and promisor remotes has a better idea here?
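
As a back-of-the-envelope check on the bisect idea, here is a Python
sketch that counts worst-case probes, assuming (and this is only an
assumption) a plain binary search over a single sorted list of all the
objects.  It ignores multiple packs and anything else the real
packfile code does to narrow the search; `bisect_steps` is a made-up
helper name:

```python
# "Pack objects" counts from the quoted benchmark.
FULL_CLONE_OBJECTS = 18_761_646
PARTIAL_CLONE_OBJECTS = 13_560_436


def bisect_steps(n):
    """Worst-case probes for a binary search over n sorted entries.

    For n >= 1 this is floor(log2(n)) + 1, which equals n.bit_length().
    """
    return n.bit_length()


full = bisect_steps(FULL_CLONE_OBJECTS)
partial = bisect_steps(PARTIAL_CLONE_OBJECTS)
print(f"full: {full} probes, partial: {partial} probes")
```

Under that simplified model the partial clone saves one probe per
lookup, which is part of why I doubt this alone explains a 9%
difference.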

I'm a bit surprised by the `git commit` case; how can it take so long
on your repo (2-3s)?  Do you have commit hooks in place?  If so, what
are they doing?  (And if you do, I suspect whatever they are doing is
responsible for the differences in timings between the partial clone
and the full clone, so you'd need to dig into them.)
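
A quick way to see whether any hooks are enabled is to list the
executable, non-`.sample` files in the hooks directory.  A minimal
Python sketch, assuming the default `.git/hooks` location (if
`core.hooksPath` is set, the hooks live wherever it points instead);
`active_hooks` is a made-up helper name:

```python
import os


def active_hooks(hooks_dir):
    """Return sorted names of executable, non-sample files in hooks_dir."""
    if not os.path.isdir(hooks_dir):
        return []
    names = []
    for name in sorted(os.listdir(hooks_dir)):
        if name.endswith(".sample"):
            continue  # shipped examples; git does not run these
        path = os.path.join(hooks_dir, name)
        if os.path.isfile(path) and os.access(path, os.X_OK):
            names.append(name)
    return names
```

Running `active_hooks(".git/hooks")` from the top of each clone and
seeing something like `pre-commit` or `commit-msg` in the result would
be the first thing I'd look at.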