Re: [PATCH v6 00/14] Serialized Git Commit Graph

SZEDER Gábor <szeder.dev@xxxxxxxxx> · Fri, 16 Mar 2018 20:48:49 +0100

On Fri, Mar 16, 2018 at 7:33 PM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
> SZEDER Gábor <szeder.dev@xxxxxxxxx> writes:
>
>> You should forget '--stdin-packs' and use '--stdin-commits' to generate
>> the initial graph, it's much faster even without '--additive'[1].  See
>>
>>   https://public-inbox.org/git/CAM0VKj=wmkBNH=psCRztXFrC13RiG1EaSw89Q6LJaNsdJDEFHg@xxxxxxxxxxxxxx/
>>
>> I still think that the default behaviour for 'git commit-graph write'
>> should simply walk history from all refs instead of enumerating all
>> objects in all packfiles.
>
> Somehow I missed that one.  Thanks for the link to it.
>
> It is not so surprising that history walking runs rings around
> enumerating objects in packfiles, if packfiles are built well.
>
> A well-built packfile tends to have newer objects in base form and
> deltas that go in the backward direction (older objects are
> represented as deltas against newer ones).  This helps walking from
> the tips of the history quite a bit, because your delta base cache
> will tend to have the base object (i.e. objects in the newer part of
> the history you just walked) that will be required to access the
> "next" older part of the history more often than not.
>
> Trying to read the objects in the pack in their object name order
> would essentially mean reading them in a cryptographically random
> order.  Half the time you will end up wanting to access an object
> that is near the tip of a very deep delta chain even before you've
> accessed any of the base objects in the delta chain.

I came up with a different explanation back then: we are only interested
in commit objects when creating the commit graph, and only a small-ish
fraction of all objects are commit objects, so the "enumerate objects in
packfiles" approach has to look at a lot more objects:

  # in my git fork
  $ git rev-list --all --objects |cut -d' ' -f1 |\
    git cat-file --batch-check='%(objecttype) %(objectsize)' >type-size
  $ grep -c ^commit type-size
  53754
  $ wc -l type-size
  244723 type-size

I.e. only about 22% of all objects are commit objects.

Furthermore, in order to look at an object it has to be zlib inflated
first, and since commit objects tend to be much smaller than trees and
especially blobs, there are far fewer bytes to inflate:

  $ grep ^commit type-size |cut -d' ' -f2 |avg
  34395730 / 53754 = 639
  $ cat type-size |cut -d' ' -f2 |avg
  3866685744 / 244723 = 15800
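
Note that 'avg' is not a standard tool.  A minimal sketch of a helper
that prints output in the format shown above, assuming awk is available
and that the average is truncated to an integer as the numbers suggest
(the actual script may well differ):

```shell
# Hypothetical reconstruction of the 'avg' helper used above.
# Reads one number per line on stdin and prints "sum / count = average",
# with the average truncated to an integer.  Prints nothing on empty input.
avg () {
	awk '
		{ sum += $1; n++ }
		END {
			# %.0f keeps large sums intact on awk variants with 32-bit %d
			if (n > 0)
				printf "%.0f / %d = %d\n", sum, n, int(sum / n)
		}
	'
}
```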

So a simple revision walk inflates less than 1% of the bytes that the
"enumerate objects in packfiles" approach has to inflate.

>> [1] - Please excuse the bikeshed: '--additive' is such a strange
>>       sounding option name, at least for me.  '--append', perhaps?
>
> Yeah, I think "fetch --append" is probably a precedent.