Re: merge-base --is-ancestor A B is unreasonably slow with unrelated history B

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Wed, 10 Jan 2018 15:01:22 +0100

On Tue, Jan 09 2018, Derrick Stolee jotted:

> On 1/9/2018 10:17 AM, Ævar Arnfjörð Bjarmason wrote:
>> This is a pathological case I don't have time to dig into right now:
>>
>>      git branch -D orphan;
>>      git checkout --orphan orphan &&
>>      git reset --hard &&
>>      touch foo &&
>>      git add foo &&
>>      git commit -m"foo" &&
>>      time git merge-base --is-ancestor master orphan
>>
>> This takes around 5 seconds on linux.git to return 1. Which is around
>> the same time it takes to run current master against the first commit in
>> linux.git:
>>
>>      git merge-base --is-ancestor 1da177e4c3f4 master
>>
>> This is obviously a pathological case, but maybe we should work slightly
>> harder on the RHS of and discover that it itself is an orphan commit.
>>
>> I ran into this while writing a hook where we'd like to do:
>>
>>      git diff $master...topic
>>
>> Or not, depending on if the topic is an orphan or just something
>> recently branched off, figured I could use --is-ancestor as on
>> optimization, and then discovered it's not much of an optimization.
>
> Ævar,
>
> This is the same performance problem that we are trying to work around
> with Jeff's "Add --no-ahead-behind to status" patch [1]. For commits
> that are far apart, many commits need to be parsed. I think the right
> solution is to create a serialized commit graph that stores the
> adjacency information of the commits and can create commit structs
> quickly. This requires storing the commit id, commit date, parents,
> and root tree id to satisfy the needs of parse_commit_gently(). Once
> the framework for this data is constructed, it is simple to add
> generation numbers to that data and start consuming them in other
> algorithms (by adding the field to 'struct commit').
>
> I'm working on such a patch right now, but it will be a few weeks
> before I'm ready.

Thanks, yes of course I forgot about not being able to rely on
timestamps at all, otherwise this check wouldn't need commit traversal
at all, we could simply note:

 1. A is older than B
 2. B is an orphan commit
 3. Therefore it's impossible that A is an ancestor of B

But we don't enforce any of this in the DAG, so now we need to do a full
traversal to make sure this B committed in 2018 doesn't precede some
commit added in 2005.

This has come up before related to various traversal optimizations,
e.g. [tag|branch] --contains.

I don't know the details of what you have in mind, but have you
considered a solution which involves either computing that all commits
in the repo have a timestamp later than their parents, or alternatively
declaring via some config mechanism that that's true of all (or a subset
of) commits?

E.g. for centrally hosted repos a pre-receive hook could be added to
assert that all incoming commits must have monotonically increasing
timestamps compared to preceding commits, and if that wasn't the case we
could declare that e.g. all commits added after 2019 would have that
property (after ensuring no commits had dates in the future).

It sounds like a subset of what you have in mind, so maybe it's
useless. However for many cases such as this one it should be faster to
almost instantaneously consult such config instead of some side-data we
need to maintain, but it's probably not worth the complexity.

> [1] v5 of --no-ahead-behind
> https://public-inbox.org/git/20180109185018.69164-1-git@xxxxxxxxxxxxxxxxx/T/#t
>
> [2] v4 of --no-ahead-behind
> https://public-inbox.org/git/nycvar.QRO.7.76.6.1801091744540.37@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/T/#t