Re: DAG scalability (was: Git commit generation numbers)

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Sun, 17 Jul 2011 15:25:33 -0700

On Sun, Jul 17, 2011 at 3:18 PM, Shawn Pearce <spearce@xxxxxxxxxxx> wrote:
>
> What about `git clone`?  We're always recomputing the entire DAG
> during it. For a public repository like yours that only contains
> public objects, it's a horrible abuse of the servers that are serving
> the repository...
>
> Just saying, not everything we do winds up being a partial or
> incomplete traversal in the name of performance.

I don't see your point.

OF COURSE we sometimes traverse the whole tree - when we need all the
data. And it's expensive in those cases, but generally those cases are
also cases where the DAG traversal itself is just a tiny part of the
big picture. The commits tend to be almost irrelevant to "git clone",
for example: it tends to be tree and blob objects that are the biggest
cost.

But there are a lot of common operations that would be much too
expensive unless we had the incomplete DAG traversal code. It's what
makes us able to do sub-second merges, it's what makes "gitk @{6am}.."
be fast, etc etc.
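
To make "incomplete DAG traversal" concrete, here is a minimal Python
sketch of a range walk (commits reachable from one tip but not from
another). It is not git's actual code. The dict-based commit model and
the names commits_in_range, parents and date are made up for
illustration, and it assumes every commit is strictly newer than its
parents (no clock skew).

```python
import heapq

def commits_in_range(parents, date, base, tip):
    """Commits reachable from tip but not from base (the set "base..tip").

    parents: dict mapping commit -> list of parent commits (hypothetical model)
    date:    dict mapping commit -> timestamp; ASSUMED strictly newer than
             every parent's timestamp (no clock skew)
    Returns (in_range, visited) so the caller can see how little was walked.
    """
    flags = {}   # commit -> subset of {"want", "skip"}
    heap = []    # max-heap on date, via negated timestamps

    def mark(commit, flag):
        seen = flags.setdefault(commit, set())
        if flag not in seen:
            seen.add(flag)
            heapq.heappush(heap, (-date[commit], commit))

    mark(tip, "want")
    mark(base, "skip")
    in_range, visited = [], set()
    # Stop as soon as everything left on the frontier is reachable from base:
    # nothing older can be in the range, so the rest of history is never read.
    while any("skip" not in flags[c] for _, c in heap):
        _, commit = heapq.heappop(heap)
        if commit in visited:
            continue
        visited.add(commit)
        if "skip" not in flags[commit]:
            in_range.append(commit)
        for flag in sorted(flags[commit]):
            for parent in parents.get(commit, ()):
                mark(parent, flag)
    return in_range, visited

# Toy history: A <- B <- C <- D, plus a side commit E whose parent is B.
parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"], "E": ["B"]}
date = {"A": 1, "B": 2, "C": 3, "E": 4, "D": 5}
in_range, visited = commits_in_range(parents, date, base="E", tip="D")
```

In the toy history the walk stops once the only commit left on the
frontier (B) is known to be reachable from both sides, so A is never
looked at. The any() scan over the heap is linear and only there to
keep the sketch short.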

My point really was that the git DAG structure is really simple.
People learn about DAGs in first-year CS courses.

But the kinds of things that git does, which is to try to partition
the DAG without having to walk it entirely - that's rare. I tried to
find papers about optimized DAG walking, and couldn't (but so many
academic papers are behind a pay-wall that I still don't know if there
might be some smart person who came up with a really good algorithm
for what the git-merge-base stuff does, for example).
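
For readers who have not looked at the merge-base problem, the
following Python sketch shows the general idea of finding best common
ancestors by painting the DAG from both sides, newest commit first. It
is an illustration under stated assumptions, not the git-merge-base
implementation: the commit model, the flag names and the function
merge_bases are hypothetical, and timestamps are assumed to be strictly
increasing from parent to child.

```python
import heapq

LEFT, RIGHT, STALE = 1, 2, 4

def merge_bases(parents, date, a, b):
    """Best common ancestors of a and b, walking as little history as possible.

    parents: dict mapping commit -> list of parent commits (hypothetical model)
    date:    dict mapping commit -> timestamp; ASSUMED strictly newer than
             every parent's timestamp (no clock skew)
    """
    flags = {a: LEFT}
    flags[b] = flags.get(b, 0) | RIGHT
    heap = [(-date[c], c) for c in flags]   # newest first
    heapq.heapify(heap)
    done, candidates = set(), []
    # Once every queued commit is STALE (an ancestor of a known common
    # ancestor), nothing older can be a better answer, so the walk stops.
    while any(not flags[c] & STALE for _, c in heap):
        _, commit = heapq.heappop(heap)
        if commit in done:
            continue
        done.add(commit)
        f = flags[commit]
        if (f & (LEFT | RIGHT)) == (LEFT | RIGHT) and not f & STALE:
            candidates.append(commit)
            f |= STALE                      # passed to parents only
        for parent in parents.get(commit, ()):
            old = flags.get(parent, 0)
            if old | f != old:
                flags[parent] = old | f
                heapq.heappush(heap, (-date[parent], parent))
    return [c for c in candidates if not flags[c] & STALE]

# Toy history: R <- M, then two branches M <- X <- A and M <- Y <- B.
parents = {"R": [], "M": ["R"], "X": ["M"], "Y": ["M"], "A": ["X"], "B": ["Y"]}
date = {"R": 1, "M": 2, "X": 3, "Y": 4, "A": 5, "B": 6}
bases = merge_bases(parents, date, "A", "B")
```

With the toy history, the walk finds M and then stops with only R
queued, already marked stale, so R's own ancestry would never be read.
Real histories with clock skew break the strict ordering assumed here,
which is where the hard part of the problem lies.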

                    Linus