Junio C Hamano <gitster@xxxxxxxxx> writes: > Thomas Rast <trast@xxxxxxxxxxxxxxx> writes: > >> The right fix would be to dig up Peff's work on generation number >> caching, and modify the algorithm to take generation numbers into >> account. > > I think you are totally wrong here, unless you are talking about a > generation number that is different from what I recall from the > older discussion. Hrm, now you're making me feel like I missed something. But my reasoning goes roughly like this. Consider the commit graph as a DAG, where the parent pointers are directed edges from child to parent. Thus, tip commits are sources and root commits are sinks. One way to get a valid topo ordering, knowing the complete DAG, is: while there are vertices left: remove any source and emit it (You can also generate it from the sinks instead; I haven't actually checked what the code currently does. But for the following, we need to start from the sources.) To implement this efficiently, one usually keeps track of the set S of sources, and inserts vertices that (by removal of their last in-neighbour) have become a source. The distinction between topo-order and date-order is the preference given in the choice of source. IIUC, date-order prefers the newest source by committer date (using a priority queue or such for S), and topo-order prefers the source(s) which were parents of the vertex removed in the previous iteration (using a stack for S). The problem with this algorithm is that it requires knowledge of the DAG[1]. So with --topo-order, git dutifully loads the whole commit graph, and then runs the algorithm, leading to the discussed startup delay. However, suppose we knew generation numbers. I haven't actually looked into the old threads again, but my understanding was that they are numbers g(C) attached to each commit C such that g(C) = 1 + max(g(P) for P a parent of C) for non-root commits g(C) = 0 for root commits They are invariant given the commit, so they can be cached. And they allow us to infer something about ancestry: if g(C)<=g(D), then C cannot possibly be a descendant of D. The topo order algorithm can be modified to take advantage of them, in order to provide incremental processing: Let S be the set of tentative sources Let U be the set of vertices whose out-edges are no known yet (i.e., the set of commits which haven't been loaded yet) Let max g(U) denote "max g(C) taken over all C in U". Fill S and U from the revisions given: load all positive revisions, let's call this set P put all zero-indegree members of P in S put all parents of members of P in U (as far as they are not also members of P, and therefore loaded already) while there are any vertices left: pick any tentative source C from S that we "want to emit" # Ascertain that no unknown commit (from U or further beyond) can be # a descendant of C while there is a D in U such that g(D) > g(C): load D remove D from U add the parents of D to U if they were not already loaded possibly remove some elements of S if their indegree became nonzero if C was removed from S: continue remove C from the graph and emit it I hope I got that right. The order of commits is still entirely determined by the choice of "any tentative source", but the algorithm should now stream nicely once the generation numbers are known. Footnotes: [1] If there are any parent pointers not yet followed, they may in fact wind up (over at least one more commit) pointing to the commit you wanted to emit next, making that an invalid choice. -- Thomas Rast trast@{inf,student}.ethz.ch -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html