Re: [PATCH v3 0/4] Speed up unpack_trees()

Junio C Hamano <gitster@xxxxxxxxx> · Wed, 08 Aug 2018 10:46:38 -0700

Duy Nguyen <pclouds@xxxxxxxxx> writes:

> On Mon, Aug 6, 2018 at 5:48 PM Junio C Hamano <gitster@xxxxxxxxx> wrote:
>> > I've also checked about the lookahead thing in unpack_trees() to see
>> > if we accidentally break something there, which is my biggest worry.
>> > See [1] and [2] for context, but I believe since we can't have D/F
>> > conflicts, the situation where lookahead is needed will not occur. So
>> > we should be safe.

I think you would want the same "switch cache-bottom before
descending into a subdirectory, and then restore cache-bottom after
traversal comes back" dance that is done for the normal tree
traversal case to happen.

	bottom = switch_cache_bottom(&newinfo);
	ret = traverse_trees(n, t, &newinfo);
	restore_cache_bottom(&newinfo, bottom);

During your walk of the index and the trees that are known to be in
sync, there is little reason to worry about the cache_bottom, which
is advanced by calling mark_ce_used() in traverse_by_cache_tree().
Where it matters is what happens after the traversal comes back out
of the subtree.  find_cache_pos() uses the bottom pointer so that it
does not have to go back to far to find an index entry that has not
been used to match with the entries from the trees (which are not
sorted exactly the same way as the index, unfortunately), so
forgetting to advance the bottom pointer while correctly marking a
ce as "used" is OK (i.e. hurts performance but not correctness), but
advancing the bottom pointer too much and leaving entries that are
not used behind is *not* OK.  And lack of restoring the bottom in
the new codepath makes me suspect exactly such a bug _after_ the
traversal exits the subtree we are using this new optimization in
and moves on.

Imagine we are iterating over the top-level of the trees, and found
a subtree in them.  There may be some index entries before the first
path in this subtree that are not yet marked as "used", the earliest
of which is pointed at by the cache_bottom pointer.

Before descending into the subtree (and start consuming the entry
with the first in this subtree from the index), we stash away the
current cache_bottom, and then start walking the subtree.  While we
are in that subtree, the cache_bottom starts from the first name in
the subtree and increments, as we _know_ the entry at the old
cache_bottom is outside this subtree and will not match any entry
form the subtree.  Then when the traversal returns, all index
entries within the "subtree/" path will be marked "used".  At that
point, when we continue to scan the top-level of the trees, we need
to restore the cache_bottom, so that we do not forget entries that
we knew we needed to scan eventually, if there was any.