Re: history damage in linux.git

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Thu, 21 Apr 2016 10:43:17 -0700

On Thu, Apr 21, 2016 at 10:23 AM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
>
> I think avoiding side branches to describe with the weight is a
> right thing to do, i.e. if you have this history:
>
>     X---o---o---o---o---v4.6
>      \             /
>       o-----------o
>
> you do not want to explain X as "v4.6~^2~2", and instead you want it
> as "v4.6~5", even though the former is 4 hops while the latter is 5
> hops (which is longer).

Yes. I have a new suggestion: make the "merge weight" depend on how
complex the name ends up being.

And that algorithm actually gives a completely new and improved path:

   aed06b9 tags/v3.13-rc2~32^2^2~47

which is still complex, but is actually the best one I've found so far.

To compare against the previous ones I looked at:

   v4.6-rc1~9^2~792    <- current git code
   v3.13-rc2~32^2^2~47     <- new with attached patch
   v3.13-rc7~9^2~14^2~42     <- merge weight 500
   v3.13~5^2~4^2~2^2~1^2~42   <- merge weight 1

that new one is actually the second-most dense, and uses the oldest
tag. So it's a good combination of denseness, but still gets the best
actual base tag.

The logic of the patch is that there are different "complexities" in
the naming, and it's not just whether you are following a second
parent, it's also if you're doing generational hops.

So when you do a first-parent change, the name stays simple (the last
number changes), and that means that the "distance" weight is low (so
that's the normal "+1" we do for first-parent.

But if it's not a first parent, there are two different cases:

 - generation > 0: we add "~%d^%d", and we approximate that with "four
characters", and use a cost that is four orders of magnitude higher
(so 10000).

 - generation == 0: we add just "^%d", so generally just two
characters. So use a cost that is just two orders of magnitude higher
(so 100).

In other words, I'm trying to convince people that my patch not only
gives a good result, but that the "weight numbers" I use make some
kind of conceptual sense from a complexity cost angle.

With that, the patch itself is attached.

I think it's better than what "git name-rev" does now. Is it optimal?
No, I think the *optimal* would use some kind of "does one tag contain
the other" logic, and discarding all base names that are just
supersets of another base that still reaches the target.

But this patch is small and simple, and has some excuses for its
behavior. What do people think?

                 Linus
 builtin/name-rev.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/builtin/name-rev.c b/builtin/name-rev.c
index 092e03c3cc9b..0354c8d222e1 100644
--- a/builtin/name-rev.c
+++ b/builtin/name-rev.c
@@ -16,9 +16,6 @@ typedef struct rev_name {
 
 static long cutoff = LONG_MAX;
 
-/* How many generations are maximally preferred over _one_ merge traversal? */
-#define MERGE_TRAVERSAL_WEIGHT 65535
-
 static void name_rev(struct commit *commit,
 		const char *tip_name, int generation, int distance,
 		int deref)
@@ -55,19 +52,26 @@ copy_data:
 			parents;
 			parents = parents->next, parent_number++) {
 		if (parent_number > 1) {
+			int weight;
 			size_t len;
 			char *new_name;
 
 			strip_suffix(tip_name, "^0", &len);
-			if (generation > 0)
+
+			// The extra merge traversal "weight" depends
+			// on how complex the resulting name is.
+			if (generation > 0) {
+				weight = 10000;
 				new_name = xstrfmt("%.*s~%d^%d", (int)len, tip_name,
 						   generation, parent_number);
-			else
+			} else {
+				weight = 100;
 				new_name = xstrfmt("%.*s^%d", (int)len, tip_name,
 						   parent_number);
+			}
 
 			name_rev(parents->item, new_name, 0,
-				distance + MERGE_TRAVERSAL_WEIGHT, 0);
+				distance + weight, 0);
 		} else {
 			name_rev(parents->item, tip_name, generation + 1,
 				distance + 1, 0);