[no subject]

I was discussing this a bit off-list with Peff (who I hope will join the
thread and share his own thoughts), but I wonder if it was a mistake to
discard your '--full-name-hash' idea (or something similar, which I'll
discuss in a bit below) from earlier.

(Repeating a few things that I am sure are obvious to you out loud so
that I can get a grasp on them for my own understanding):

It seems that the poor repack performance you've identified occurs when
there are many versions of a file at the same path, but they get sorted
poorly in the delta selection window because other paths share the same
final 16 characters (which is effectively all that the name hash looks
at), so Git never sees that much better delta opportunities exist.
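
The effect can be sketched in a few lines of Python. This is a rough,
from-memory port of the loop in pack_name_hash(), so treat the details
as an assumption rather than a faithful copy. The package paths and
sizes are made up for illustration, and the sort key (hash, then larger
objects first) is a simplification of what pack-objects really compares.

```python
def name_hash(path: str) -> int:
    """Approximate port of Git's pack_name_hash(): later characters weigh most."""
    h = 0
    for byte in path.encode("utf-8"):
        if byte in b" \t\n\v\f\r":  # skip whitespace, like C isspace()
            continue
        # Emulate uint32_t arithmetic. Each earlier character loses two bits
        # per step, so only about the last 16 characters still contribute.
        h = ((h >> 2) + (byte << 24)) & 0xFFFFFFFF
    return h


# Hypothetical (path, size) pairs: two versions each of two different files.
versions = [
    ("packages/alpha/CHANGELOG.json", 900),
    ("packages/alpha/CHANGELOG.json", 700),
    ("packages/beta/CHANGELOG.json", 800),
    ("packages/beta/CHANGELOG.json", 600),
]

# Simplified sort: name hash first, then larger objects first.
by_hash = sorted(versions, key=lambda v: (name_hash(v[0]), -v[1]))

for path, size in by_hash:
    print(f"{name_hash(path):08x} {size:4d} {path}")
```

Both paths end in "a/CHANGELOG.json", so they get the same hash, and the
versions of the two files come out interleaved by size instead of
sitting next to each other in the window.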

Your series takes into account the full name when hashing, which seems
to produce a clear win in many cases. I'm sure that there are some cases
where it presents a modest regression in pack sizes, but I think that's
fine and probably par for the course when making any changes like this,
as there is probably no easy silver bullet here that uniformly improves
all cases.

I suspect that you could go even further and intern the full path at
which each object occurs, and sort lexically by that. Just stringing
together all of the paths in linux.git only takes 3.099 MiB on my clone.
(Of course, that's unbounded in the number of objects and length of
their pathnames, but you could at least bound the latter by taking only
the last, say, 128 characters, which would be more than good enough for
the kernel, whose longest path is only 102 characters).
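
A minimal Python sketch of sorting by a bounded suffix of the full path
follows. MAX_KEY_LEN, path_key(), interned_size() and the sample data
are hypothetical names chosen for illustration, and the size tie-break
is an assumption, not a description of what pack-objects does today.

```python
MAX_KEY_LEN = 128  # assumed bound on the stored path suffix


def path_key(path: str) -> str:
    """Sort key: the last MAX_KEY_LEN characters of the path."""
    return path[-MAX_KEY_LEN:]


def interned_size(paths) -> int:
    """Bytes needed to store each distinct (truncated) path once."""
    return sum(len(key.encode("utf-8")) for key in {path_key(p) for p in paths})


# Hypothetical (path, size) pairs: two versions each of two different files.
versions = [
    ("packages/beta/CHANGELOG.json", 800),
    ("packages/alpha/CHANGELOG.json", 900),
    ("packages/beta/CHANGELOG.json", 600),
    ("packages/alpha/CHANGELOG.json", 700),
]

# Lexical sort on the path key, then larger objects first.
by_path = sorted(versions, key=lambda v: (path_key(v[0]), -v[1]))

for path, size in by_path:
    print(f"{size:4d} {path}")
print("interned bytes:", interned_size(p for p, _ in versions))
```

With this key, all versions of one path are adjacent, so each object's
nearest neighbors in the window are other versions of the same file.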

I don't have easy access to some of the repositories that you've tested
on, so I wonder whether either (a) doing that, or (b) using some fancier
locality-sensitive hash (like SimHash or MinHash) would be beneficial
there.

I realize that this is taking us back to an idea you've already
presented to the list, but (to me, at least) the benefit and simplicity
of that approach have only become clear in hindsight, after seeing some
alternatives. I would like to apologize for the time you spent reworking
this series back and forth, only to have the response be "maybe we
should have just done the first thing you suggested". Like I said, it
was really only clear to me in hindsight.

In any event, the major benefit to doing --full-name-hash would be that
*all* environments could benefit from the size reduction, not just those
that don't rely on certain other features.

Perhaps just --full-name-hash isn't quite as good by itself as the
--path-walk implementation that this series starts us off implementing.
So in that sense, maybe we want both, which I understand was the
original approach. I see a couple of options here:

  - We take both, because doing --path-walk on top represents a
    significant enough improvement that we are collectively OK with
    taking on more code to improve a more narrow (but common) use-case.

  - Or we decide that the benefit isn't significant enough to warrant
    an additional and relatively complex implementation, or in other
    words that --full-name-hash by itself is good enough.

Again, I apologize for not having a clearer picture of this all to start
with, and I want to tell you specifically and sincerely that I
appreciate your patience as I wrap my head around all of this. I think
the benefit of --full-name-hash is much clearer and more appealing to
me now, having had more time and having seen the series approached in a
couple of different ways. Let me know what you think.

Thanks,
Taylor