I was discussing this a bit off-list with Peff (who I hope will join the
thread and share his own thoughts), but I wonder if it was a mistake to
discard your '--full-name-hash' idea (or something similar, which I'll
discuss in a bit below) from earlier.

(Repeating a few things that I am sure are obvious to you out loud so
that I can get a grasp on them for my own understanding):

It seems that the problems you've identified which result in poor repack
performance occur when you have files at the same path, but they get
poorly sorted in the delta selection window due to other paths having
the same final 16 characters, so Git doesn't see that much better delta
opportunities exist.

Your series takes into account the full name when hashing, which seems
to produce a clear win in many cases. I'm sure that there are some cases
where it presents a modest regression in pack sizes, but I think that's
fine and probably par for the course when making any changes like this,
as there is probably no easy silver bullet here that uniformly improves
all cases.

I suspect that you could go even further and intern the full path at
which each object occurs, and sort lexically by that. Just stringing
together all of the paths in linux.git only takes 3.099 MiB on my clone.
(Of course, that's unbounded in the number of objects and length of
their pathnames, but you could at least bound the latter by taking only
the last, say, 128 characters, which would be more than good enough for
the kernel, whose longest path is only 102 characters).

Some of the repositories that you've tested on I don't have easy access
to, so I wonder if either doing (a) that, or (b) using some fancier
context-sensitive hash (like SimHash or MinHash) would be beneficial.

I realize that this is taking us back to an idea you've already
presented to the list, but the benefit and simplicity of that approach
has (to me, at least) only become clear in hindsight when seeing some
alternatives.

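To make the collision described above concrete, here is a rough Python
sketch (not Git's C code). The name_hash() below is modeled on my
understanding of the default name hash and may not match Git bit for
bit. path_key(), its 'limit' parameter, and both example paths are
hypothetical and exist only for illustration.

```python
def name_hash(path: str) -> int:
    """Sketch of a 32-bit hash dominated by the last 16 non-whitespace bytes.

    Assumption: modeled on Git's default pack name hash, not copied from it.
    """
    h = 0
    for c in path.encode("utf-8"):
        if c in b" \t\n\r":
            continue
        # Each new byte lands in the top bits; older bytes are shifted
        # toward the low bits and eventually fall off the end.
        h = ((h >> 2) + (c << 24)) & 0xFFFFFFFF
    return h


def path_key(path: str, limit: int = 128) -> str:
    """Hypothetical alternative sort key: the last `limit` characters of the path."""
    return path[-limit:]


# Hypothetical paths: unrelated files whose names share a long common suffix.
A = "drivers/net/alpha/very_long_config_name.h"
B = "sound/soc/beta/very_long_config_name.h"

# Three revisions of each file, arriving interleaved.
objects = [(p, rev) for rev in range(3) for p in (A, B)]

# Sorting by the hash cannot separate A from B (ties keep input order here,
# which is an assumption of this sketch, not Git's real tie-breaking).
by_hash = sorted(objects, key=lambda o: name_hash(o[0]))

# Sorting lexically by the bounded full path groups each file's revisions.
by_path = sorted(objects, key=lambda o: path_key(o[0]))

for label, order in (("hash", by_hash), ("path", by_path)):
    print(label, [("A" if p == A else "B") + str(rev) for p, rev in order])
```

Both paths end in the same 24 characters, so this hash cannot tell them
apart and the revisions of the two files stay interleaved. Sorting by
the bounded full path keeps each file's revisions adjacent, which is
where the better delta candidates are.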
I would like to apologize for the time you spent reworking this series
back and forth to have the response be "maybe we should have just done
the first thing you suggested". Like I said, it was really only clear to
me in hindsight.

In any event, the major benefit to doing --full-name-hash would be that
*all* environments could benefit from the size reduction, not just those
that don't rely on certain other features.

Perhaps just --full-name-hash isn't quite as good by itself as the
--path-walk implementation that this series starts us off implementing.
So in that sense, maybe we want both, which I understand was the
original approach.

I see a couple of options here:

  - We take both, because doing --path-walk on top represents a
    significant enough improvement that we are collectively OK with
    taking on more code to improve a more narrow (but common) use-case.

  - Or we decide that the benefit isn't significant enough to warrant an
    additional and relatively complex implementation; in other words,
    that --full-name-hash by itself is good enough.

Again, I apologize for not having a clearer picture of this all to start
with, and I want to tell you specifically and sincerely that I
appreciate your patience as I wrap my head around all of this. I think
the benefit of --full-name-hash is much clearer and more appealing to me
now, having had both more time and seeing the series approached in a
couple of different ways.

Let me know what you think.

Thanks,
Taylor