On Mon, Nov 04, 2024 at 10:48:49AM -0500, Derrick Stolee wrote: > > I was discussing this a bit off-list with Peff (who I hope will join the > > thread and share his own thoughts), but I wonder if it was a mistake to > > discard your '--full-name-hash' idea (or something similar, which I'll > > discuss in a bit below) from earlier. > > I'd be happy to resurrect that series, adding in the learnings from > working on the path-walk feature. It helps that the current series adds > the path-walk API and has no conflicting changes in the pack-objects or > repack builtins. (I can handle those conflicts as things merge down.) Adding my two cents, the discussion we had came after reading this post: https://www.jonathancreamer.com/how-we-shrunk-our-git-repo-size-by-94-percent/ I think a few of the low-level details in there are confusing, but it seemed to me that most of the improvement he mentions is just about finding better delta candidates. And it seems obvious that our current pack_name_hash() is pretty rudimentary as context-sensitive hashing goes and won't do well for long paths with similar endings. So just swapping that out for something better seems like an easy thing to do regardless of whether we pursue --path-walk. It doesn't drastically change how we choose delta pairs so it's not much code and it shouldn't conflict with other features. And the risk of making anything worse should be pretty low. I wouldn't at all be surprised if --path-walk can do better, but if we do the easy thing first then I think it gives us a better idea of the cost/benefit it's providing. I suspect there's room for both in the long run. You seem to be focused on push size and cost, whereas I think Taylor and I are more interested in overall repo size and cost of serving bitmapped fetches. > When the optimization idea is to focus on the full path and not care > about the "locality" of the path name by its later bits, storing the > full name in a list and storing an index into that list would have a > very similar effect. > > I'd be interested to explore the idea of storing the full path name. > Based on my exploration with the 'test-tool name-hash' test helper in > that series, I'm not sure that we will make significant gains by doing > so. Worth trying. The way I look at it is a possible continuum. We want to use pathnames as a way to sort delta candidates near each other, since we expect them to have high locality with respect to delta-able contents. The current name_hash uses a very small bit of that path information and throws away most of it. The other extreme end is holding the whole path. We may want to end up in the middle for two reasons: 1. Dealing with whole paths might be costly (though I'm not yet convinced of that; outside of pathological cases, the number of paths in a repo tends to pale in comparison to the number of objects, and the per-object costs dominate during repacking). 2. It's possible that over-emphasizing the path might be a slightly worse heuristic (and I think this is a potential danger of --path-walk, too). We still want to find candidate pairs that were copied or renamed, for example, or that substantially share content found in different parts of the tree. So it would be interesting to be able to see the performance of various points on that line, from full path down to partial paths down to longer hashes down to the current hash. The true extreme end of course is no path info at all, but I think we know that sucks; that's why we implemented the bitmap name-hash extension in the first place. > I don't know too much about SimHash or MinHash, but based on what I > could gather from some initial reading I'm not sure that they would be > effective without increasing the hash length. We'd also get a different > kind of locality, such as the appearance of a common word would be more > likely to affect the locality than the end of the path. Good point. This is all heuristics, of course, but I suspect that the order of the path is important, and that foo/bar.c and bar/foo.c are unlikely to be good matches. > > I realize that this is taking us back to an idea you've already > > presented to the list, but I think (to me, at least) the benefit and > > simplicity of that approach has only become clear to me in hindsight > > when seeing some alternatives. I would like to apologize for the time > > you spent reworking this series back and forth to have the response be > > "maybe we should have just done the first thing you suggested". Like I > > said, I think to me it was really only clear in hindsight. > > I always assumed that we'd come back to it eventually. There is also the > extra bit about making the change to the name-hash compatible with the > way name-hashes are stored in the reachability bitmaps. That will need > some work before it is ready for prime time. Having worked on that feature of bitmaps, I'm not too worried about it. I think we'd just need to: - introduce a new bitmap ext with a flag (HASH_CACHE_V2 or something, either with the new hash, or with a "version" byte at the start for extensibility). - when bitmaps are not in use, we're free to use whichever hash we want internally. If the new hash is consistently better, we'd probably just enable it by default. - when packing using on-disk bitmaps, use internally whichever format the on-disk file provided. Technically the format could even provide both (in which case we'd prefer the new hash), but I don't see much point. - when writing bitmaps, use whichever hash the command-line options asked for. There's an off chance somebody might want to generate a .bitmap file whose hashes will be understood by an older version of git, in which case they'd use --no-full-name-hash or whatever while repacking. If we're considering full paths, then that is potentially a bit more involved, just because we'd want the format to avoid repeating duplicate paths for each object (plus they're no longer fixed-size). So probably an extension with packed NUL-terminated path strings, plus a set of fixed-length offsets into that block, one per object. > I disagree that all environments will prefer the --full-name-hash. I'm > currently repeating the performance tests right now, and I've added one. > The issues are: > > 1. The --full-name-hash approach sometimes leads to a larger pack when > using "git push" on the client, especially when the name-hash is > already effective for compressing across paths. That's interesting. I wonder which cases get worse, and if a larger window size might help. I.e., presumably we are pushing the candidates further away in the sorted delta list. > 2. A depth 1 shallow clone cannot use previous versions of a path, so > those situations will want to use the normal name hash. This can be > accomplished simply by disabling the --full-name-hash option when > the --shallow option is present; a more detailed version could be > used to check for a large depth before disabling it. This case also > disables bitmaps, so that isn't something to worry about. I'm not sure why a larger hash would be worse in a shallow clone. As you note, with only one version of each path the name-similarity heuristic is not likely to buy you much. But I'd have thought that would be true for the existing name hash as well as a longer one. Maybe this is the "over-emphasizing" case. -Peff