Re: [PATCH 0/4] pack-objects: create new name-hash algorithm

Jeff King <peff@xxxxxxxx> · Wed, 11 Sep 2024 02:32:03 -0400

On Tue, Sep 10, 2024 at 05:05:09PM -0400, Derrick Stolee wrote:

> > I'm not sure which value you'd actually record in the pack, though.
> > Ideally there is a hash function which captures some information about
> > the full path as well as the final path component, so we could use a
> > single value here, though I suspect the implementation would be more
> > complicated than what is presented here.
> 
> Is the name hash stored in the pack itself? I know that it is stored
> in the 'struct object_entry' data in the packing data. While we could
> add another uint32_t into that struct to store both hash values, this
> would increase the memory requirements of repacking by four bytes per
> object. The struct seemed to be very clear about trying as hard as
> possible to avoid doing that.

It's stored in the .bitmap files, since otherwise a pack-objects which
uses bitmaps to serve a fetch would have no clue of their path names.
See the "HASH_CACHE" bitmap extension.

You generally don't want to make deltas out of two entries in the bitmap
(they're already in the same pack, so we'd usually skip them), but you
do want to consider making on-the-fly deltas against other objects.

I guess we may also consider deltas between objects in two packs that
are both covered by the same midx bitmap.

> But maybe an alternative could be replacing that 32-bit number with
> an index into an array of paths that have their hash values stored
> there.

Yes, that would work, though how big is that path array going to be?
Uncompressed linux.git is probably 3-4MB, which actually doesn't sound
_too_ bad. You could obviously go a long way with prefix compression,
too.

But if I understand the proposal, it is just replacing one 32-bit hash
with another. You could just store that in the bitmap instead (or if the
direction is to use both, introduce a new extension to store both).
Obviously you'll get lousy results if the bitmap reader does not use the
same algorithm for its non-bitmap objects, but I don't think this is
something you'd be flipping back and forth on.

> This is part of the reason why I think the --full-name-hash option is
> an interesting consideration. It doesn't have any obvious reason why
> it couldn't work with features like delta islands, so it may provide
> some quick wins in "large enough" repositories, or at least "large in
> the right way".
> 
> I unfortunately don't know enough about how the delta islands feature
> works to be confident in the possibility of integrating it with the
> --path-walk option. At minimum, it would require two object walks:
> the first would mark the objects and the second would do the delta
> compression with those markings in mind.

The delta islands code already does its own tree walk to propagate the
bits down (it does rely on the base walk's show_commit() to propagate
through the commits).

Once each object has its island bitmaps, I think however you choose to
come up with delta candidates (whether the current type/size/namehash
sorted list, or some path walking), you should be able to use it. It's
fundamentally just answering the question of "am I allowed to delta
between these two objects".

Of course the devil may be in the details. ;)

-Peff