Re: [PATCH v2 0/6] pack-objects: create new name-hash algorithm

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 9/18/24 4:46 PM, Derrick Stolee via GitGitGadget wrote:
...
> Other things that have happened include investigations into ways to adapt
> the full-name hash to improve upon the name-hash. I did some experimenting
> with increasing the size of 'struct object_entry' by using a 64-bit hash
> value (name-hash, then full-name-hash) for a single-pass compression or two
> 32-bit hash values for a two-pass compression process. I include my WIP
> branch at [3] to show what I tried, though the single-pass option did not
> present any improvements and the two-pass option seems to be broken to the
> point that the compression is substantially worse. (I'll try to elaborate on
> this in a reply to this cover letter.)
>
> [3] https://github.com/derrickstolee/git/compare/full-name...derrickstolee:git:full-name-wip

To break down what I attempted in [3], let me break down a few things.

First, I tried using a 64-bit hash value [1]. This used the standard name-hash
as the most-significant digits and the full-name-hash as the least-significant
digits. The goal here was to still have locality from the name-hash but get a
good partition based on full-name-hash within those collisions.

However, when sorting this way, the boundaries of the full-name-hash partitions
are ineffective at getting good delta bases because the largest object from one
full-name-hash set is next to the smallest object from the next full-name-hash
set. Even when a full-name-hash set has size one, it is sorted roughly randomly
among the other colliding path names instead of grouped nicely with objects of
a similar size. This makes the results nearly identical to the 32-bit
full-name-hash implementation.

[1] https://github.com/derrickstolee/git/commit/aaa6befa3016667ea5eb10fdd6aa2b7fcec3a52e

Second, I tried storing two 32-bit hashes and doing a two-pass delta search [2].
In theory, this should be very similar to the --path-walk feature from the RFC.
However, I failed to make it work. Something about this version of a two-pass
walk was hitting some strange behavior. For example, I had to put in this extra
condition [4] if a best delta base was not found, or else we could get a
segfault.

[2] https://github.com/derrickstolee/git/commit/bf71271040ab93a624a8cdf5bc8aaff68e9b1b17

[4] https://github.com/derrickstolee/git/commit/fedc4fc543e50563f4748a5ffc45b51b530023e0

In fact, the results were not just _bad_ but they were _significantly worse_.

It took me a long time to report these details because they just didn't make
sense and I couldn't figure out what was going wrong. I'd be very grateful to
anyone who could explore these WIP commits and point out what I'm doing wrong
so I can learn and maybe we can get a boost to the feature.

Even if we had strong data from these examples, I'm not sure that we'd want
to add four bytes per object to the packing data, especially in a way that
impacts users that aren't even using the new feature. We would want to
explore options that use some kind of hashtable to map objects to their
64-bit hash values, perhaps. It also affects the .bitmap file format, which
would need modification even for a new 32-bit hash function (though one of
the same size could be used by adding an extension saying "I'm using hash
function v2" and leave the rest of the structure the same).

I would also like to test the performance against the threaded version of the
--path-walk feature, which I recently got working in my prototype branch [5].

[5] https://github.com/derrickstolee/git/pull/28/commits/a9fc233390ae00e3d4b156be64d6b3974e30d8a1

Thanks,
-Stolee




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux