Re: [PATCH 0/7] pack-objects: Create an alternative name hash algorithm (recreated)

Derrick Stolee <stolee@xxxxxxxxx> · Fri, 22 Nov 2024 10:27:19 -0500

On 11/21/24 10:01 PM, Junio C Hamano wrote:
Jonathan Tan <jonathantanmy@xxxxxxxxxx> writes:

+       while ((c = (uint8_t) *name++) != 0) {
+               if (isspace(c))
+                       continue;
+               if (c == '/') {
+                       base = (base >> 6) ^ hash;
+                       hash = 0;
+               } else {
+                       uint8_t nybble_swapped = (c >> 4) + ((c & 15) << 4);
+                       hash = (hash >> 2) + (nybble_swapped << 24);
+               }
         }
+       return (base >> 6) ^ hash;
  }

Nice.  The diff relative to the --full-name-hash version is a bit
hard to grok, but compared to the current hash function, there are
two and a half changes that matter:

  (0) it is more careful with bytes with the MSB set (i.e. non-ASCII
      pathnames).

  (1) it hashes each path component separetely and rotates the whole
      thing only at a directory boundary.  I'd imagine that this
      would make a big difference for languages that force overly
      long filenames at each level.

I was confused by the "rotates the whole thing only at a directory
boundary" statement. I think one way to say what you mean is

  Each path component is hashed similarly to the standard name-hash,
  and parent path component hashes are contributed via XOR after a
  down-shift of 6 bits per level.

So we are getting something like

	[ name-hash for level 0           ]
        ......[ name-hash for level 1     ](truncated by 6)
 	............[name-hash for level 2](truncated by 12)
 	..................[...for level 3 ](truncated by 18)
 	........................[ level 4 ](truncated by 24)
 	..............................[ 5 ](truncated by 30)

and at each layer we get the "last 16 bytes matter" issue, though it
is balanced quite well. Also, the name-hash in each layer is adjusted
for nybble swaps.

(I don't think my explanation is _better_ but just that it matches my
personal mental model slightly better.)

  (2) it gives more weight to lower bits by swapping nybbles of each
      byte.

I wonder if we do even better if we reverse all 8 bits instead of
swapping nybbles (if we were to do so, it might be more efficient to
shift in from the right instead of left end of the base and hash
accumulators in the loop and then swap the whole resulting word at
the end).
I will give this a try in my private repos as well as with the name-hash
collision perf test from patch 7.

Thanks,
-Stolee