Re: [PATCH 0/7] pack-objects: Create an alternative name hash algorithm (recreated)

Junio C Hamano <gitster@xxxxxxxxx> · Fri, 22 Nov 2024 12:01:27 +0900

Jonathan Tan <jonathantanmy@xxxxxxxxxx> writes:

> +       while ((c = (uint8_t) *name++) != 0) {
> +               if (isspace(c))
> +                       continue;
> +               if (c == '/') {
> +                       base = (base >> 6) ^ hash;
> +                       hash = 0;
> +               } else {
> +                       uint8_t nybble_swapped = (c >> 4) + ((c & 15) << 4);
> +                       hash = (hash >> 2) + (nybble_swapped << 24);
> +               }
>         }
> +       return (base >> 6) ^ hash;
>  }

Nice.  The diff relative to the --full-name-hash version is a bit
hard to grok, but compared to the current hash function, there are
two and a half changes that matter:

 (0) it is more careful with bytes with the MSB set (i.e. non-ASCII
     pathnames).

 (1) it hashes each path component separetely and rotates the whole
     thing only at a directory boundary.  I'd imagine that this
     would make a big difference for languages that force overly
     long filenames at each level.

 (2) it gives more weight to lower bits by swapping nybbles of each
     byte.

I wonder if we do even better if we reverse all 8 bits instead of
swapping nybbles (if we were to do so, it might be more efficient to
shift in from the right instead of left end of the base and hash
accumulators in the loop and then swap the whole resulting word at
the end).

Thanks for a fun read.