Re: [PATCH v3 0/8] hash: introduce unsafe_hash_algo(), drop unsafe_ variants

Jeff King <peff@xxxxxxxx> · Sat, 18 Jan 2025 07:43:43 -0500

On Sat, Jan 18, 2025 at 07:28:31AM -0500, Jeff King wrote:

> It does make me wonder if hash_algo_by_ptr() should just return UNKNOWN
> for the unsafe variant. Then we'd at least get a loud error from the
> caller (as opposed to the code before your patch, which is undefined
> behavior). I dunno.

Doing this:

diff --git a/hash.h b/hash.h
index 68d4292e6d..ad2c919991 100644
--- a/hash.h
+++ b/hash.h
@@ -311,7 +311,7 @@ static inline int hash_algo_by_ptr(const struct git_hash_algo *p)
 	size_t i;
 	for (i = 0; i < GIT_HASH_NALGOS; i++) {
 		const struct git_hash_algo *algop = &hash_algos[i];
-		if (p == algop || (algop->unsafe && p == algop->unsafe))
+		if (p == algop)
 			return i;
 	}
 	return GIT_HASH_UNKNOWN;

on top of your series doesn't trigger any issues in the test suite.
Which I think shows that we would never have triggered the UB from the
original implementation, either[1]. Still, I think I prefer your loop to
being one errant function call away from undefined behavior. But maybe
returning UNKNOWN (as above) is a safer bet than losing the inverse
nature of the by_ptr() and by_id() functions, which could otherwise
cause hard-to-find interactions?

-Peff

[1] One thing that still puzzles me. If you further add a BUG() to that
    function when we'd return UNKNOWN, you can see that we trigger a few
    cases in the test suite where we pass in NULL (e.g., because we're
    not in a repo). I think these are already UB with the existing
    "p - hash_algos" implementation! We'll return what is effectively
    "p" cast to an int.

    E.g. if I do this:

    diff --git a/hash.h b/hash.h
    index 756166ce5e..ff31156416 100644
    --- a/hash.h
    +++ b/hash.h
    @@ -320,6 +320,8 @@ int hash_algo_by_length(int len);
     /* Identical, except for a pointer to struct git_hash_algo. */
     static inline int hash_algo_by_ptr(const struct git_hash_algo *p)
     {
    +	if (!p)
    +		BUG("null hash, return is %d", (int)(p - hash_algos));
     	return p - hash_algos;
     }

     then t1517 shows grep dying with:

       BUG: hash.h:324: null hash, return is -1646754892

     I think it "works" because the backtrace is:

      [...]
      #5  0x0000556d436f6b71 in BUG_fl (file=0x556d43790e8b "hash.h", line=324,
          fmt=0x556d43790e73 "null hash, return is %d") at usage.c:335
      #6  0x0000556d4357c2f9 in hash_algo_by_ptr (p=0x0) at /home/peff/compile/git/hash.h:324
      #7  0x0000556d4357c437 in oidclr (oid=0x7ffdf5810ea4, algop=0x0) at /home/peff/compile/git/hash.h:392
      #8  0x0000556d435801c7 in prep_exclude (dir=0x7ffdf5811190, istate=0x556d608959c0, base=0x556d6089da50 "nums",
          baselen=0) at dir.c:1699
      [...]
      #16 0x0000556d4344dd4a in cmd_grep (argc=0, argv=0x556d60895ee8, prefix=0x0, repo=0x0) at builtin/grep.c:1257

     So we probably write a totally garbage algo field into the
     object_id, but nobody ever ends up looking at it. Probably
     something we should clean up, but way out of scope for your series.
     But I do think it reinforces that returning UNKNOWN is an
     improvement; it avoids undefined behavior and anybody who tried to
     use it should get a BUG() from calling git_hash_unknown_*()
     functions.