On Sat, Jan 18, 2025 at 07:28:31AM -0500, Jeff King wrote: > It does make me wonder if hash_algo_by_ptr() should just return UNKNOWN > for the unsafe variant. Then we'd at least get a loud error from the > caller (as opposed to the code before your patch, which is undefined > behavior). I dunno. Doing this: diff --git a/hash.h b/hash.h index 68d4292e6d..ad2c919991 100644 --- a/hash.h +++ b/hash.h @@ -311,7 +311,7 @@ static inline int hash_algo_by_ptr(const struct git_hash_algo *p) size_t i; for (i = 0; i < GIT_HASH_NALGOS; i++) { const struct git_hash_algo *algop = &hash_algos[i]; - if (p == algop || (algop->unsafe && p == algop->unsafe)) + if (p == algop) return i; } return GIT_HASH_UNKNOWN; on top of your series doesn't trigger any issues in the test suite. Which I think shows that we would never have triggered the UB from the original implementation, either[1]. Still, I think I prefer your loop to being one errant function call away from undefined behavior. But maybe returning UNKNOWN (as above) is a safer bet than losing the inverse nature of the by_ptr() and by_id() functions, which could otherwise cause hard-to-find interactions? -Peff [1] One thing that still puzzles me. If you further add a BUG() to that function when we'd return UNKNOWN, you can see that we trigger a few cases in the test suite where we pass in NULL (e.g., because we're not in a repo). I think these are already UB with the existing "p - hash_algos" implementation! We'll return what is effectively "p" cast to an int. E.g. if I do this: diff --git a/hash.h b/hash.h index 756166ce5e..ff31156416 100644 --- a/hash.h +++ b/hash.h @@ -320,6 +320,8 @@ int hash_algo_by_length(int len); /* Identical, except for a pointer to struct git_hash_algo. */ static inline int hash_algo_by_ptr(const struct git_hash_algo *p) { + if (!p) + BUG("null hash, return is %d", (int)(p - hash_algos)); return p - hash_algos; } then t1517 shows grep dying with: BUG: hash.h:324: null hash, return is -1646754892 I think it "works" because the backtrace is: [...] #5 0x0000556d436f6b71 in BUG_fl (file=0x556d43790e8b "hash.h", line=324, fmt=0x556d43790e73 "null hash, return is %d") at usage.c:335 #6 0x0000556d4357c2f9 in hash_algo_by_ptr (p=0x0) at /home/peff/compile/git/hash.h:324 #7 0x0000556d4357c437 in oidclr (oid=0x7ffdf5810ea4, algop=0x0) at /home/peff/compile/git/hash.h:392 #8 0x0000556d435801c7 in prep_exclude (dir=0x7ffdf5811190, istate=0x556d608959c0, base=0x556d6089da50 "nums", baselen=0) at dir.c:1699 [...] #16 0x0000556d4344dd4a in cmd_grep (argc=0, argv=0x556d60895ee8, prefix=0x0, repo=0x0) at builtin/grep.c:1257 So we probably write a totally garbage algo field into the object_id, but nobody ever ends up looking at it. Probably something we should clean up, but way out of scope for your series. But I do think it reinforces that returning UNKNOWN is an improvement; it avoids undefined behavior and anybody who tried to use it should get a BUG() from calling git_hash_unknown_*() functions.