Re: [PATCH 3/6] hash.h: introduce `unsafe_hash_algo()`

"brian m. carlson" <sandals@xxxxxxxxxxxxxxxxxxxx> · Fri, 22 Nov 2024 00:39:30 +0000

On 2024-11-21 at 09:37:31, Jeff King wrote:
> On Wed, Nov 20, 2024 at 02:13:50PM -0500, Taylor Blau wrote:
> 
> > +static const struct git_hash_algo sha1_unsafe_algo = {
> > +	.name = "sha1",
> > +	.format_id = GIT_SHA1_FORMAT_ID,
> > +	.rawsz = GIT_SHA1_RAWSZ,
> > +	.hexsz = GIT_SHA1_HEXSZ,
> > +	.blksz = GIT_SHA1_BLKSZ,
> > +	.init_fn = git_hash_sha1_init_unsafe,
> > +	.clone_fn = git_hash_sha1_clone_unsafe,
> > +	.update_fn = git_hash_sha1_update_unsafe,
> > +	.final_fn = git_hash_sha1_final_unsafe,
> > +	.final_oid_fn = git_hash_sha1_final_oid_unsafe,
> > +	.empty_tree = &empty_tree_oid,
> > +	.empty_blob = &empty_blob_oid,
> > +	.null_oid = &null_oid_sha1,
> > +};
> 
> All of the non-function fields here naturally must match the ones in the
> parent algo struct, or chaos ensues. That's a little fragile, but it's
> not like we're adding new algorithm variants a lot. The biggest risk, I
> guess, would be adding a new field to git_hash_algo which defaults to
> zero-initialization here. But again, there are only three total and they
> are defined near each other here, so I don't think it's a big risk
> overall.
> 
> I think this struct is a potential landmine for hash_algo_by_ptr():
> 
>   static inline int hash_algo_by_ptr(const struct git_hash_algo *p)
>   {
>           return p - hash_algos;
>   }
> 
> It's undefined behavior to pass in sha1_unsafe_algo to this function
> (but the compiler would not complain since the types are the same). I
> don't find it incredibly likely that somebody would want to do that on
> an unsafe variant, but I'm not thrilled about leaving that wart for
> somebody to find.
> 
> If we don't care about the speed of this function, then an
> implementation like:
> 
>   for (i = 0; i < GIT_HASH_NALGOS; i++) {
> 	if (p == &hash_algos[i] || p == hash_algos[i]->unsafe)
> 		return i;
>   }
>   return GIT_HASH_UNKNOWN;
> 
> would work. I'm not sure if that would be measurable. I was surprised at
> the number of places that hash_algo_by_ptr() is called. Many low-level
> oid functions need it because we store the integer id there rather than
> a direct pointer (so oidread(), oidclr(), oid_object_info_extended(),
> and so on). But I'd also expect the loop above to be pretty fast. So I
> dunno.

Yeah, I'm also a little nervous about this change with hash_algo_by_ptr.
I think we had discussed in the other thread about doing something like
this:

struct git_hash_algo_fns {
	/* The hash initialization function. */
	git_hash_init_fn init_fn;

	/* The hash context cloning function. */
	git_hash_clone_fn clone_fn;

	/* The hash update function. */
	git_hash_update_fn update_fn;

	/* The hash finalization function. */
	git_hash_final_fn final_fn;

	/* The hash finalization function for object IDs. */
	git_hash_final_oid_fn final_oid_fn;
};

and then doing this:

struct git_hash_algo {
	[...]

  struct git_hash_algo_fns fns;
  struct git_hash_algo_fns unsafe_fns;

  [...]
};

That would mean that we'd have to deal with two pointers in your first
patch, but I don't really think that's too terrible.  And, since this
approach doesn't result in an extra struct git_hash_algo, it's
impossible to misuse hash_algo_by_ptr.

I'm not a hard no on the current approach, but I think the above would
probably be a little safer and harder to misuse.  It would require extra
patches to convert existing callers, though, but they aren't hugely
numerous, and the conversion could probably be mostly automated.
-- 
brian m. carlson (they/them or he/him)
Toronto, Ontario, CA
Attachment:
signature.asc

Description: PGP signature