Re: [PATCH 3/6] hash.h: introduce `unsafe_hash_algo()`

Jeff King <peff@xxxxxxxx> · Thu, 21 Nov 2024 04:37:31 -0500

On Wed, Nov 20, 2024 at 02:13:50PM -0500, Taylor Blau wrote:

> +static const struct git_hash_algo sha1_unsafe_algo = {
> +	.name = "sha1",
> +	.format_id = GIT_SHA1_FORMAT_ID,
> +	.rawsz = GIT_SHA1_RAWSZ,
> +	.hexsz = GIT_SHA1_HEXSZ,
> +	.blksz = GIT_SHA1_BLKSZ,
> +	.init_fn = git_hash_sha1_init_unsafe,
> +	.clone_fn = git_hash_sha1_clone_unsafe,
> +	.update_fn = git_hash_sha1_update_unsafe,
> +	.final_fn = git_hash_sha1_final_unsafe,
> +	.final_oid_fn = git_hash_sha1_final_oid_unsafe,
> +	.empty_tree = &empty_tree_oid,
> +	.empty_blob = &empty_blob_oid,
> +	.null_oid = &null_oid_sha1,
> +};

All of the non-function fields here naturally must match the ones in the
parent algo struct, or chaos ensues. That's a little fragile, but it's
not like we're adding new algorithm variants a lot. The biggest risk, I
guess, would be adding a new field to git_hash_algo which defaults to
zero-initialization here. But again, there are only three total and they
are defined near each other here, so I don't think it's a big risk
overall.

I think this struct is a potential landmine for hash_algo_by_ptr():

  static inline int hash_algo_by_ptr(const struct git_hash_algo *p)
  {
          return p - hash_algos;
  }

It's undefined behavior to pass in sha1_unsafe_algo to this function
(but the compiler would not complain since the types are the same). I
don't find it incredibly likely that somebody would want to do that on
an unsafe variant, but I'm not thrilled about leaving that wart for
somebody to find.

If we don't care about the speed of this function, then an
implementation like:

  for (i = 0; i < GIT_HASH_NALGOS; i++) {
	if (p == &hash_algos[i] || p == hash_algos[i]->unsafe)
		return i;
  }
  return GIT_HASH_UNKNOWN;

would work. I'm not sure if that would be measurable. I was surprised at
the number of places that hash_algo_by_ptr() is called. Many low-level
oid functions need it because we store the integer id there rather than
a direct pointer (so oidread(), oidclr(), oid_object_info_extended(),
and so on). But I'd also expect the loop above to be pretty fast. So I
dunno.

>  const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
>  	{
>  		.name = NULL,
> @@ -241,6 +257,7 @@ const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
>  		.unsafe_update_fn = git_hash_sha1_update_unsafe,
>  		.unsafe_final_fn = git_hash_sha1_final_unsafe,
>  		.unsafe_final_oid_fn = git_hash_sha1_final_oid_unsafe,
> +		.unsafe = &sha1_unsafe_algo,
>  		.empty_tree = &empty_tree_oid,
>  		.empty_blob = &empty_blob_oid,
>  		.null_oid = &null_oid_sha1,

OK, and we can leave the sha256 pointer zero-initialized, since the
function handles that at runtime. Good.

-Peff