Re: [PATCH 2/4] hash.h: scaffolding for _fast hashing variants

Patrick Steinhardt <ps@xxxxxx> · Mon, 2 Sep 2024 15:41:24 +0200

On Sun, Sep 01, 2024 at 12:03:24PM -0400, Taylor Blau wrote:
> Git's default SHA-1 implementation is collision-detecting, which hardens
> us against known SHA-1 attacks against Git objects. This makes Git
> object writes safer at the expense of some speed when hashing through
> the collision-detecting implementation, which is slower than
> non-collision detecting alternatives.
> 
> Prepare for loading a separate "fast" SHA-1 implementation that can be
> used for non-cryptographic purposes, like computing the checksum of
> files that use the hashwrite() API.
> 
> This commit does not actually introduce any new compile-time knobs to
> control which implementation is used as the fast SHA-1 variant, but does
> add scaffolding so that the "git_hash_algo" structure has five new
> function pointers which are "fast" variants of the five existing
> hashing-related function pointers:
> 
>   - git_hash_init_fn fast_init_fn
>   - git_hash_clone_fn fast_clone_fn
>   - git_hash_update_fn fast_update_fn
>   - git_hash_final_fn fast_final_fn
>   - git_hash_final_oid_fn fast_final_oid_fn
> 
> The following commit will introduce compile-time knobs to specify which
> SHA-1 implementation is used for non-cryptographic uses.

While the property we care about in the context of this patch series
indeed is that the second hash is faster, I think the more important
property is that it's insecure. If I were seeing two APIs, one labelled
fast and one labelled slow, I would of course pick the fast one. So I
wonder whether we should rename things accordingly so that developers
aren't intrigued to pick the fast one without thinking, and also to have
a more useful signal that stands out to reviewers.

> Signed-off-by: Taylor Blau <me@xxxxxxxxxxxx>
> ---
>  hash.h        | 42 ++++++++++++++++++++++++++++++++++++++++++
>  object-file.c | 42 ++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 84 insertions(+)
> 
> diff --git a/hash.h b/hash.h
> index 72ffbc862e5..f255e5c1e8a 100644
> --- a/hash.h
> +++ b/hash.h
> @@ -44,14 +44,32 @@
>  #define platform_SHA1_Final    	SHA1_Final
>  #endif
>  
> +#ifndef platform_SHA_CTX_fast
> +#define platform_SHA_CTX_fast	  platform_SHA_CTX
> +#define platform_SHA1_Init_fast	  platform_SHA1_Init
> +#define platform_SHA1_Update_fast platform_SHA1_Update
> +#define platform_SHA1_Final_fast  platform_SHA1_Final
> +#ifdef platform_SHA1_Clone
> +#define platform_SHA1_Clone_fast  platform_SHA1_Clone
> +#endif
> +#endif

We may want to apply our new coding guidelines around nested
preprocessor directives, which should also use indenting.

> @@ -222,6 +249,21 @@ struct git_hash_algo {
>  	/* The hash finalization function for object IDs. */
>  	git_hash_final_oid_fn final_oid_fn;
>  
> +	/* The fast hash initialization function. */

Providing some context here why there are two sets of functions would
help future readers.

> @@ -219,6 +256,11 @@ const struct git_hash_algo hash_algos[GIT_HASH_NALGOS] = {
>  		.update_fn = git_hash_sha256_update,
>  		.final_fn = git_hash_sha256_final,
>  		.final_oid_fn = git_hash_sha256_final_oid,
> +		.fast_init_fn = git_hash_sha256_init,
> +		.fast_clone_fn = git_hash_sha256_clone,
> +		.fast_update_fn = git_hash_sha256_update,
> +		.fast_final_fn = git_hash_sha256_final,
> +		.fast_final_oid_fn = git_hash_sha256_final_oid,
>  		.empty_tree = &empty_tree_oid_sha256,
>  		.empty_blob = &empty_blob_oid_sha256,
>  		.null_oid = &null_oid_sha256,

I was briefly wondering whether we'd rather want to have automatic
fallbacks to the secure alternative when the fast one isn't set. I guess
in the end that's not really worth it though, as it saves us one branch
when not having such a fallback. And we only have so many hashes, so
it's not like this would be a huge pain to maintain.

Patrick