[RFC PATCH 0/6] Hash Abstraction

"brian m. carlson" <sandals@xxxxxxxxxxxxxxxxxxxx> · Mon, 21 Aug 2017 00:00:16 +0000

= Overview

This is an RFC series proposing a basic abstraction for hash functions.

As we get closer to converting the remainder of the codebase to use
struct object_id, we should think about the design we want our hash
function abstraction to take.  This series is a proposal for one idea to
start discussion.  Input on any aspect of this proposal is welcome.

This series exposes a struct git_hash_algo that contains basic
information about a given hash algorithm that distinguishes it from
other algorithms: name, lengths, implementing functions, and empty tree
and blob constants.  It also exposes an array of hash algorithms, and a
constant for indexing them.
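
As a rough illustration only (this is not the code from the patches), a
structure along those lines might look something like the sketch below.
Every identifier with a "sketch_" prefix, the field names, and the "sha1"
name string are assumptions made for the example.  The function pointers
are left NULL so that the sketch stays self-contained.

```c
#include <stddef.h>

/* Hypothetical function types; a real table would point at SHA-1 wrappers. */
typedef void (*sketch_init_fn)(void *ctx);
typedef void (*sketch_update_fn)(void *ctx, const void *in, size_t len);
typedef void (*sketch_final_fn)(unsigned char *out, void *ctx);

/* Illustrative layout only; the field names are assumptions. */
struct sketch_hash_algo {
	const char *name;                /* human-readable algorithm name */
	size_t ctxsz;                    /* size of the hash context */
	size_t rawsz;                    /* digest length in bytes */
	size_t hexsz;                    /* digest length in hex characters */
	sketch_init_fn init_fn;
	sketch_update_fn update_fn;
	sketch_final_fn final_fn;
	const unsigned char *empty_tree; /* ID of the empty tree */
	const unsigned char *empty_blob; /* ID of the empty blob */
};

/* SHA-1 of the empty tree: 4b825dc642cb6eb9a060e54bf8d69288fbee4904 */
static const unsigned char sketch_sha1_empty_tree[20] = {
	0x4b, 0x82, 0x5d, 0xc6, 0x42, 0xcb, 0x6e, 0xb9, 0xa0, 0x60,
	0xe5, 0x4b, 0xf8, 0xd6, 0x92, 0x88, 0xfb, 0xee, 0x49, 0x04
};

/* SHA-1 of the empty blob: e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 */
static const unsigned char sketch_sha1_empty_blob[20] = {
	0xe6, 0x9d, 0xe2, 0x9b, 0xb2, 0xd1, 0xd6, 0x43, 0x4b, 0x8b,
	0x29, 0xae, 0x77, 0x5a, 0xd8, 0xc2, 0xe4, 0x8c, 0x53, 0x91
};

/* Constants used to index the table of algorithms. */
enum { SKETCH_HASH_SHA1 = 0, SKETCH_HASH_NALGOS };

static const struct sketch_hash_algo sketch_hash_algos[SKETCH_HASH_NALGOS] = {
	[SKETCH_HASH_SHA1] = {
		.name = "sha1",
		.ctxsz = 0,      /* unknown in this sketch: no real context type */
		.rawsz = 20,
		.hexsz = 40,
		.init_fn = NULL, /* implementations omitted in this sketch */
		.update_fn = NULL,
		.final_fn = NULL,
		.empty_tree = sketch_sha1_empty_tree,
		.empty_blob = sketch_sha1_empty_blob,
	},
};
```

A caller would then write sketch_hash_algos[SKETCH_HASH_SHA1].rawsz where
it currently uses a hard-coded 20.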

The series also demonstrates a simple conversion using the abstraction
over empty blob and tree values.

In order to avoid conflicting with the struct repository work and with
the goal of avoiding global variables as much as possible, I've pushed
the hash algorithm into struct repository and exposed it via a #define.
This necessitates pulling repository.h into cache.h, which I don't
think is fatal.  Doing that, in turn, necessitated some work on the
Subversion code to avoid conflicts.
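
To make the shape of this concrete, here is a minimal sketch, assuming
the define simply dereferences a pointer stored in the repository
structure.  Only the name current_hash comes from this series; the
"sketch_" types, the setter, and the exact expansion of the macro are
hypothetical.

```c
#include <stddef.h>

/* Trimmed-down stand-in for the hash algorithm structure. */
struct sketch_hash_algo {
	const char *name;
	size_t rawsz;
	size_t hexsz;
};

static const struct sketch_hash_algo sketch_sha1 = { "sha1", 20, 40 };

/* Stand-in for struct repository: the algorithm lives in the repo. */
struct sketch_repository {
	const struct sketch_hash_algo *hash_algo; /* NULL until setup runs */
};

static struct sketch_repository sketch_the_repo;
static struct sketch_repository *sketch_the_repository = &sketch_the_repo;

/* The accessor; this expansion is an assumption made for the sketch. */
#define current_hash (sketch_the_repository->hash_algo)

/* Hypothetical setter that repository setup would call. */
static void sketch_repo_set_hash_algo(struct sketch_repository *repo,
				      const struct sketch_hash_algo *algo)
{
	repo->hash_algo = algo;
}

/* Example caller: size a hex buffer from the current algorithm. */
static size_t sketch_hex_buffer_size(void)
{
	return current_hash->hexsz + 1; /* +1 for the trailing NUL */
}
```

Note that current_hash is NULL until the setter has run, so any code that
dereferences it before repository setup will crash.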

It should be fine for Junio to pick up the first two patches from this
series, as they're relatively independent and valuable without the rest
of the series.  The rest should not be applied immediately, although
they do pass the test suite.

I'm proposing this series now as it will inform the way we go about
converting other parts of the codebase, especially some of the pack
algorithms.  Because we share some hash computation code between pack
checksums and object hashing, we need to decide whether to expose pack
checksums as struct object_id, even though they are technically not
object IDs.  Furthermore, if we end up needing to stuff an algorithm
value into struct object_id, we'll no longer be able to directly
reference object IDs in a pack without a copy.
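
The copy concern can be sketched as follows.  This is an illustration
under stated assumptions, not code from the series: the two struct
layouts and both helper functions are hypothetical, and the table is
just a run of consecutive 20-byte raw hashes standing in for pack data.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_RAWSZ 20

/* Layout with only the hash bytes: overlays raw pack data directly. */
struct sketch_oid_plain {
	unsigned char hash[SKETCH_RAWSZ];
};

/* Layout with an extra algorithm field: no longer matches raw data. */
struct sketch_oid_tagged {
	unsigned char hash[SKETCH_RAWSZ];
	int algo;
};

/* With the plain layout, entry n can be referenced in place. */
static const struct sketch_oid_plain *
sketch_nth_in_place(const unsigned char *table, size_t n)
{
	return (const struct sketch_oid_plain *)(table + n * SKETCH_RAWSZ);
}

/* With the tagged layout, entry n has to be copied out and tagged. */
static void sketch_nth_copy(struct sketch_oid_tagged *out,
			    const unsigned char *table, size_t n, int algo)
{
	memcpy(out->hash, table + n * SKETCH_RAWSZ, SKETCH_RAWSZ);
	out->algo = algo;
}
```

The tagged structure is necessarily larger than the raw hash, so a
pointer into the table can no longer be handed out as an object ID.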

This series is available from the usual places as branch hash-struct,
based against master.

= Naming

The length names are similar to the current constant names
intentionally.  I've used the "hash_algo" name for both the integer
constant and the pointer to struct, although we could change the latter
to "hash_impl" or such as people like.

I chose to name the define "current_hash" and expose no other defines.
The name is relatively short since we're going to be typing it a lot.
However, if people like, we can capitalize it or expose other defines
(say, a GIT_HASH_RAWSZ or GIT_HASH_HEXSZ) instead of or in addition to
current_hash, which would make this name less interesting.

Feel free to propose alternatives to the naming of anything in this
series.

= Open Issues

I originally decided to convert hex.c as an example, but quickly found
out that this caused segfaults.  As part of setup, we call
is_git_directory, which calls validate_headref, which ends up calling
get_sha1_hex.  Obviously, we don't have a repository, so the hash
algorithm isn't set up yet.  This is an area we'll need to consider
making hash-function agnostic, and we may also need to consider
inserting a hash constant integer into struct object_id if we're going
to do that.  Alternatively, we could just paper over this issue as a
special case.

Clearly we're going to want to expose some sort of lookup functionality
for hash algorithms.  We'll need to expose lookup by name (for the
.git/config file and any command-line options), but we may want other
functions as well.  What functions should those be?  Should we expose
the structure or the constant for those lookup functions?  If the
structure, we'll probably need to expose the constant in the structure
as well for easy use.
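
One possible shape for lookup by name is sketched below, purely as a
starting point for discussion.  All names here are hypothetical,
including the reserved "unknown" constant and the "sha1" string.  It
shows both variants: one returning the constant and one returning the
structure, with the constant stored in the structure as suggested above.

```c
#include <stddef.h>
#include <string.h>

/* Index 0 is reserved for "unknown" in this sketch. */
enum { SKETCH_HASH_UNKNOWN = 0, SKETCH_HASH_SHA1 = 1, SKETCH_HASH_NALGOS };

struct sketch_hash_algo {
	const char *name;
	int id;       /* the indexing constant, stored for easy use */
	size_t rawsz;
};

static const struct sketch_hash_algo sketch_hash_algos[SKETCH_HASH_NALGOS] = {
	[SKETCH_HASH_UNKNOWN] = { NULL, SKETCH_HASH_UNKNOWN, 0 },
	[SKETCH_HASH_SHA1] = { "sha1", SKETCH_HASH_SHA1, 20 },
};

/* Variant 1: return the constant, or SKETCH_HASH_UNKNOWN if no match. */
static int sketch_hash_algo_by_name(const char *name)
{
	int i;

	if (!name)
		return SKETCH_HASH_UNKNOWN;
	for (i = 1; i < SKETCH_HASH_NALGOS; i++)
		if (!strcmp(name, sketch_hash_algos[i].name))
			return i;
	return SKETCH_HASH_UNKNOWN;
}

/* Variant 2: return the structure, or NULL if no match. */
static const struct sketch_hash_algo *
sketch_hash_algo_ptr_by_name(const char *name)
{
	int id = sketch_hash_algo_by_name(name);

	return id == SKETCH_HASH_UNKNOWN ? NULL : &sketch_hash_algos[id];
}
```

With the constant stored in the structure, a caller holding the pointer
can get back to the index without a second search.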

Should we avoid exposing the array of structures altogether and wrap this
in a function?

We could expose a union of hash context structures and take that as the
pointer type for the API calls.  That would probably obviate the need
for ctxsz.
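
A minimal sketch of the union idea follows, assuming two context types.
The context layouts, the second algorithm, and all "sketch_" names are
invented for the example, and the init function only sets up state; it
does not implement hashing.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-algorithm context layouts. */
struct sketch_sha1_ctx {
	uint32_t state[5];
	uint64_t count;
	unsigned char buf[64];
};

struct sketch_other_ctx { /* placeholder for some second algorithm */
	uint64_t state[8];
	uint64_t count[2];
	unsigned char buf[128];
};

/* One type big enough for any supported algorithm's context. */
union sketch_hash_ctx {
	struct sketch_sha1_ctx sha1;
	struct sketch_other_ctx other;
};

/* API calls take the union, so callers never need a context size. */
typedef void (*sketch_init_fn)(union sketch_hash_ctx *ctx);

static void sketch_sha1_init(union sketch_hash_ctx *ctx)
{
	/* Standard SHA-1 initial chaining values. */
	ctx->sha1.state[0] = 0x67452301;
	ctx->sha1.state[1] = 0xEFCDAB89;
	ctx->sha1.state[2] = 0x98BADCFE;
	ctx->sha1.state[3] = 0x10325476;
	ctx->sha1.state[4] = 0xC3D2E1F0;
	ctx->sha1.count = 0;
}
```

A caller can declare "union sketch_hash_ctx ctx;" on the stack and pass
its address to whichever init function the algorithm provides, at the
cost of always reserving space for the largest context.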

We could expose hex versions of the blob constants if desired.  This
might make converting the remaining pieces of code that use them easier.

There are probably dozens of other things I haven't thought of yet as
well.

brian m. carlson (6):
  vcs-svn: remove unused prototypes
  vcs-svn: rename repo functions to "svn_repo"
  setup: expose enumerated repo info
  Add structure representing hash algorithm
  Integrate hash algorithm support with repo setup
  Switch empty tree and blob lookups to use hash abstraction

 builtin/am.c        |  2 +-
 builtin/checkout.c  |  2 +-
 builtin/diff.c      |  2 +-
 builtin/pull.c      |  2 +-
 cache.h             | 48 ++++++++++++++++++++++++++++++++++++++++++++----
 diff-lib.c          |  2 +-
 merge-recursive.c   |  2 +-
 notes-merge.c       |  2 +-
 repository.c        |  7 +++++++
 repository.h        |  5 +++++
 sequencer.c         |  6 +++---
 setup.c             | 48 +++++++++++++++++++++++++++---------------------
 sha1_file.c         | 29 +++++++++++++++++++++++++++++
 submodule.c         |  2 +-
 vcs-svn/repo_tree.c |  6 +++---
 vcs-svn/repo_tree.h | 13 +++----------
 vcs-svn/svndump.c   |  8 ++++----
 17 files changed, 133 insertions(+), 53 deletions(-)