Re: [PATCH v2 00/10] Use a structure for object IDs.

Michael Haggerty <mhagger@xxxxxxxxxxxx> · Wed, 11 Mar 2015 17:08:58 +0100

On 03/08/2015 12:23 AM, brian m. carlson wrote:
> This is a patch series to convert some of the relevant uses of unsigned
> char [20] to struct object_id.
> 
> The goal of this series to improve type-checking in the codebase and to
> make it easier to move to a different hash function if the project
> decides to do that.  This series does not convert all of the codebase,
> but only parts.  I've dropped some of the patches from earlier (which no
> longer apply) and added others.
> 
> Certain parts of the code have to be converted before others to keep the
> patch sizes small, maintainable, and bisectable, so functions and
> structures that are used across the codebase (e.g. struct object) will
> be converted later.  Conversion has been done in a somewhat haphazard
> manner by converting modules with leaf functions and less commonly used
> structs first.
> 
> Since part of the goal is to ease a move to a different hash function if
> the project decides to do that, I've adopted Michael Haggerty's
> suggestion of using variables named "oid", naming the structure member
> "sha1", and using GIT_SHA1_RAWSZ and GIT_SHA1_HEXSZ as the constants.
> 
> I've been holding on to this series for a long time in hopes of
> converting more of the code before submitting, but realized that this
> deprived others of the ability to use the new structure and required me
> to rebase extremely frequently.
> [...]

I've added CC to several people who commented on v1.

I think this is a really interesting project and I hope that it works out.

In my opinion, the biggest risk (aside from the sheer amount of work
required) is the issue that was brought up on the mailing list when you
submitted v1 [1]: Converting an arbitrary (unsigned char *) to a (struct
object_id *) is not allowed, because the alignment requirements of the
latter might be stricter than those of the former. This means that if,
for example, we are reading some data from disk or from the network, and
we expect the 20 bytes starting with byte number 17 to be a SHA-1 in
binary format, we used to be able to do

    unsigned char *sha1 = buf + 17;

and use sha1 like any other SHA-1, without the need for any copying. But
we can't do the analogous

    struct object_id *oid = (struct object_id *)(buf + 17);

because the alignment is not necessarily correct. So in a pure "struct
object_id" world, I think we would be forced to change such code to

    struct object_id oid;
    hashcpy(oid.sha1, buf + 17);

This uses additional memory and introduces copying overhead. Also, the
lifetime of oid might exceed the scope of a function, so oid might have
to be allocated on the heap and later freed. This adds more
computational overhead, more memory overhead, and more programming
effort to get it all right.

So as much as I like the idea of wrapping SHA-1s in objects, I think it
would be a good idea to first make sure that we can all agree on a plan
for dealing with situations like this. I can think of the following
possibilities:

1. Maybe code that needs to handle SHA-1s with screwy alignment does not
exist, or maybe it does the copying anyway. I haven't actually checked.

2. Maybe somebody can prove that

       struct object_id *oid = (struct object_id *)(buf + 17);

   somehow *is* allowed by the C standard, or at least that it is
sufficiently portable for our purposes.

3. We accept the additional runtime costs and development effort for the
extra copies. To accept this approach, we would need some idea of which
areas of the code will be affected, and some estimate of how much
additional memory it would cost.

4. We continue to support working with SHA-1s declared to be (unsigned
char *) in some performance-critical code, even as we migrate most other
code to using SHA-1s embedded within a (struct object_id). This will
cost some duplication of code. To accept this approach, we would need an
idea of *how much* code duplication would be needed. E.g., how many
functions will need both (unsigned char *) versions and (struct
object_id *) versions?

5. We only make the change opportunistically. Whenever we find a
function that needs to work with non-struct-aligned SHA-1s, we leave the
function as-is rather than converting it or creating a second version
that works with object_id objects. This approach would leave the
codebase somewhat schizophrenic.

I'm not trying to dissuade you from this project, but I think that for
your project to have a chance of success, we need an answer to this
question. So let's confront it now rather than after you have invested
lots more time and/or gotten patches merged.

Michael

[1] http://thread.gmane.org/gmane.comp.version-control.git/248054

-- 
Michael Haggerty
mhagger@xxxxxxxxxxxx

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html