Re: Suggestion on hashing

Nguyen Thai Ngoc Duy <pclouds@xxxxxxxxx> · Fri, 2 Dec 2011 21:22:31 +0700

(I'm not sure why you dropped git@vger. I see nothing private here so
I bring git@vger back)

On Fri, Dec 2, 2011 at 3:08 PM, Bill Zaumen <bill.zaumen@xxxxxxxxx> wrote:
> At one point Nguyen said that "What I'm thinking is whether it's
> possible to decouple two sha-1 roles in git, as object identifier
> and digest, separately. Each sha-1 identifies an object and an extra
> set of digests on the "same" object."
>
> My code pretty much does that (it just uses a CRC instead of a real
> digest, but I can easily change that).

It'd be easier to look at your code if you split it into a series of
smaller patches.

> So the question is whether
> using SHA-1 as an ID and SHA-256(?) as a digest is a better long term
> solution than simply replacing SHA-1.

I would not stick with any algorithm permanently. No one knows when
SHA-256 might be broken.

> If there is some interest in pursuing it further, I could make those
> changes fairly easily.  Then you'd have two message digests, a SHA-1
> and a longer one, with the longer one stored parallel to the actual
> object. Then it becomes easy to compute a digest of all the digests
> in a commit's tree and store that in a commit, if that is what you
> want to do.

I personally would like to see how it works out especially when
computing new digests is much more expensive than SHA-1. And I hope
that by delaying computing new digests (stored outside actual
objects), we could make minimum code changes to git. Though security
concerns may be the killer factor and I haven't worked that out yet.

> Replacing SHA-1 with something like SHA-256 sounds easier to implement,

SHA-1 charateristics (like 20 byte length) are hard coded everywhere
in git, it'd be a big audit.

> but the problem is all the existing repositories.  While rewriting all
> the objects and trees to use new hashes is similar to a rebase in most
> cases, there is a complication - submodules.  Git stores the hash of
> a submodule's commit in its tree because a particular revision of
> a project 'goes' with a particular revision of a submodule. But, a
> submodule can exist in one revision and not in the next or previous
> revision  Furthermore A could be a submodule of B at one point in time,
> and many commits later, B could end up being a submodule of A.
> Fixing it up could be pretty complicated (plus having to deal with
> network failures - to update GitHub for example, you'd have to download
> submodules it uses, possibly from somewhere else and some submodules may
> not be publicly accessible (e.g., a private project kept on GitHub but
> with a critical submodule kept in house behind a corporate firewall).
> Also, you might have to update a git repository and its submodules
> concurrently, so that you always can find a new value when you need
> it.
>
> My guess is that this could be far more complicated than what I did.
> Excluding two files that are not used (the symbol PACKDB is not
> defined), I added two new files, crcdb.h and objd-crcdb.c which store
> CRCs for loose objects - 517 lines total including lots of comments in
> the header file - full documentation for each function.  The other
> changes include 1475 lines of new code in previously existing git files
> and 136 deletions (most trivial).  There were also minor changes to
> the makefile and test scripts.

You'd need to convince git maintainer this is worth doing first,
before talking how big the changes are ;-)

> Bill
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html