[PATCH 2/3] Add documentation for fast hash collision detection

Bill Zaumen <bill.zaumen@xxxxxxxxx> · Tue, 29 Nov 2011 22:12:39 -0800

The documentation added is technical documentation describing
how fast hash collision detection operates and changes to
several git commands (a few new command-line arguments, mostly).

Note: the change to the implementation is in the child of
this commit.

Signed-off-by: Bill Zaumen <bill.zaumen+git@xxxxxxxxx>
---
 Documentation/git-count-objects.txt          |   12 +-
 Documentation/git-index-pack.txt             |   16 +-
 Documentation/git-verify-pack.txt            |   23 ++
 Documentation/technical/collision-detect.txt |  385 ++++++++++++++++++++++++++
 Documentation/technical/pack-format.txt      |   40 +++
 5 files changed, 471 insertions(+), 5 deletions(-)
 create mode 100644 Documentation/technical/collision-detect.txt

diff --git a/Documentation/git-count-objects.txt b/Documentation/git-count-objects.txt
index 23c80ce..4cdbaf5 100644
--- a/Documentation/git-count-objects.txt
+++ b/Documentation/git-count-objects.txt
@@ -8,7 +8,7 @@ git-count-objects - Count unpacked number of objects and their disk consumption
 SYNOPSIS
 --------
 [verse]
-'git count-objects' [-v]
+'git count-objects' [-v] [-M]
 
 DESCRIPTION
 -----------
@@ -25,6 +25,16 @@ OPTIONS
 	objects, number of packs, disk space consumed by those packs,
 	and number of objects that can be removed by running
 	`git prune-packed`.
+-M::
+--count-md::
+	Report the number of loose objects with no stored message digests.
+	With the -v option, the number of missing "mds" files (these
+	contain the message digests for the SHA1 hashes in the corresponding
+	"idx" files) is reported, along with a count of the number of
+	mds files whose size is wrong (e.g., an index was created but the
+	existing MDS file was not updated) and a count of the number of
+	objects in pack files that do not have a stored message digest.
+	Values that are zero are not shown.
 
 GIT
 ---
diff --git a/Documentation/git-index-pack.txt b/Documentation/git-index-pack.txt
index 909687f..3285fae 100644
--- a/Documentation/git-index-pack.txt
+++ b/Documentation/git-index-pack.txt
@@ -11,14 +11,14 @@ SYNOPSIS
 [verse]
 'git index-pack' [-v] [-o <index-file>] <pack-file>
 'git index-pack' --stdin [--fix-thin] [--keep] [-v] [-o <index-file>]
-                 [<pack-file>]
+		 [-m <mds-file>] [<pack-file>]
 

 DESCRIPTION
 -----------
-Reads a packed archive (.pack) from the specified file, and
-builds a pack index file (.idx) for it.  The packed archive
-together with the pack index can then be placed in the
+Reads a packed archive (.pack) from the specified file, and builds a
+pack index file (.idx) and a pack mds file (.mds) for it.  The packed
+archive together with the pack index can then be placed in the
 objects/pack/ directory of a git repository.
 

@@ -35,6 +35,14 @@ OPTIONS
 	fails if the name of packed archive does not end
 	with .pack).
 
+-m <mds-file>::
+	Write the generated pack mds file into the specified.
+	file Without this option, the name of the pack mds
+	file is constructed from the name of packed archive
+	file by replacing .pack with .idx (and the program
+	fails if the name of packed archive does not end
+	with .pack).
+
 --stdin::
 	When this flag is provided, the pack is read from stdin
 	instead and a copy is then written to <pack-file>. If
diff --git a/Documentation/git-verify-pack.txt b/Documentation/git-verify-pack.txt
index cd23076..e81c514 100644
--- a/Documentation/git-verify-pack.txt
+++ b/Documentation/git-verify-pack.txt
@@ -33,6 +33,15 @@ OPTIONS
 	Do not verify the pack contents; only show the histogram of delta
 	chain length.  With `--verbose`, list of objects is also shown.
 
+-M::
+--show-mds::
+	Show the message digests along with the 40-character object names
+	(SHA1 value in hexidecimal). Ignored if --stat-only is set. If
+	--verbose is not set, only the table indexed by object names is
+	shown, although the files will be verified.  The message digests
+	printed are the actual ones - if the MDS file does not contain these,
+	the verification will fail.
+
 \--::
 	Do not interpret any more arguments as options.
 
@@ -48,6 +57,20 @@ for objects that are not deltified in the pack, and
 
 for objects that are deltified.
 
+When the -M option is used, the offset-in-pack field is followed by an
+entry giving the message digest.  The format used is:
+
+      md=0xHEX_VALUE
+
+when a message digest exists, and
+
+     <no md>
+
+when a message digest does not exist.  These entries precede the depth
+entry for deltified objects.  A non-existent message digest will be shown
+only if the MDS file is missing - while the MDS-file format allows missing
+entries, the file will not be considered valid.
+
 GIT
 ---
 Part of the linkgit:git[1] suite
diff --git a/Documentation/technical/collision-detect.txt b/Documentation/technical/collision-detect.txt
new file mode 100644
index 0000000..0d33da8
--- /dev/null
+++ b/Documentation/technical/collision-detect.txt
@@ -0,0 +1,385 @@
+Fast Hash-Collision Detection
+=============================
+
+Initially Git used a SHA-1 hash as an object ID under the assumption
+that a hash collision would never occur in practice. While an
+accidental SHA-1 collision is extremely unlikely, it is possible,
+although very expensive, to generate multiple files with the same
+SHA-1 value in under 2^57 operations.  With computer performance
+increasing significantly from one year to the next, Git's assumptions
+about SHA-1 will eventually not hold in the case of a malicious
+attempt to damage a project.  One should note that just because the
+probability of a SHA-1 collision occurring accidentally is extremely
+low does not mean a priori that SHA-1 provides an adequate safety
+margin for preventing a malicious attempt to damage repositories and a
+discussion below outlines some of the issues regarding this
+possibility.
+
+While one could modify Git to use SHA-224, SHA-256, SHA-384, or
+SHA-512 instead of SHA-1, the change would have to support the
+original format as well (in order to deal with existing Git
+repositories). While one could convert an existing repository to use
+the new hash function, this would require rewriting every object,
+including trees and commits.  The outcome would be problematic given
+the existence of email and documentation that might name commits by
+their SHA-1 hashes. One should note that Git performs a byte-by-byte
+check for hash collisions when a pack file is indexed.  Unfortunately,
+during fetch or pull operations, Git tries to avoid copying objects
+when a peer already has a copy, and this is determined solely on the
+basis of SHA-1 hashes.
+
+The following describes a modification to Git's initial design that is
+(a) relatively easy to implement, (b) is compatible with and can
+interoperate with older versions of Git (both the program and the
+repositories) (c) has a small computational overhead, and (d)
+increases security substantially, with a goal of detecting hash
+collisions early and automatically.  Because the implementation is
+relatively simple and the overhead very low, it makes sense to
+incorporate this change (or some alternative) before the security
+issue becomes a serious problem.
+
+Although Git generally uses that assumption that there will never be a
+hash collision using SHA-1 in practice, under some circumstances, Git
+will detect collisions via a byte-by-byte comparison as objects are
+added to the repository or as pack files are indexed.  This test is
+performed when an index is built (via the Git pack-index command), but
+a byte-by-byte comparison was deemed too computationally expensive to
+use in all circumstances: with pack files in particular, simply
+extracting an object can require not only decompressing it, but
+handling a series of delta encodings.
+
+Collision detection has been extended by computing a message digest or
+CRC of the object's contents (i.e., excluding the Git header). These
+message digests are stored separately from Git objects and are used
+for an independent collision test - looking up the message digests or
+CRCs using the SHA-1 IDs as a key can be done quickly, and comparing
+them is fast as well (a single unsigned-integer comparison for a
+32-bit CRC).  Assuming statistical independence in the CRC case, the
+changes of an undetected SHA-1 collision, should one occur, is 1 in
+2^32.  This extension is computationally cheap (timing the Git test
+suite (run via 'make test') showed only a small increase in running
+time and the extension is backwards compatible with existing Git
+repositories - if a CRC is not available for a SHA-1 value, the
+implementation reverts to its former behavior and simply compares
+SHA-1 values.  The CRC can, of course, be easily replaced with a
+SHA-256 or SHA-512 digest to reduce the chances of an undetected SHA-1
+collision to nearly zero.  For that reason, in the following we will
+use the terms MD (Message Digest) and CRC interchangeably.
+
+The implementation creates a directory in .git/objects named "crcs",
+which contains sub-directories and file names identical to the
+sub-directories in objects used to store loose objects: a two
+character directory name, with a 38-character file name, the
+concatenation of which gives the SHA-1 hash for the object.  the files
+in sub-directories of "crcs", however, simply contain 32-bit CRCs
+stored in network byte order.  In addition, for each pack file
+(.../objects/pack/FILE.pack), there is a corresponding file named
+.../objects/pack/FILE.mds in addition to .../objects/pack/FILE.idx.
+The MDS file contains the CRCs, stored in the same order as the SHA-1
+hashes in .../objects/pack/FILE.idx.  The format of the MDS file is
+described in pack-format.txt.
+
+Thus, the directory structure (only part of it is shown) is as
+follows:
+
+ .git---.
+	|
+	|-objects-.
+	|	  |--XX--.
+	.	  |	 |--XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
+	.	  |	 .
+	.	  |	 .
+		  |	 .
+		  .
+		  .
+		  .
+		  |-crcs-.
+		  |	 |--XX--.
+		  |	 |	|--XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
+		  |	 .	.
+		  |	 .	.
+		  |	 .	.
+		  |
+		  |-pack-.
+		  |	 |--YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY.pack
+		  |	 |--YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY.idx
+		  |	 |--YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY.mds
+		  |	 .
+		  |	 .
+		  |	 .
+		  |
+		  `-info-.
+			 .
+			 .
+
+The mds files are relatively short - an average of 5 bytes per CRC
+plus some fixed overhead due to a header and trailer, with the CRCs
+listed in the same order as the SHA-1 values in the matching idx file
+(a function named nth_packed_object_objcrc32 has the same signature as
+the previously-defined function nth_packed_object_offset, so the
+procedure to look up the MD/CRC value from a pack file is the same).
+
+For fetch and push operations, the commands fetch-pack, send-pack,
+receive-pack, and upload-pack were modified so that various object IDs
+can have any one of the following formats, with each number
+represented in hexadecimal:
+
+		SHA1
+		SHA1-MD
+		SHA1-MD-BlobMD
+
+where SHA1 is the SHA-1 hash of a commit, MD is the 32-bit CRC of the
+commit (uncompressed, not including the Git object header), and BlobMD
+is a CRC of the CRCs for each blob found by traversing the commit's
+tree using a depth-first search, with blobs processed in the order
+they appear in the trees (this is a relatively fast operation because
+the CRCs of each blob are stored in the repository).
+
+Both receive-pack and upload-pack send a capability named "mds-check"
+to allow the two longer object IDs.  When the CRCs are available, the
+longer formats are used, but are generated only by fetch-pack and
+send-pack: because of backwards-compatibility constraints,
+receive-pack and upload-pack cannot determine the capabilities of
+fetch-pack and send-pack when connected to a remote repository).
+While the message digests associated with each object are computed
+once and stored, the "BlobMD" ones are computed each time - the
+"BlobMD" ones are only used by the Git commands fetch-pack,
+upload-pack, send-pack, and receive-pack.  In each case, a
+hash-collision check is performed only if the message digest used is
+available.  The collision checks during a fetch, push, or pull command
+are done by receive-pack and upload pack because send-pack and
+fetch-pack do not receive their peers' MD and BlobMD values.
+
+
+Implementation Details
+----------------------
+
+Functions to manipulate the message digest/CRC database are declared
+in the file crcdb.h.  The implementation as described above is in the
+file objd-crcdb.c: it is thus easy to change the implementation of how
+these objects are stored with minimal impact on the rest of the source
+code.
+
+In pack-write.c, there is a new function named write_mds_file with the
+same function signature as write_idx_file.  Both are called in pairs
+(write_idx_file first) so that the idx file and mds file for the
+corresponding pack file will always be created.
+
+In commit.c, there is a new function that recursively traverses the
+tree associated with a commit and finds the "blob" entries and looks
+up those entries' message digests in order to compute a message digest
+of these message digests (which is faster than computing a message
+digest of all the bytes in the blobs associated with a commit).
+
+Various function names signatures in sha1_file.c were changed to take
+two additional arguments, the first a pointer to an int used as a flag
+to indicate whether a MD/CRC exists, and the second a pointer to a
+uint32_t containing the MD/CRC (a CRC currently).  For backwards
+compatibility with previously existing functions, those functions had
+there names changed by adding "_extended" to them, with macros in
+cache.h defined so that existing code that does not need to obtain a
+MD/CRC would not be changed. There are a few additional functions
+added to sha1_file.c such as one to determine if there is an MD/CRC
+for a given SHA-1 value. Many changes in the rest of Git that result
+from this simply change the arguments to these functions.  As a
+convention, most such arguments use names like objcrc32, objcrc32p,
+has_objcrc32 and has_objcrc32p in order to make it easy to find areas
+of the code implementing hash-collision detection using the git-grep
+command.
+
+A few data structures (notably struct pack_idx_entry and struct
+packed_git) contain fields used to store has_objcrc32 and objcrc32
+values or data associated with MDS files.  These are used while
+building new MDS files.
+
+Some of the Git commands (count-objects, index-pack, and verify-pack)
+have additional command-line options related to the MD/CRCs and mds
+files. This makes it possible to explicitly name an mds file being
+created and to request that various listings show both the MD/CRC
+values in addition to SHA-1 hashes (the MD/CRC values are not listed
+by default in case user-defined scripts assume the current behavior).
+
+For C files, changes were made to the following files (compared to
+commit 017d1e13) for the initial collision-detection implementation:
+
+       * builtin/count-objects.c
+       * builtin/fetch-pack.c
+       * builtin/index-pack.c
+       * builtin/init-db.c
+       * builtin/pack-objects.c
+       * builtin/pack-redundant.c
+       * builtin/prune-packed.c
+       * builtin/prune.c
+       * builtin/receive-pack.c
+       * builtin/send-pack.c
+       * builtin/verify-pack.c
+       * commit.c
+       * environment.c
+       * fast-import.c
+       * gdbm-packdb.c (new file)
+       * git.c
+       * hex.c
+       * http.c
+       * objd-crcdb.c (new file)
+       * pack-write.c
+       * sha1_file.c
+       * upload-pack.c
+
+The other files had changes that reflected changes to function
+signatures.
+
+The header files that were modified are
+
+    * cache.h
+    * commit.h
+    * crcdb.h (new file)
+    * pack.h
+    * packdb.h (new file)
+
+where the changes are mostly new function declarations, a few macros
+for backwards-compatibility, and a few additional fields in some
+data structures.
+
+Minor changes were made to the test suite: t0000-basic.sh,
+t5300-pack-object.sh, t5304-prune.sh, t5500-fetch-pack.sh, and
+t5510-fetch.sh.
+
+The packdb functions are conditionally compiled and by default are not
+used.  When used, these use GDBM to store CRCs for SHA-1 hashes in
+cases in which the hash was not available - in this case the hash will
+be recomputed and stored for future use.  Testing indicates that
+packdb is not needed. It may be worth turning on during debugging to
+verify if a problem is discovered involving a missing MD/CRC. (As an
+aside, the packdb code is based on a test to see if GDBM would be
+efficient enough to store the MD/CRC values in general, thus avoiding
+the need to create "mds" files and reducing the number of files in the
+"crcs" directory, but it turned out that performance was not
+acceptable.)
+
+Security-Issue Details
+----------------------
+
+Without hash-collision detection, Git has a higher risk of data
+corruption due to the obvious hash-collision vulnerability, so the
+issue is really whether a usable vulnerability exists. Recent research
+has shown that SHA-1 collisions can be found in 2^63 operations or
+less.  While one result claimed 2^53 operations, the paper claiming
+that value was withdrawn from publication due to an error in the
+estimate. Another result claimed a complexity of between 2^51 and 2^57
+operations, and still another claimed a complexity of 2^57.5 SHA-1
+computations. A summary is available at
+<http://hackipedia.org/Checksums/SHA/html/SHA-1.htm#SHA-1>. Given the
+number of recent attacks, possibly by governments or large-scale
+criminal enterprises
+(<http://www.csmonitor.com/World/terrorism-security/2011/0906/Iranian-government-may-be-behind-hack-of-Dutch-security-firm>,
+<http://en.wikipedia.org/wiki/Operation_Aurora>,
+<http://en.wikipedia.org/wiki/Botnet#Historical_list_of_botnets>),
+which include botnets with an estimated 30 million computers, there is
+reason for some concern: while generating a SHA-1 collision for
+purposes of damaging a Git repository is extremely expensive
+computationally, it is possibly within reach of very well funded
+organizations. 2^32 operations, even if the operations are as
+expensive as computing a SHA-1 hash of a modest source-code file, can
+be performed in a reasonably short period of time on the type of
+hardware widely used in desktop or laptop computers at present. With
+sufficient parallelism, 30 million personal computers sufficient for
+playing the latest video games could perform 2^56 operations in a
+reasonable time.
+
+The security implications depend on how Git is used.  In the simplest
+case in which a single, shared repository is used by a number of
+developers, with source code only shared though this repository, the
+problems are minimal - Git will not allow one to insert an object
+whose SHA-1 hash matches that of an existing object.  Since an
+attacker would not know the SHA-1 hash until the correct object is
+already in the shared repository, all an attacker would succeed in
+doing is to create some confusion in his/her private repository and
+working copy (but note that some of the assumptions break down if
+developers email source files between themselves rather than transferring
+everything via a Git repository).
+developers, with source code only shared though this repository, the
+problems are minimal - Git will not allow one to insert an object
+whose SHA-1 hash matches that of an existing object.  Since an
+attacker would not know the SHA-1 hash until the correct object is
+already in the shared repository, all an attacker would succeed in
+doing is to create some confusion in his/her private repository and
+working copy (but note that some of the assumptions break down if
+developers email source files between themselves rather than transferring
+everything via a Git repository).
+
+In other cases, however, there could be an issue if a SHA-1 collision
+can be created quickly enough. As an example, suppose one is using the
+"Integration-manager workflow" described in "Pro Git" (see
+<http://progit.org/book/ch5-1.html>) with a "blessed repository" BR
+and two public developer repositories DPR1 and DPR2.  Suppose a
+developer puts a legitimate change to the code into DPR2. Another
+developer with read access to DPR2 and write access to DPR1
+immediately fetches this commit, and replaces one file with a modified
+version of the same size that has the same SHA-1 value, and then puts
+the commit into DPR1 (the change may be a trivial one designed to
+introduce an obscure buffer overflow error that can be exploited and
+may not stand out in a code review).  At this point DPR1 and DPR2 have
+identical commits (i.e., the same commit object) but with one file
+modified.  No local test on either DPR1 or DPR2 will uncover a
+discrepancy based on SHA-1 values, object contents, and digital
+signatures for commits.  Further assume that both developers send an
+email to the "integration manager" notifying him/her of the changes.
+
+Now suppose the above changes occurred late on a Friday afternoon and
+over a weekend.  On the next Monday morning, the "integration manager"
+reads the emails and pulls changes from DPR1 and then DPR2.  Once a
+'fetch' from DPR1 is complete, a subsequent fetch from DBR2 will not
+transfer any data related to the commit, as the integration manager's
+repository already has a commit with the matching SHA-1 value.  A
+modified copy will have been introduced into the system, and the
+developers using DPR2 may never notice the difference - as they will
+pull from DPR2 more often than BR, they will most likely have the
+correct files and the commit in their local repositories and the
+transfer protocol will avoid sending the commit if it is already
+available.  Furthermore, if the modified file introduces a hard-to-find
+buffer overflow testing may not uncover the problem.
+
+If the file in question is modified again, and a thin pack is used in
+a fetch so that the change is delta-encoded, the SHA-1 hashes may
+differ after de-deltafication, but such a change might not be
+introduced before a release.  While obtaining a different SHA-1
+hash than expected would be detected, the error would be a corrupted
+repository - a missing SHA-1 value in the tree associated with a commit.
+It would take some effort to figure out what went wrong - what
+happens depends on the state of both the client and server when a
+git-fetch operation runs.
+
+As a justification for the scenarios just described, one can use MD5
+as a model: http://www.mscs.dal.ca/~selinger/md5collision/ gives the
+following two 128-byte sequences (expressed in hexadecimal) as ones
+with the same MD5 hash:
+
+d131dd02c5e6eec4693d9a0698aff95c 2fcab58712467eab4004583eb8fb7f89
+55ad340609f4b30283e488832571415a 085125e8f7cdc99fd91dbdf280373c5b
+d8823e3156348f5bae6dacd436c919c6 dd53e2b487da03fd02396306d248cda0
+e99f33420f577ee8ce54b67080a80d1e c69821bcb6a8839396f9652b6ff72a70
+
+and
+
+d131dd02c5e6eec4693d9a0698aff95c 2fcab50712467eab4004583eb8fb7f89
+55ad340609f4b30283e4888325f1415a 085125e8f7cdc99fd91dbd7280373c5b
+d8823e3156348f5bae6dacd436c919c6 dd53e23487da03fd02396306d248cda0
+e99f33420f577ee8ce54b67080280d1e c69821bcb6a8839396f965ab6ff72a70
+
+These can be used for test purposes.  If you change the first 16 bytes
+of both sequences to the same new value, before the point where the
+files differ, you will get different MD5 hash values.  If you append
+the same text to both files, the MD5 values of the modified files are
+equal (and, of course, different from the previous value) but
+fortunately Git includes the length of an object's contents in the
+object's header, and the SHA-1 hash includes the header).  In most
+cases, whever a file is modified during software development, the
+file's length will change.  This causes a change early enough in the
+object so that applying the same patch to two different files that
+have the same SHA-1 value will typically result in two files with
+different SHA-1 values.  So, the result of a hash collision when
+multiple remote repositories are used would initially be different
+versions of the same file for the same commit in different
+repositories, possibly followed by some of the repositories being
+corrupted as one of these files is modified, depending on the state
+of each server and client when a fetch or pull operation is run.
diff --git a/Documentation/technical/pack-format.txt b/Documentation/technical/pack-format.txt
index 1803e64..4dfaf92 100644
--- a/Documentation/technical/pack-format.txt
+++ b/Documentation/technical/pack-format.txt
@@ -158,3 +158,43 @@ Pack file entry: <+
     corresponding packfile.
 
     20-byte SHA1-checksum of all of the above.
+
+
+= pack-*.mds files contain message digests for objects (CRCs
+  initially, with other options possibly added at a later date such as
+  SHA-256).  The digests are stored in the same order as the sha-1 values
+  in the matching idx file.  These files have the following format:
+
+  - A 6-byte magic number consisting the the characters "PKMDS" followed
+    by a NULL character (0).
+
+  - A one-byte version number (= 1)
+
+  - A one-byte field-length value for message digest fields, in units of
+    4-byte words, with the legal value being 1. The length of the message
+    digest fields in bytes is denoted as wsize below).
+
+  - A set of blocks, each of which contains 4 entries encoded as follows:
+
+      * four one-byte fields, one per entry, for which a zero value
+	indicates that a matching entry does not exist and for which a
+	value of 1 indicates that the field contains a 32-bit CRC stored
+	in network byte order.
+
+      * 4 wsize-byte fields, one per entry, each containing a CRC
+	(by conventionwhich should be 0 if the CRC does not exist).
+	For each field, the data it contains should start at the first
+	byte, padded with NULL characters if the field is longer than
+	the digest it stores.
+
+    For the set of all blocks, the nth one-byte field and the nth 4-byte
+    field store the values for the nth entry in the file. The format
+    ensures that each message digest starts on a 32-bit boundary,
+    allowing 32-bit integer operations to be used in copying or
+    comparing values.
+
+  - A 20 byte SHA-1 hash of the SHA-1 hashes naming the objects whose
+    message digests are being stored, in the same order as they
+    appear in the corresponding idx file.
+
+  - A 20 byte SHA-1 hash of all of the above.
-- 
1.7.1


--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html