Re: [PATCH 1/3] Lazily open pack index files on demand

Dana How wrote:
[...]

Some history of what I've been doing with git:
First I simply had to import the repo,
which led to split packs (this was before index v2).
Then maintaining the repo led to the unfinished maxblobsize stuff.
Distributing the repo meant users pulling (usually) from the central repo,
which was trivial since it was also set up as an alternate;
local repacking would avoid putting heavy load on it.

Now I've started looking into how to push back into the
central repo from a user's repo (not everything will be central;
some pulling between users will occur,
otherwise I wouldn't be as interested).

It looks like the entire sequence is:
A. git add file   [compute SHA-1 & compress file into objects/xx]
B. git commit     [write some small objects locally]
C. git push {using PROTO_LOCAL}:
   1. read & uncompress objects
   2. recompress objects into a pack and send through a pipe
   3. read pack on other end of pipe and uncompress each object
   4. compute SHA-1 for each object and compress it into objects/xx
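
For reference, here is a rough Python sketch of the per-object work in
A and in C4 (the loose-object path; the helper name below is mine, but
the header + SHA-1 + zlib scheme is how loose objects are stored):

    import hashlib, os, zlib

    def write_loose_blob(objects_dir, data):
        # A blob is stored as zlib-deflated "blob <size>\0" + contents,
        # and is named by the SHA-1 of that same uncompressed stream.
        store = b"blob %d\0" % len(data) + data
        sha1 = hashlib.sha1(store).hexdigest()
        path = os.path.join(objects_dir, sha1[:2], sha1[2:])
        os.makedirs(os.path.dirname(path), exist_ok=True)
        if not os.path.exists(path):   # loose objects are never overwritten
            with open(path, "wb") as f:
                f.write(zlib.compress(store))
        return sha1

Note the single deflate per object here; the A..C4 sequence above
repeats that kind of zlib work several times for the same contents.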

So, after creating an object in the local working tree,
to get it into the central repo, we must:
compress -> uncompress -> compress -> uncompress -> compress.
In responsiveness this won't compare well with Perforce,
which has only one compress step.

The sequence above could look somewhat different in current git.
The user might have repacked their repo before pushing,
but that just moves C1 and C2 earlier in time;
it doesn't remove the need for them.  Besides, the blobs in
a push are more likely to be recent and hence still unpacked.

Also, C3 and C4 might not happen if more than 100 blobs get pushed,
since the receiver then keeps the pack instead of exploding it into
loose objects.  But this seems very unusual; only 0.3% of commits in
the history had 100+ new files/file contents.  If that 100-object
threshold is reduced, the central repo fills up with packfiles and
their index files, reducing performance for everybody (using the
central repo as an alternate).

Thus there really is 5X the zlib activity of Perforce
(three compress steps and two uncompress steps versus one compress).
How can this be reduced?

One way is to restore the ability to write the "new" loose object format.
Then C1, C2, and C4 disappear.  C3 must remain because we need
to uncompress the object to compute its SHA-1;  we don't need
to recompress since we were already given the compressed form.
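
That last sentence, in sketch form (this is the proposal, not current
git behaviour; the helper name is mine):

    import hashlib, os, zlib

    def receive_loose_object(objects_dir, compressed):
        # Inflate only to learn the object's name (the remaining C3 step),
        # then store the bytes exactly as they arrived: no recompression,
        # so C4's deflate is gone.
        store = zlib.decompress(compressed)
        sha1 = hashlib.sha1(store).hexdigest()
        path = os.path.join(objects_dir, sha1[:2], sha1[2:])
        os.makedirs(os.path.dirname(path), exist_ok=True)
        if not os.path.exists(path):   # never overwrite an existing loose object
            with open(path, "wb") as f:
                f.write(compressed)    # reuse the sender's deflate output
        return sha1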

And that final sentence is why I sent this email:  if the packfile
contained the SHA-1s,  either at the beginning or before each object,
then they wouldn't need to be recomputed at the receiving end
and the extra decompression could be skipped as well.  This would
make the total zlib effort the same as Perforce.
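
Purely as an illustration of that idea (this is NOT the existing pack
format; the layout below is invented for the example): the stream could
carry each object's claimed SHA-1 next to its already-deflated data, so
a receiver that trusts the sender could store entries without inflating
them at all.

    import struct

    def read_claimed_entries(stream):
        # Hypothetical wire layout, invented for this example:
        #   <20-byte SHA-1> <4-byte big-endian length>
        #   <length bytes of already-deflated object data>, repeated to EOF.
        while True:
            sha1 = stream.read(20)
            if not sha1:
                break
            (size,) = struct.unpack(">I", stream.read(4))
            yield sha1.hex(), stream.read(size)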

The fact that a loose object is never overwritten would still be retained.
Is that sufficient security?  Or does the SHA-1 always need to be
recomputed on the receiving end?  Could that be skipped just for
specific connections and/or protocols (presumably "trusted" ones)?
[...]

So how do you want to decide when to trust the sender and when to
validate that the objects received have the SHA-1s claimed?  A _central_
repository, being authoritative, would need to _always_ validate _all_
objects it receives.  And since, with a central repository setup, the
central repository is where CPU resources are most in demand, validating
the object IDs when they are received at the developers' repositories
should not be a problem.

And just to be fair: how does Perforce guarantee that the retrieved
version of a file matches what was checked in?
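
For concreteness, "always validate" would amount to something like this
sketch (recompute every name from the inflated bytes and ignore whatever
the sender claimed; the names here are mine, continuing the hypothetical
reader above):

    import hashlib, zlib

    def validate_push(entries):
        # entries: iterable of (claimed_sha1_hex, deflated_object_bytes),
        # e.g. as produced by a reader like read_claimed_entries() above.
        for claimed, compressed in entries:
            actual = hashlib.sha1(zlib.decompress(compressed)).hexdigest()
            if actual != claimed:
                raise ValueError("push rejected: %s arrived as %s"
                                 % (claimed, actual))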
