Re: If you would write git from scratch now, what would you change?

On Mon, 26 Nov 2007, Dana How wrote:

> On Nov 26, 2007 11:52 AM, Nicolas Pitre <nico@xxxxxxx> wrote:
> > On Mon, 26 Nov 2007, Dana How wrote:
> > > Currently data can be quickly copied from pack to pack,
> > > but data cannot be quickly copied blob->pack or pack->blob
> > I don't see why you would need the pack->blob copy normally.
> True,  but that doesn't change the main point.

Sure, but let's not go overboard either.

> > > (there was an alternate blob format that supported this,
> > >  but it was deprecated).  Using the pack format for blobs
> > > would fix this.
> >
> > Then you can do just that for big enough blobs where "big enough" is
> > configurable: encapsulate them in a pack instead of a loose object.
> > Problem solved.  Sure you'll end up with a bunch of packs containing
> > only one blob object, but given that those blobs are so large to be a
> > problem in your work flow when written out as loose objects, then they
> > certainly must be few enough not to cause an explosion in the number of
> > packs.
> Are you suggesting that "git add" create a new pack containing
> one blob when the blob is big enough?

Exactly.

> Re-using (part of) the pack format
> in a blob (or maybe only some blobs) seems like less code change.

Don't know what you mean exactly here, but what I mean is to do 
something as simple as:

	pretend_sha1_file(...);
	add_object_entry(...);
	write_pack_file();

when the buffer to make a blob from is larger than a configured 
threshold.
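For illustration only, the shape of such a single-blob pack looks like 
this (a hypothetical Python sketch of the pack format using the stdlib 
zlib/hashlib/struct; the real change would of course reuse the existing 
pack-objects machinery, not reimplement the format):

```python
import hashlib
import struct
import zlib

OBJ_BLOB = 3  # pack object type number for a blob


def encode_header(obj_type, size):
    """Pack entry header: 3 type bits plus the size as a base-128
    varint, with 4 size bits in the first byte and the MSB marking
    continuation bytes."""
    byte = (obj_type << 4) | (size & 0x0F)
    size >>= 4
    out = bytearray()
    while size:
        out.append(byte | 0x80)  # more size bytes follow
        byte = size & 0x7F
        size >>= 7
    out.append(byte)
    return bytes(out)


def single_blob_pack(data):
    """A pack file containing exactly one undeltified blob."""
    body = b"PACK" + struct.pack(">II", 2, 1)    # version 2, 1 object
    body += encode_header(OBJ_BLOB, len(data))
    body += zlib.compress(data)
    return body + hashlib.sha1(body).digest()    # trailing checksum


pack = single_blob_pack(b"some very large file contents")
```

The point being that nothing about the format forbids a one-object 
pack, so "git add" could emit one directly for oversized blobs.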

> > > It would also mean blobs wouldn't need to
> > > be uncompressed to get the blob type or size I believe.
> >
> > They already don't.
> It looks like sha1_file.c:parse_sha1_header() works on a buffer
> filled in by sha1_file.c:unpack_sha1_header() by calling inflate(), right?
> 
> It is true you don't have to uncompress the *entire* blob.

Right.  Only the first 16 bytes or so need to be uncompressed.

> > > The equivalent operation in git would require the creation of
> > > the blob,  and then of a temporary pack to send to the server.
> > > This requires 3 calls to zlib for each blob,  which for very
> > > large files is not acceptable at my site.
> >
> > I currently count 2 calls to zlib, not 3.
> I count 3:
> 
> Call 1: git-add calls zlib to make the blob.
> 
> Call 2: builtin-pack-objects.c:write_one() calls sha1_file.c:read_sha1_file()
> calls :unpack_sha1_file() calls :unpack_sha1_{header,rest}() calls
> inflate() to get the data from the blob into a buffer.
> 
> Call 3: Then write_one() calls deflate to make the new buffer
> to write into the pack.  This is all under the "if (!to_reuse) {" path,
> which is active when packing a blob.

Oh, you're right.  Somehow I didn't count the needed decompression.
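For a big blob on the "!to_reuse" path, the round trips look like this 
(illustrative Python, with zlib.compress/decompress standing in for 
git's deflate/inflate calls):

```python
import zlib

data = b"large file contents" * 1000

# Call 1: "git add" deflates the buffer into a loose object.
loose = zlib.compress(b"blob %d\0" % len(data) + data)

# Call 2: write_one() inflates the loose object back into memory ...
inflated = zlib.decompress(loose)
payload = inflated.split(b"\0", 1)[1]

# Call 3: ... and deflates it again to write it into the pack.
packed = zlib.compress(payload)

assert payload == data  # same bytes: deflated twice, inflated once
```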

> Remember,  I'm comparing "p4 submit file" to
> "git add file"/"git commit"/"git push",  which is the comparison
> the users will be making.
> 
> On the other hand,  I'm looking at code from June;
> but I haven't noticed big changes since then on the list.
> 
> Calls 2 and 3 go away if the blob and pack formats were more similar.

... which my suggestion should provide with a minimum of changes, maybe 
less than 10 lines of code.


Nicolas
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
