Re: Why is "git tag --contains" so slow?

Jakub Narebski <jnareb@xxxxxxxxx> · Thu, 8 Jul 2010 23:20:03 +0200

On Thursday, 8 July 2010 22:13, Nicolas Pitre wrote:
> On Thu, 8 Jul 2010, Avery Pennarun wrote:
> > On Thu, Jul 8, 2010 at 3:29 PM, Nicolas Pitre <nico@xxxxxxxxxxx> wrote:

> > > I might be looking at this from my own perspective as one of the few
> > > people who hacked extensively on the Git pack format from the very
> > > beginning.  But I do see a way for the pack format to encode commit and
> > > tree objects so that walking them would be a simple lookup in the pack
> > > index file where both the SHA1 and offset in the pack for each parent
> > > can be immediately retrieved.  Same for tree references.  No deflating
> > > required, no binary search, just simple dereferences.  And the pack size
> > > would even shrink as a side effect.
> > 
> > One trick that bup uses is an additional file that sits alongside the
> > pack and acts as an index.  In bup's case, this is to work around
> > deficiencies in the .idx file format when using ridiculously huge
> > numbers of objects (hundreds of millions) across a large number of
> > packfiles.  But the same concept could apply here: instead of doing
> > something like rev-cache, you could just construct the "efficient"
> > part of the packv4 format (which I gather is entirely related to
> > commit messages), and store it alongside each pack.
> 
> No.  I want the essential information in an efficient encoding _inside_ 
> the pack, actually replacing the existing encoding.  One of the goals is 
> also to reduce repository size, not to grow it.

That's a good idea.
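
To make the "simple dereferences" walk concrete, here is a minimal
sketch.  It is not the actual pack v4 layout: INDEX, PARENTS and the toy
SHA-1s are hypothetical stand-ins, assuming only that each commit records
its parents as positions in the sorted pack index.

```python
from collections import deque

# Hypothetical model: one row per object in the pack index, sorted by SHA-1.
# Each row is (abbreviated SHA-1, offset in the pack).  Toy values only.
INDEX = [
    ("1111", 300),
    ("2222", 12),
    ("3333", 150),
    ("4444", 220),
]

# Hypothetical commit encoding: parents are stored as index positions,
# so following a parent is a table lookup (no inflate, no binary search).
PARENTS = {
    0: [2, 3],  # 1111 is a merge of 3333 and 4444
    1: [],      # 2222 is the root commit
    2: [1],
    3: [1],
}


def ancestors(start):
    """Return index positions reachable from start, in breadth-first order."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        pos = queue.popleft()
        order.append(pos)
        for parent in PARENTS[pos]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return order


def contains(tip, target):
    """True if the commit at position target is reachable from tip."""
    return target in ancestors(tip)
```

With this toy data, contains(0, 1) is true (the root is reachable from
the merge), while contains(2, 3) is false; both SHA-1 and pack offset
for any visited position are available directly as INDEX[pos].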

> > This would allow people to incrementally modify git to use the new,
> > efficient commit object storage, without breaking backward
> > compatibility with earlier versions of git.  (Just as bup can index
> > huge numbers of packed objects but still stores them in the plain git
> > pack format.)
> 
> Initially, what I'm aiming for is for pack-objects to produce the new 
> format, for index-pack to grok it, and for sha1_file:unpack_entry() to 
> simply regenerate the canonical object format whenever a pack v4 object 
> is encountered.  Also pack-objects would be able to revert the object 
> encoding to the current format on the fly when it is serving a fetch 
> request to a client which is not pack v4 aware, just like we do now with 
> the ofs-delta capability.
> 
> Once that stage is reached, I'll submit the lot and hope that other 
> people will help incrementally converting part of Git to benefit from 
> native access to the pack v4 data.  The tree object walk code would be 
> the first obvious candidate.  And so on.

If I remember correctly, with pack v4 some operations, such as getting
the size of a tree object, need the object re-encoded to the current
format, so they are slower than they should be (and perhaps a bit slower
than the current implementation).  But I think that should be rare (well,
unless one streams to 'git cat-file --batch / --batch-check').
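
A minimal sketch of why the size is not free, assuming a hypothetical
parsed representation where a tree is just a list of (mode, name) pairs
with the SHA-1s kept elsewhere.  Only the canonical entry layout
("<octal mode> <name>\0" followed by a 20-byte binary SHA-1) is taken
from the current format; the function name and input shape are made up.

```python
SHA1_RAW_LEN = 20  # binary SHA-1 length in a canonical tree entry


def canonical_tree_size(entries):
    """Size in bytes the tree would have in the current canonical format.

    entries is a list of (mode, name) pairs; this hypothetical input does
    not store the canonical size, so every entry has to be visited.
    """
    total = 0
    for mode, name in entries:
        mode_len = len(format(mode, "o"))      # octal, no leading zeros
        name_len = len(name.encode("utf-8"))   # name is stored as raw bytes
        total += mode_len + 1 + name_len + 1 + SHA1_RAW_LEN  # space and NUL
    return total


example_tree = [
    (0o100644, "README"),  # regular file
    (0o40000, "src"),      # subdirectory
]
```

Here canonical_tree_size(example_tree) walks both entries just to answer
a question that today is answered from the object header.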

Would pack v4 need index v4?

By the way, the rev-cache project was started mainly to make the
"counting objects" part of clone / fetch faster.  Would pack v4 offer
the same without rev-cache?

-- 
Jakub Narebski
Poland