Re: Smart fetch via HTTP?

Nicolas Pitre <nico@xxxxxxx> · Wed, 16 May 2007 23:45:30 -0400 (EDT)

On Wed, 16 May 2007, Shawn O. Pearce wrote:

> Johannes Schindelin <Johannes.Schindelin@xxxxxx> wrote:
> > Don't forget that those 10% probably do not do you the favour to be in 
> > large chunks. Chances are that _every_ _single_ wanted object is separate 
> > from the others.
> 
> That's completely possible.  Assuming the objects even are packed
> in the first place.  It's very unlikely that you would be able to
> fetch a very large range from an existing packfile; you would be
> submitting most of your range requests for very, very small sections.

Well, in the case of commit objects you're likely to have a bunch of 
them all contiguous.

For tree and blob objects it is less likely.

And of course there is the question of deltas for which you might or 
might not have the base object locally already.

Still... I wonder if this could actually be workable.  A typical daily 
update on the Linux kernel repository might consist of a couple hundred 
or a few thousand objects.  It could still be faster to fetch parts of 
a pack than the whole pack if the size difference is above a certain 
threshold.  It is certainly not worse than fetching loose objects.

Things would be pretty horrid if you think of fetching a commit object, 
parsing it to find out what tree object to fetch, then parsing that 
tree object to find out what other objects to fetch, and so on.

But if you only take the approach of fetching the pack index files, 
finding out which objects the remote has that are not available 
locally, and then fetching all those objects from within pack files 
without even looking at them (except for deltas), then it should be 
possible to issue a couple of requests in parallel and possibly get 
decent performance.  And if it turns out that more than, say, 70% of a 
particular pack is to be fetched (you can determine that up front), then 
you could simply decide to fetch the whole pack.
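
To make the idea concrete, here is a rough sketch in Python (not 
actual git code).  It assumes the remote index has already been parsed 
into a mapping from object name to pack offset; the names plan_ranges, 
range_header and whole_pack_ratio are made up for illustration.  The 
index only gives offsets, so an object's extent is taken to run up to 
the next offset, or up to the 20-byte SHA-1 trailer for the last one.  
Delta bases are ignored here and would need extra handling.

```python
from typing import Dict, List, Set, Tuple

PACK_TRAILER_LEN = 20  # SHA-1 checksum stored at the end of a pack file


def plan_ranges(index: Dict[str, int], pack_size: int, have: Set[str],
                whole_pack_ratio: float = 0.7):
    """Return (fetch_whole_pack, byte_ranges) for the objects we lack.

    index maps object name -> offset in the pack (hypothetical, pre-parsed).
    byte_ranges are half-open (start, end) pairs in pack order.
    """
    # Sort by offset: an object's data ends where the next object starts.
    by_offset = sorted(index.items(), key=lambda kv: kv[1])
    end_of_data = pack_size - PACK_TRAILER_LEN
    ranges: List[Tuple[int, int]] = []
    for i, (name, start) in enumerate(by_offset):
        if name in have:
            continue  # already available locally
        end = by_offset[i + 1][1] if i + 1 < len(by_offset) else end_of_data
        if ranges and ranges[-1][1] == start:
            # Contiguous with the previous wanted object: extend that range.
            ranges[-1] = (ranges[-1][0], end)
        else:
            ranges.append((start, end))
    wanted = sum(end - start for start, end in ranges)
    # Known up front: if most of the pack is wanted, just take all of it.
    fetch_whole = wanted > whole_pack_ratio * pack_size
    return fetch_whole, ranges


def range_header(ranges: List[Tuple[int, int]]) -> str:
    """Format half-open ranges as an HTTP Range header value (inclusive ends)."""
    return "bytes=" + ",".join(f"{s}-{e - 1}" for s, e in ranges)
```

Runs of contiguous objects (the commit case above) collapse into a 
single range, so the number of requests depends on how fragmented the 
missing objects are rather than on how many there are.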

There is no way to sensibly keep those objects packed on the receiving 
end of course, but storing them as loose objects and repacking them 
afterwards should be just fine.
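
For reference, storing one fetched object as a loose object could look 
roughly like the Python sketch below.  It assumes the object has 
already been inflated (and any delta resolved) into its raw content; 
write_loose_object is a made-up helper name, and the plain file write 
is a simplification with no temporary file or rename.

```python
import hashlib
import os
import zlib


def write_loose_object(git_dir: str, obj_type: str, data: bytes) -> str:
    """Store raw object content under objects/xx/yyyy... and return its name."""
    # A loose object is "<type> <size>\0<content>", named by its SHA-1
    # and stored zlib-compressed.
    store = f"{obj_type} {len(data)}".encode() + b"\0" + data
    name = hashlib.sha1(store).hexdigest()
    path = os.path.join(git_dir, "objects", name[:2], name[2:])
    if not os.path.exists(path):  # object already present: nothing to do
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(zlib.compress(store))
    return name
```

A later "git repack" would then fold those loose objects back into a 
pack.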

Of course you'll get objects from branches in the remote repository you 
might not be interested in, but that's a price to pay for such a hack.  
On average the overhead shouldn't be that big anyway if branches within 
a repository are somewhat related.

I think this is something worth experimenting with.

Nicolas