Re: Smart fetch via HTTP?

On Thu, May 17, 2007 at 10:41:37 -0400, Nicolas Pitre wrote:
> On Thu, 17 May 2007, Johannes Schindelin wrote:
> > On Wed, 16 May 2007, Nicolas Pitre wrote:
> And if you have 1) the permission and 2) the CPU power to execute such a 
> cgi on the server and obviously 3) the knowledge to set it up properly, 
> then why aren't you running the Git daemon in the first place?  After 
> all, they both boil down to running git-pack-objects and sending out the 
> result.  I don't think such a solution really buys much.

Yes, it does. While I was at university I had two accounts where I could run
CGI but not a separate server, and now I can get the same on a friend's
server. Neither would likely cope with serving a large, busy git repository,
but something smaller accessed by several people is fine. I think this
situation is quite common for university students.

Of course your suggestion, which moves the logic to the client side, is a good
one, but even a CGI with the logic on the server side would help in some
situations.

> On the other hand, if the client does all the work and provides the 
> server with a list of ranges within a pack it wants to be sent, then you 
> simply have zero special setup to perform on the hosting server and you 
> keep the server load down due to not running pack-objects there.  That, 
> at least, is different enough from the Git daemon to be worth 
> considering.  Not only does it provide an advantage to those who cannot 
> do anything but http out of their segregated network, but it also 
> provide many advantages on the server side too while the cgi approach 
> doesn't.
> 
> And actually finding out the list of objects the remote has that you 
> don't have is not that complex.  It could go as follows:
> 
> 1) Fetch every .idx file the remote has.

... for git itself that is 1.2 MiB, and git is definitely not a huge source
tree. Of course the local side could remember which indices it has already
seen during a previous fetch from that location and avoid re-fetching them.

A slight problem is that git-repack normally recombines everything into a
single pack, so the index would have to be re-fetched anyway.
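The caching idea above can be sketched roughly like this (hypothetical names;
it relies only on the fact that a pack's file name embeds a hash, so a cached
index with a matching name can never be stale, and a repack simply produces a
pack under a new name):

```python
import os

def indices_to_fetch(remote_idx_names, cache_dir):
    """Return the remote .idx files we have not cached from a previous fetch.

    Pack names are content-derived (pack-<sha1>.idx), so a name match means
    the cached copy is still valid; after a repack on the remote, only the
    newly named index needs to be downloaded.
    """
    cached = set(os.listdir(cache_dir)) if os.path.isdir(cache_dir) else set()
    return [name for name in remote_idx_names if name not in cached]
```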

> 2) From those .idx files, keep only a list of objects that are unknown 
>    locally.  A good starting point for doing this really efficiently is 
>    the code for git-pack-redundant.
> 
> 3) From the .idx files we got in (1), create a reverse index to get each 
>    object's size in the remote pack.  The code to do this already exists 
>    in builtin-pack-objects.c.
> 
> 4) With the list of missing objects from (2) along with their offset and 
>    size within a given pack file, fetch those objects from the remote 
>    server.  Either perform multiple requests in parallel, or as someone 
>    mentioned already, provide the server with a list of ranges you want 
>    to be sent.
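Steps (2)-(4) above can be sketched as follows (a rough illustration with
hypothetical names, assuming the idx gives us (sha1, offset) pairs and that
the pack ends with a 20-byte SHA-1 trailer): sort by offset, derive each
object's stored size from the gap to the next offset, then coalesce touching
spans into one HTTP Range header value so consecutive missing objects cost a
single request:

```python
def ranges_for_missing(entries, missing, pack_size, trailer=20):
    """Build an HTTP Range header value covering the missing objects.

    entries:   list of (sha1, offset) pairs read from a remote .idx
    missing:   set of sha1s not present locally (step 2)
    pack_size: total size of the remote .pack file
    """
    by_offset = sorted(entries, key=lambda e: e[1])
    spans = []
    for i, (sha, off) in enumerate(by_offset):
        # "reverse index": an object's stored size is the distance to the
        # next object's offset (or to the pack trailer for the last one)
        end = by_offset[i + 1][1] if i + 1 < len(by_offset) else pack_size - trailer
        if sha in missing:
            spans.append((off, end))
    merged = []
    for off, end in spans:
        if merged and merged[-1][1] == off:
            merged[-1] = (merged[-1][0], end)   # extend the previous span
        else:
            merged.append((off, end))
    # HTTP byte ranges are inclusive, hence end - 1
    return "bytes=" + ",".join("%d-%d" % (a, b - 1) for a, b in merged)
```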

Does the git server really have to do much beyond that? I did not look at the
algorithm that chooses what deltas should be based on, but depending on it, it
may (or may not) be possible to prove that the client has everything it needs
if the server simply sends the objects as it currently stores them.

> 5) Store the received objects as loose objects locally.  If a given 
>    object is a delta, verify if its base is available locally, or if it 
>    is listed amongst those objects to be fetched from the server.  If 
>    not, add it to the list.  In most cases, delta base objects will be 
>    objects already listed to be fetched anyway.  To greatly simplify 
>    things, the loose delta object type from 2 years ago could be revived 
>    (commit 91d7b8afc2) since a repack will get rid of them.
> 
> 6) Repeat (4) and (5) until everything has been fetched.

Unless I am seriously missing something, there is no point in repeating. For
each delta you need to resolve, either:
 - you have its base => OK;
 - you don't have it, but the server does =>
    then it is already in the fetch set calculated in (2);
 - neither you nor the server has it =>
    the repository on the server is corrupted and you cannot fix it.
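The three cases can be written down as a one-pass check (a sketch with
hypothetical names, assuming we can map each fetched delta to its base's
sha1):

```python
def check_delta_bases(fetch_set, local_objects, delta_base_of):
    """Classify every delta base in one pass; no repetition is needed.

    fetch_set:     sha1s to be fetched, as computed in step (2)
    local_objects: sha1s already present locally
    delta_base_of: maps a delta's sha1 to its base's sha1 (None/absent
                   for non-delta objects)
    Returns the bases that exist on neither side, i.e. evidence of a
    corrupt remote repository.
    """
    corrupt = []
    for obj in fetch_set:
        base = delta_base_of.get(obj)
        if base is None:
            continue                      # not a delta at all
        if base in local_objects or base in fetch_set:
            continue                      # cases 1 and 2: resolvable
        corrupt.append(base)              # case 3: remote is broken
    return corrupt
```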

> 7) Run git-pack-objects with the list of fetched objects.
> 
> Et voilà.  Oh, and of course update your local refs from the remote's.
> 
> Actually there is nothing really complex in the above operations. And 
> with this the server side remains really simple with no special setup 
> nor extra load beyond the simple serving of file content.

On the other hand, the amount of data transferred is larger than with the git
server approach, because at least the indices have to be transferred in their
entirety. So each approach has its own advantages.

-- 
						 Jan 'Bulb' Hudec <bulb@xxxxxx>
