Re: GSoC 2009 Prospective student

Nicolas Pitre <nico@xxxxxxx> · Mon, 23 Feb 2009 11:31:38 -0500 (EST)

On Mon, 23 Feb 2009, Shawn O. Pearce wrote:

> Jakub Narebski <jnareb@xxxxxxxxx> wrote:
> > Nicolas Pitre <nico@xxxxxxx> writes:
> > > On Sun, 22 Feb 2009, Miklos Vajna wrote: 
> > > > 
> > > > http://thread.gmane.org/gmane.comp.version-control.git/55254/focus=55298
> > > > 
> > > > Especially Shawn's message, which can be a base for your proposal, if
> > > > you want to work on this.
> > > 
> > > I don't particularly agree with Shawn's proposal.  Reliance on a stable 
> > > sorting on the server side is too fragile, restrictive and cumbersome.
> 
> We already rely on a stable sort in the tree format.  Asking that
> a stable sort be applied when a clone is started so that we can
> later resume it isn't unreasonable.  Hell, that tree format sort
> is a B***H anyway, it's not a simple sort by memcmp().  Almost every
> Git re-implementation gets it wrong the first time out.

That's not the issue at all.  The sorting within a single tree object is
indeed well defined (even if it is arguably a bit odd).  The object
order in a pack is not, and now with threaded delta search the set of
objects that actually get deltified can and does vary between
successive packings of the same repo.  Committing ourselves to
determinism here just for the sake of a restartable clone is not
something I subscribe to.
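
For reference, the tree ordering in question can be sketched as below.
This is only an illustration, assuming the usual rule that entries are
compared bytewise with a directory name treated as if it ended in '/'.
The function names are made up for this example and are not Git's.

```python
def tree_sort_key(name: bytes, is_dir: bool) -> bytes:
    # Assumed rule: a directory compares as if its name had a trailing '/'.
    return name + b"/" if is_dir else name


def sort_tree_entries(entries):
    """Sort (name, is_dir) pairs the way a tree object is assumed to order them."""
    return sorted(entries, key=lambda e: tree_sort_key(e[0], e[1]))


entries = [(b"a", True), (b"a.b", False), (b"a0", False)]
# A plain bytewise sort on the names puts directory "a" first.
plain = sorted(entries, key=lambda e: e[0])
# The tree rule puts "a.b" first, because '.' (0x2e) sorts before '/' (0x2f).
tree = sort_tree_entries(entries)
```

The difference between `plain` and `tree` is the kind of detail a
re-implementation tends to miss on the first attempt.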

> > > Restartable clone is _hard_.  Even I, who have quite a bit of knowledge in
> > > the affected area, haven't found a satisfactory solution yet.
> 
> Sure, it's difficult, but nobody has put effort into it either.
> I think it could be done by enforcing a stable sort during clone
> (and perhaps only during clone).

We should aim for a real solution, not something that is "special" for a 
clone.  After all, a clone is just a fetch, and large fetches may be 
interrupted too.

> > I think it is possible for dumb protocols (using commit walkers) and
> > for (deprecated) rsync.
> 
> Yes, it is possible for the commit walkers to implement a restart,
> as they are actually beginning at the current root and walking back
> in history.  Resuming a large file like a pack is easy to do on HTTP
> if the remote server supports byte range serving.  It's also easy
> to validate on the client that the pack wasn't repacked during the
> idle period (between initial fetch and restart), just validate the
> SHA-1 footer.  If the pack was repacked and came up with the same
> name you'll have a mismatch on the footer.  Discard and try again.

Sure, dumb protocols are easy.  It's one of the few advantages they have 
over the native protocol.
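
A rough sketch of the client-side check Shawn describes, assuming the
pack ends with a 20-byte SHA-1 computed over everything before it.  The
helper names are hypothetical, and the Range header is just the
standard HTTP way of asking for the remaining bytes.

```python
import hashlib
import io

TRAILER_LEN = 20  # size of a SHA-1 digest in bytes


def pack_trailer_matches(stream, total_size):
    """Return True if the last 20 bytes equal the SHA-1 of all bytes before them."""
    if total_size < TRAILER_LEN:
        return False
    remaining = total_size - TRAILER_LEN
    digest = hashlib.sha1()
    while remaining > 0:
        chunk = stream.read(min(65536, remaining))
        if not chunk:
            return False  # stream is shorter than the claimed size
        digest.update(chunk)
        remaining -= len(chunk)
    return stream.read(TRAILER_LEN) == digest.digest()


def range_header(bytes_already_fetched):
    """Header asking the server for everything after what we already have."""
    return {"Range": f"bytes={bytes_already_fetched}-"}


# Toy data standing in for a pack: some body bytes plus the SHA-1 trailer.
body = b"PACK" + b"\x00" * 100
resumed = body + hashlib.sha1(body).digest()
ok = pack_trailer_matches(io.BytesIO(resumed), len(resumed))
```

If the check fails after a resumed download, the two halves came from
different packs and the client discards the file and starts over.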

> But clients can already abuse a server far more by repeatedly doing
> a clone, and then breaking the network connection as soon as the PACK
> header comes down the wire.  The server just spent a lot of CPU
> and IO time building the complete list of the objects to transmit.
> It's really a non-trivial load on the server side.  And by having
> the client break the pipe at the 'PACK' header, the client doesn't
> have to absorb the large data transfer either, making it fairly
> easy to DoS a Git daemon with a small botnet.

This is easy to fix, and something I've posted design notes about a 
while ago.  A cache of generated packs can be made, indexed by a hash of 
the wanted/excluded refs used for pack generation.  This way popular 
fetches (say after Linus pushes stuff to his tree and everyone else 
fetches it at night) would require computation only once.  That, I
think, is something more suitable for a SoC student project.
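
To make the idea concrete, here is a minimal sketch of such a cache.
It is not the design from those notes: the key format, the class and
the `generate` callback are all assumptions made for illustration, and
the refs are assumed to be already resolved to object names so that a
moved ref produces a different key.

```python
import hashlib


def pack_cache_key(wants, excludes):
    """Hash the wanted/excluded object names, independent of their order."""
    h = hashlib.sha1()
    for label, ids in (("want", wants), ("exclude", excludes)):
        for oid in sorted(set(ids)):
            h.update(f"{label} {oid}\n".encode("ascii"))
    return h.hexdigest()


class PackCache:
    """In-memory cache; a real server would presumably store packs on disk."""

    def __init__(self, generate):
        self._generate = generate  # hypothetical callback that builds a pack
        self._packs = {}

    def get(self, wants, excludes):
        key = pack_cache_key(wants, excludes)
        if key not in self._packs:
            # Only the first request for this want/exclude set pays the cost.
            self._packs[key] = self._generate(wants, excludes)
        return self._packs[key]


calls = []


def fake_generate(wants, excludes):
    calls.append((tuple(wants), tuple(excludes)))
    return b"pack-bytes"


cache = PackCache(fake_generate)
first = cache.get(["b", "a"], ["c"])
second = cache.get(["a", "b"], ["c"])  # same set, different order: cache hit
```

Eviction and invalidation are left out here; they would matter a lot
in practice.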

Of course, willfully abusing a git server can still be done despite this,
but that is true for any other service as well.

> That ideas page is a wiki for a reason.  If folks feel differently
> from me, please edit it to improve things!  :-)

/me hates editing wiki pages...  :-/

Nicolas