Re: [PATCH] RFC: git lazy clone proof-of-concept

Jan Holesovsky <kendy@xxxxxxx> writes:

> This is my attempt to implement the 'lazy clone' I've read about a
> bit in the git mailing list archive, but did not see implemented
> anywhere - the clone that fetches a minimal amount of data with the
> possibility to download the rest later (transparently!) when
> necessary.

It was not implemented because it was thought to be hard; git assumes
in many places that if it has an object, it has all objects referenced
by it.
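
That connectivity assumption is, for example, what git-fsck verifies;
in a lazy clone the following would report missing objects by design:

    # report objects that are reachable from the refs but not present
    git fsck --full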

But it is very nice of you to [try to] implement 'lazy clone'/'remote
alternates'.

Could you provide some benchmarks (time, network throughput, latency)
for your implementation?

> Currently we are evaluating the usage of git for OpenOffice.org as
> one of the candidates (SVN is the other one), see
> 
>   http://wiki.services.openoffice.org/wiki/SCM_Migration
> 
> I've provided a git import of OOo with the entire history; the
> problem is that the pack has 2.5G, so it's not too convenient to
> download for casual developers that just want to try it.

One of the reasons why 'lazy clone' was not implemented is that by
using a large enough window and a larger-than-default delta chain
depth, you can repack the "archive pack" much more tightly than with
the default (time- and CPU-conserving) options, and much, much more
tightly than the pack that results from a fast-import driven import.
You can then keep git from trying to repack the archive pack by
marking it with a .keep file (see git-config(1)).
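
A minimal sketch (the window/depth values are only illustrative, and
the pack name is a placeholder):

    # one-off aggressive repack; -f recomputes deltas instead of
    # reusing the existing (fast-import produced) ones
    git repack -a -d -f --window=250 --depth=250

    # protect the resulting archive pack from future repacks
    touch .git/objects/pack/pack-<sha1>.keep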

Both the Mozilla import and the GCC import were packed to below
0.5 GB. Warning: you would need a machine with a large amount of
memory to repack that tightly in a sensible time!

> Shallow clone is not a possibility - we don't get patches through
> mailing lists, so we need the pull/push, and also thanks to the OOo
> development cycle, we have too many living heads which causes the
> shallow clone to download about 1.5G even with --depth 1.

Wouldn't it be easier to fix the shallow clone implementation to allow
pushing from a shallow to a full clone (fetching from a full to a
shallow clone is already implemented), and perhaps also pushing and
pulling between two shallow clones?

As to the many living heads: first, you don't need to fetch all of
them. Currently git-clone has no option to select a subset of heads to
clone, but you can always use git-init plus hand configuration, then
git-remote and git-fetch for the actual fetching.
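
Roughly (the URL and branch names are placeholders):

    git init ooo
    cd ooo
    # track only the heads you actually care about
    git remote add -t master -t stable origin git://example.com/ooo.git
    git fetch origin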


By the way, did you try to split the OpenOffice.org repository at
component boundaries into submodules (subprojects)? This would also
limit the amount of data to download, as you don't need to download
and check out all subprojects.

The problem, of course, is _how_ to split the repository into
submodules. Submodules should be self-contained enough that a
whole-tree commit is always (or almost always) only about a single
submodule.
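
The mechanics on the superproject side would look roughly like this
(repository URLs and component names are hypothetical):

    # in the superproject: add one component as a submodule
    git submodule add git://example.com/ooo/sw.git sw
    git commit -m "Add Writer (sw) as a submodule"

    # in a fresh clone of the superproject: fetch only what you need
    git submodule init sw
    git submodule update sw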

> Lazy clone sounded like the right idea to me.  With this
> proof-of-concept implementation, just about 550M from the 2.5G is
> downloaded, which is still about twice as much in comparison with
> downloading a tarball, but bearable.

Do you have any numbers for the OOo repository, like the number of
revisions, the depth of the commit DAG (the maximum number of
revisions in one line of commits), the number of files, the size of a
checkout, the average file size, etc.?
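
Rough numbers could be gathered with something like:

    git rev-list --all | wc -l   # number of revisions across all heads
    git ls-files | wc -l         # number of files in a checkout
    git count-objects -v         # loose and packed object statistics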

-- 
Jakub Narebski
Poland
ShadeHawk on #git