On Mon, May 25, 2009 at 7:54 PM, Avery Pennarun <apenwarr@xxxxxxxxx> wrote:
> On Mon, May 25, 2009 at 1:35 PM, Asger Ottar Alstrup <asger@xxxxxxxx> wrote:
>> So a poor man's system could work like this:
>>
>> - A reduced repository is defined by a list of paths in a file, I
>> guess with a format similar to .gitignore
>
> Are you sure you want to define the list with exclusions instead of
> inclusions?  I don't really know your use case.

Since the .gitignore format supports !, I believe that should not make
much of a difference.

> Anyway, if you're using git filter-branch, it'll be up to you to fix
> the index to contain the list of files you want.  (See man
> git-filter-branch)

Yes, sure, and that is why I asked whether there is some tool in git
that can give a list of concrete files surviving a .gitignore list of
patterns.

>> - To extract: A copy of the original repository is made. This copy is
>> reduced using git filter-branch. Is there some way of turning a
>> .gitignore syntax file into a concrete list of files? Also, can this
>> entire step be done in one step without the copy? Having to copy the
>> entire project first seems excessive. Will filter-branch preserve
>> and/or prune pack files intelligently?
>
> You probably need to read about the differences between git trees,
> blobs, and commits.  You're not actually "copying" anything; you're
> just creating some new directory structures that contain the
> *existing* blobs.  And of course the existing blobs are in your
> existing packs.

Thanks. OK, I see now that filter-branch will not destroy the original
repository. That is not at all obvious from reading the man page, when
the very first sentence says that it will rewrite history.

But the main point of this exercise is to reduce the size of the
reduced repository so that it can be transferred efficiently.
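Perhaps something like this could work for the extraction step. This is an
untested sketch: "reduce-patterns" is a hypothetical file in .gitignore
syntax listing the paths to *drop*, and I am assuming git ls-files can be
pointed at an arbitrary pattern file the way it can at .gitignore:

```shell
# Untested sketch. "reduce-patterns" is a hypothetical file in .gitignore
# syntax listing the paths to drop from the reduced repository.
# Exported so the filter-branch filter (eval'd in a child shell) sees it.
export patterns="$PWD/reduce-patterns"

# git ls-files --cached --ignored --exclude-from prints the tracked files
# that match the patterns, which answers the "concrete list" question.
git ls-files --cached --ignored --exclude-from="$patterns"

# Rewrite every commit with an index filter that drops those files.
# filter-branch keeps the old history under refs/original/, so nothing
# is destroyed until those refs are deleted and the repo is repacked.
# (-r is GNU xargs: skip the command when the input is empty.)
git filter-branch --index-filter '
    git ls-files -z --cached --ignored --exclude-from="$patterns" |
        xargs -0 -r git rm -q --cached
' --prune-empty -- --all
```

If that holds up, the "one step without the copy" question mostly goes
away, since the rewrite reuses the existing object store.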
So after filter-branch, I guess I would run clone afterwards to make
the new, smaller repository, and then the question becomes: Will clone
reuse and prune packs intelligently?

> Well, you're getting pretty far out there:
>
>  - git is known to work badly with large files, and you have a bunch of
> large files;

As far as I know, git has most of the hooks needed to tune this. There
are still some weak areas where big files are read into memory multiple
times, but I have seen that people are already working on this.

>  - git is intended to manage entire repositories at a time, and you
> want a partial checkout;

The beauty of the subtree-inspired approach is of course that the users
of the reduced repositories WILL in fact be working on an entire
repository. The files are luckily fairly independent in THEIR workflow.

Also, if the mirror-sync proposal gets implemented, one important part
of the distribution piece is also solved: In effect, these systems
combined would give us a kind of narrow clone.

>  - git is intended to download the entire history at once, and you (I
> think) only want part of it.

I do need the entire history for the reduced files.

> By the time you're this far out, maybe what you want isn't git at all.
>  svn would work fine with this arrangement, and people who want
> partial checkouts would rarely benefit from git's distributedness
> anyway, I expect.

In my use case, some people will need to work on the full repository,
and they obviously will have the network and the machines to handle
this. I am currently thinking these people would use something like
glusterfs until mirror-sync is able to solve the problem for us.

However, there is a large group of users that do not need this, but
they DO need the entire history of the files they are interested in.
Subversion does not provide this. Also, Subversion is simply too slow
to handle the kind of files we need to work with.
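To answer my own clone question above, I imagine the shrinking step would
look roughly like this. Again an untested sketch; my understanding is that
a plain local clone hardlinks the existing packs, so --no-local (or a
file:// URL) would be needed to force a fresh, minimal pack:

```shell
# Untested sketch of shrinking the rewritten repository.

# filter-branch keeps backups under refs/original/; as long as those
# refs (and reflog entries) exist, the old objects stay reachable.
git for-each-ref --format='%(refname)' refs/original |
    xargs -r -n1 git update-ref -d
git reflog expire --expire=now --all

# Repack so only the reachable (reduced) objects remain.
git gc --prune=now --aggressive

# A plain local clone would hardlink the existing packs; --no-local
# (or cloning a file:// URL) forces a fresh pack containing only the
# objects reachable from the reduced history.
git clone --no-local . ../reduced-repo
```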
Also, we have run tests on the kind of files we have, and the delta
compression that git uses is very effective at compressing the PDF and
OpenOffice documents we use. The big files we have are primarily image
files, and obviously they do not compress very well. Fortunately, they
do not change much either.

While git might not currently be designed to support this use case, it
still seems like the best system to base this on. Yes, it will need
some work before we can use it for our needs, but that still seems to
be less work than getting other systems to support them.

I appreciate your comments. They are very helpful.

Regards,
Asger
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html