On Mon, May 25, 2009 at 7:54 PM, Avery Pennarun <apenwarr@xxxxxxxxx> wrote:
> On Mon, May 25, 2009 at 1:35 PM, Asger Ottar Alstrup <asger@xxxxxxxx> wrote:
>> So a poor man's system could work like this:
>>
>> - A reduced repository is defined by a list of paths in a file, I
>> guess with a format similar to .gitignore
>
> Are you sure you want to define the list with exclusions instead of
> inclusions?  I don't really know your use case.

Since the .gitignore format supports !, I believe that should not make
much of a difference.

> Anyway, if you're using git filter-branch, it'll be up to you to fix
> the index to contain the list of files you want.  (See man
> git-filter-branch)

Yes, sure, and that is why I asked whether there is some tool in git
that can give a list of concrete files surviving a .gitignore list of
patterns.

>> - To extract: A copy of the original repository is made. This copy is
>> reduced using git filter-branch. Is there some way of turning a
>> .gitignore syntax file into a concrete list of files? Also, can this
>> entire step be done in one step without the copy? Having to copy the
>> entire project first seems excessive. Will filter-branch preserve
>> and/or prune pack files intelligently?
>
> You probably need to read about the differences between git trees,
> blobs, and commits.  You're not actually "copying" anything; you're
> just creating some new directory structures that contain the
> *existing* blobs.  And of course the existing blobs are in your
> existing packs.

Thanks. OK, I see now that filter-branch will not destroy the original
repository. That is not at all obvious from reading the man page, when
the very first sentence says that it will rewrite history.

But the main point of this exercise is to reduce the size of the
reduced repository so that it can be transferred efficiently.
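Perhaps something like this could work for the extraction step. This is an
untested sketch: "reduce-patterns" is a hypothetical file in .gitignore
syntax listing the paths to *drop*, and I am assuming git ls-files can be
pointed at an arbitrary pattern file the way it can at .gitignore:

```shell
# Untested sketch. "reduce-patterns" is a hypothetical file in .gitignore
# syntax listing the paths to drop from the reduced repository.
# Exported so the filter-branch filter (eval'd in a child shell) sees it.
export patterns="$PWD/reduce-patterns"

# git ls-files --cached --ignored --exclude-from prints the tracked files
# that match the patterns, which answers the "concrete list" question.
git ls-files --cached --ignored --exclude-from="$patterns"

# Rewrite every commit with an index filter that drops those files.
# filter-branch keeps the old history under refs/original/, so nothing
# is destroyed until those refs are deleted and the repo is repacked.
# (-r is GNU xargs: skip the command when the input is empty.)
git filter-branch --index-filter '
    git ls-files -z --cached --ignored --exclude-from="$patterns" |
        xargs -0 -r git rm -q --cached
' --prune-empty -- --all
```

If that holds up, the "one step without the copy" question mostly goes
away, since the rewrite reuses the existing object store.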
So after filter-branch, I guess I would run clone afterwards to make
the new, smaller repository, and then the question becomes: Will clone
reuse and prune packs intelligently?

> Well, you're getting pretty far out there:
>
>  - git is known to work badly with large files, and you have a bunch of
> large files;

As far as I know, git has most of the hooks needed to tune this. There
are still some weak areas where big files are read into memory multiple
times, but I have seen that people are already working on this.

>  - git is intended to manage entire repositories at a time, and you
> want a partial checkout;

The beauty of the subtree-inspired approach is of course that the users
of the reduced repositories WILL in fact be working on an entire
repository. The files are luckily fairly independent in THEIR workflow.

Also, if the mirror-sync proposal gets implemented, one important part
of the distribution piece is also solved: In effect, these systems
combined would give us a kind of narrow clone.

>  - git is intended to download the entire history at once, and you (I
> think) only want part of it.

I do need the entire history for the reduced files.

> By the time you're this far out, maybe what you want isn't git at all.
>  svn would work fine with this arrangement, and people who want
> partial checkouts would rarely benefit from git's distributedness
> anyway, I expect.

In my use case, some people will need to work on the full repository,
and they obviously will have the network and the machines to handle
this. I am currently thinking these people would use something like
glusterfs until mirror-sync is able to solve the problem for us.

However, there is a large group of users that do not need this, but
they DO need the entire history of the files they are interested in.
Subversion does not provide this. Also, Subversion is simply too slow
to handle the kind of files we need to work with.
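To answer my own clone question above, I imagine the shrinking step would
look roughly like this. Again an untested sketch; my understanding is that
a plain local clone hardlinks the existing packs, so --no-local (or a
file:// URL) would be needed to force a fresh, minimal pack:

```shell
# Untested sketch of shrinking the rewritten repository.

# filter-branch keeps backups under refs/original/; as long as those
# refs (and reflog entries) exist, the old objects stay reachable.
git for-each-ref --format='%(refname)' refs/original |
    xargs -r -n1 git update-ref -d
git reflog expire --expire=now --all

# Repack so only the reachable (reduced) objects remain.
git gc --prune=now --aggressive

# A plain local clone would hardlink the existing packs; --no-local
# (or cloning a file:// URL) forces a fresh pack containing only the
# objects reachable from the reduced history.
git clone --no-local . ../reduced-repo
```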
Also, we have run tests on the kind of files we have, and the delta
compression that git uses is very effective at compressing the PDF and
OpenOffice documents we use. The big files we have are primarily image
files, and obviously they do not compress very well. Fortunately, they
do not change much either.

While git might not currently be designed to support this use case, it
still seems like the best system to base this on. Yes, it will need
some work before we can use it for our needs, but that still seems to
be less work than getting other systems to support them.

I appreciate your comments. They are very helpful.

Regards,
Asger
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html