On 4/11/07, Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Wed, 11 Apr 2007, David Lang wrote:
> > this is why I was suggesting a --multiple-project option to let you
> > tell fsck about all of the repositories that it needs to look for
> > refs in.
>
> Well, just from a personal observation:
>
>  - I would *personally* actually refuse to share objects with anybody
>    else. I just find the idea too scary: somebody doing something bad
>    to their object store by mistake (running "git prune" without
>    realizing that there are *my* objects there too, or just deciding
>    that they want to play with the object directory by hand, or
>    running a new fancy experimental importer that has a subtle bug
>    wrt object handling, or anything like that).
>
> I'll endorse using "alternates" files, but partly because I know the
> main project is safe (any alternates usage is in the "satellite"
> clones anyway, and they will never write to the alternate object
> directory), and partly because at least for the kernel, we don't have
> branches that get reset in the main project, so there's no reason to
> fear that a "git repack -a -d" will ever screw up any of the
> satellite repositories even by mistake.
>
> But for git projects, even alternates isn't safe, in case somebody
> bases their own work on a version of "pu" that eventually goes away
> (even with reflogs, pruning *eventually* takes place).
>
> So I tend to think that alternates and shared object directories are
> really for "temporary" stuff, or for *managed* repositories that are
> at git *hosting* sites (eg repo.or.cz), and where there is some other
> safety involved, ie users don't actually access the object
> directories directly in any way.
>
> So I've at least personally come to the conclusion that for a
> *developer* (as opposed to a hosting site!), shared object
> directories just never make sense. The downsides are just too big.
> Even alternates is something where you just need to be fairly
> careful!
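(An aside before I reply, since "alternates" will come up again below:
an alternate is just a path listed in .git/objects/info/alternates,
naming another object directory to borrow objects from, and the
"satellite" clones Linus mentions are what "git clone -s" sets up.
The source path here is only illustrative:

    $ git clone -s /pub/repos/linux-2.6 satellite
    $ cat satellite/.git/objects/info/alternates
    /pub/repos/linux-2.6/.git/objects

The satellite reads objects from the parent but never writes there.)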
These arguments all seem pretty convincing to me -- maybe the problem
is that I'm not a "*developer*" right now. Instead I'm part of a
multi-developer *site*. Below I describe a possible way we could use
git without changing it (since I recognize this would be a minority
usage pattern).

We use perforce to manage a mixed hardware/software project (I'm the
55GB check-out guy, remember?). We have at least 3 different kinds of
data with different usage patterns, and using perforce for everything
on one centralized server was not the best solution. Each user
("client") has their own worktree, and the perforce repository is on a
shared central server. You can consider perforce to have the
equivalent of git's index, but it is stored on the server, in one file
("db.have") covering all clients. Obviously that becomes a bottleneck
-- and recently db.have grew larger than the total cache RAM on the
server, which really slowed things down until we moved to a larger
server.

But repository architecture aside, the real problem has been
perforce's usability. Frequently one contributor, having gotten ahead
of the team, needs to share that more recent work with only a few
people. This could be done with p4 branching, but that is really
clunky. So instead the work is pushed out (submitted) to everyone,
causing instability; this is partially remedied by doing it in smaller
chunks. Another perforce problem is that tagging consumes a lot of
server space (and may slow things down as well).

Some of this data will stay in perforce, some will move into the
revision control built into some of our other tools, and I'd like to
try to move some of it into git. The main attraction for the last
group is the lightweight branching that would allow early/tentative
work to be easily shared. I think the subproject work currently being
discussed is going to be very helpful as well -- the perforce
equivalent is chaotic.

We could give each user a work tree and an object repository, and then
have a "release" repository. Unfortunately, this would be slower to
use than the current perforce "solution": users would check in to
their local repository at the speed of gzip, anyone checking that work
out would do so at the speed of gzip, and all work would need to be
resubmitted (using perforce jargon here) to the central repo, again at
the speed of gzip. Currently, people either submit to or check out
from the central repo, and it's all done at the speed of a network
copy.

This speed issue is important because of the size of a commit we'd
like to share (but not yet release): about 40 files, half of them
control files of several KB each, 1/4 of them design files of several
MB each, and the last 1/4 detailed design files 100X larger still.
These 40 files will reference (include) 50 others of several KB each
sprinkled throughout the hierarchy, a few of which might have changed.
And yes, almost all of these are generated files, but the generation
time, and the instability of the tool and script environment, preclude
forcing the other users to regenerate them the way you would with a
.o file.

So, there are two alternative setups. In the first, everyone uses a
shared object repository (everyone's .git/objects is a symlink to it).
In this repository, objects/. , objects/?? , objects/pack , and
objects/info all have "sticky" set, and we do the appropriate
machinations to make all files read-only. There would be an additional
phantom user "git" who owns the shared object repository (the only
user whose .git/objects is not a symlink).
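To make that first setup concrete, it would look something like the
sketch below. This is untested, and the /srv/git-objects path and the
"git" account/group names are made up for illustration:

    # as the phantom user "git": create the shared store, group-writable
    # so users can add objects, sticky so nobody can delete or rename
    # entries they don't own (mode 1775)
    install -d -m 1775 /srv/git-objects \
                       /srv/git-objects/info /srv/git-objects/pack

    # pre-create the 256 loose-object fan-out directories so they carry
    # the sticky bit too (git would otherwise create them on demand,
    # owned by whichever user happened to commit first)
    for b in 0 1 2 3 4 5 6 7 8 9 a b c d e f; do
        for c in 0 1 2 3 4 5 6 7 8 9 a b c d e f; do
            install -d -m 1775 /srv/git-objects/$b$c
        done
    done

    # as each user: replace the private object directory with a symlink
    # (any existing objects would need to be migrated into the store)
    mv ~/project/.git/objects ~/project/.git/objects.old
    ln -s /srv/git-objects ~/project/.git/objects

Git already writes loose object files read-only, so the sticky
directories are the main machination needed: they stop one user from
unlinking another user's objects.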
Users would commit to their own repositories, which would write data
to the shared object repository and update their refs (e.g. HEAD). To
"release", you would push to the ~git repository. This push would be
like a current push -- fast-forward only, figure out the list of
objects that need to be transmitted -- but instead of transmitting the
objects, it would change their ownership to ~git and then update
~git's refs. Since users can share local commits, maybe the ~git
ownership change should happen at commit time instead. This all seems
do-able without any change in git; instead I'd add a few bash wrapper
scripts (one possible shape is sketched at the end of this mail; and
see below for fsck and pack/prune).

The second setup is like the first, but the central repo would have
its own private object repository. You would push to it using the
standard git command.

Finally, users could run git-fsck (though with misleading output,
since no single user's refs reach all the shared objects); they could
run git-prune{,-packed}, but these commands wouldn't be able to delete
anything. If we don't want users to pack, then ~git/.git/objects/pack
would be writable only by ~git. So basically, normal users wouldn't do
the things in this paragraph.

To do a meaningful and safe fsck/prune on the shared repository as
~git, I'd add some scripting. If you require all users' GIT_DIRs to
look like /home/USER/*/.git , then you can gather all their refs and
do a meaningful fsck (a rough sketch appears at the end of this mail).
If not, you could do a fsck --unreachable as ~git and filter the
result by date and/or type. (This sort of corresponds to abandoned
changesets in perforce.) Once you have an fsck method you like, its
filtered output (i.e., the --unreachable objects you want to keep) can
be fed to git-prune. Care would also be required with
git-repack/git-prune-packed, but that seems mostly addressable with
scheduling.

If I proceed down this path, I'd like to implement this procedure
without any change to git's .c or .sh files. It's clear this is a
minority use and should not depend on anything being maintained for it
inside git. I would write a few bash scripts and a README/HOWTO for
possible inclusion in contrib.

BTW, has anyone ever thought of writing an "Administrator's Manual"
for git?

Thanks,
--
Dana L. How  danahow@xxxxxxxxx  +1 650 804 5991 cell
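PS: to make the "push = change ownership" wrapper concrete, here is
the sort of thing I have in mind. It is an untested sketch: the paths,
the script name, and the use of sudo are all placeholders (on Linux
only root may give a file away), and it assumes the objects involved
are still loose rather than packed.

    #!/bin/sh
    # release-push: "push" a branch to the central ~git repository by
    # giving the objects away instead of copying them.
    branch=${1:?usage: release-push <branch>}
    central=/home/git/project/.git     # placeholder path
    store=/srv/git-objects             # the shared object directory

    new=$(git rev-parse --verify "$branch") || exit

    # fast-forward check, as in a normal push
    old=$(GIT_DIR=$central git rev-parse --verify \
          "refs/heads/$branch" 2>/dev/null)
    if test -n "$old" && test "$(git merge-base "$old" "$new")" != "$old"
    then
        echo >&2 "release-push: not a fast forward"; exit 1
    fi

    # every object reachable from the branch but not from ~git's refs
    git rev-list --objects "$new" --not \
        $(GIT_DIR=$central git for-each-ref --format='%(objectname)') |
    while read sha path
    do
        loose=$store/$(echo "$sha" | sed 's|^..|&/|')
        test -f "$loose" && sudo chown git "$loose"
    done

    # finally move the central ref, as a normal push would
    GIT_DIR=$central git update-ref "refs/heads/$branch" "$new" $old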
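PPS: and the fsck side, under the /home/USER/*/.git convention
(equally untested; the ~git repository path is again a placeholder).
Note that objects referenced only by someone's index would still show
up as unreachable, which is one more reason to filter by date before
pruning anything:

    #!/bin/sh
    # shared-fsck: run as ~git, check the shared store against
    # *everyone's* refs, not just our own.
    cd /home/git/project/.git || exit

    # collect every user's refs (and detached HEADs) as extra roots
    heads=$(for d in /home/*/*/.git
            do
                GIT_DIR=$d git for-each-ref --format='%(objectname)'
                GIT_DIR=$d git rev-parse HEAD 2>/dev/null
            done)

    # anything still unreachable is a candidate for pruning; filter
    # this list by date and/or type before acting on it
    git fsck --unreachable $heads

git-prune accepts extra heads on its command line in the same way, so
the same $heads list can be reused when it is finally time to delete
something.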