Junio C Hamano <gitster@xxxxxxxxx> writes:

> Nicolas Pitre <nico@xxxxxxx> writes:
>> On Fri, 9 May 2008, Brandon Casey wrote:
>>
>>> Unreferenced objects are sometimes used by other repositories which have
>>> this repository listed as an alternate. So it may not be a good idea to
>>> make the unreferenced objects inaccessible.
>>
>> Nah. If this is really the case then you shouldn't be running gc at all
>> in the first place.

> True.

> I think the true motivation behind --keep-unreachable is not about the
> shared object store (aka "alternates") but about races between gc and
> push (or fetch). Before push (or fetch) finishes and updates refs, the
> new objects they create would be dangling _and_ the objects these dangling
> objects refer to may be packed but unreferenced. Repacking unreferenced
> packed objects was a way to avoid losing them.

I feel like the current approach of (not very well) keeping track of
which objects are still needed is very messy, not well defined or based
on solid principles, and prone to errors and to losing objects.

Things like git clone --shared can only really be used in extremely
specialized setups, or if pruning of unreferenced objects is completely
disabled in the source repository, or if specialized scripts are used to
do the garbage collection that take into account the references of the
"child" repository. It is my impression that even repo.or.cz, while it
has some safeguards, does not handle garbage collection completely
safely. It would probably be very useful to have examples of such
scripts in contrib.

I think that ultimately, some general-purpose and reliable solution
needs to be found to handle the cases of (1) a repository having its
objects referenced by another via info/alternates; (2) a repository with
multiple working directories (presumably this should warn or error out,
unless given a force option, or detach HEAD and warn, if you try to
switch HEAD for some working directory to the same branch as some other
working directory).

It seems, btw, that a third type of clone, one which merely symlinks the
objects directory, would also be useful, once there is a solution to the
robustness issue. This would be a case (3) that needs to be handled as
well.

It seems clear that ultimately, to handle these three cases, every
repository needs to know about every other repository, probably via a
symlink to the other repository's .git directory. git gc would then also
examine any refs in this directory, making sure to avoid circular
references that might result from following the symlinks. It should also
probably error out if it finds a symlink that doesn't point to a valid
git repository, because such a symlink either refers to a now-deleted
repository, in which case the symlink needs to be cleaned up, or it
refers to a repository that was moved, in which case the symlink needs
to be updated. Simply ignoring invalid symlinks could result in pruning
objects that need to be kept for repositories that have moved.

It is extremely cumbersome to have to worry about whether there are
other concurrent accesses to the repository when running e.g. git gc.
For servers, you may never be able to guarantee that nothing else is
accessing the repository concurrently. Here is a possible solution:

Each git process creates a log file of the references that it has
created. The log file should be named in some way with e.g. the process
id and start time of the process, and simply consist of a list of
20-byte sha1 hashes to be considered additional in-use references for
the purpose of garbage collection.
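
As a rough illustration of the log file idea, here is a minimal sketch
in Python. Nothing in it is existing git code: the
gc-roots.<pid>.<start time> file name, the function names, and the use
of a dedicated log directory are all assumptions made for the example.
It only shows a process appending raw 20-byte hashes and a collector
reading them back as extra roots.

```python
import os
import time

SHA1_LEN = 20  # length of a raw SHA-1 digest in bytes


def log_path(log_dir, pid=None, start_time=None):
    """Build a log file name from pid and start time (hypothetical scheme)."""
    pid = os.getpid() if pid is None else pid
    start_time = int(time.time()) if start_time is None else start_time
    return os.path.join(log_dir, f"gc-roots.{pid}.{start_time}")


def record_object(path, sha1):
    """Append one raw 20-byte SHA-1 to this process's log file."""
    if len(sha1) != SHA1_LEN:
        raise ValueError("expected a 20-byte SHA-1")
    # O_APPEND makes every write land at the current end of the file.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, sha1)
        os.fsync(fd)  # flush to disk before the object is relied upon
    finally:
        os.close(fd)


def read_extra_roots(log_dir):
    """Collect every SHA-1 listed in any log file, as a gc run might."""
    roots = set()
    for name in os.listdir(log_dir):
        if not name.startswith("gc-roots."):
            continue
        with open(os.path.join(log_dir, name), "rb") as f:
            data = f.read()
        # Skip a trailing partial record left by a write still in progress.
        usable = len(data) - len(data) % SHA1_LEN
        for i in range(0, usable, SHA1_LEN):
            roots.add(data[i:i + SHA1_LEN])
    return roots
```

A collector built this way would treat the returned set as additional
in-use references alongside the regular refs. The sketch does not
address the ordering needed to avoid races between writing an object and
logging it.
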
The log file would be cleaned up when the process exits, and would also
be deleted by any instance of git gc that notices a stale log file that
doesn't correspond to a running process. To handle shell scripts that
need to deal with git-hash-object directly, git hash-object could
perhaps be passed a file descriptor or filename of a log file to use
instead of creating one.

Maybe the log file format could be more complicated, and also support
paths to e.g. alternate index files to also consider for references.

Things would need to be done so that race conditions do not occur, but I
think something like this would work.

-- 
Jeremy Maitin-Shepard