"Shawn O. Pearce" <spearce@xxxxxxxxxxx> writes: > Jeremy Maitin-Shepard <jbms@xxxxxxx> wrote: >> It is extremely cumbersome to have to worry about whether there are >> other concurrent accesses to the repository when running e.g. git gc. >> For servers, you may never be able to guarantee that nothing else is >> accessing the repository concurrently. Here is a possible solution: >> >> Each git process creates a log file of the references that it has >> created. The log file should be named in some way with e.g. the process >> id and start time of the process, and simply consist of a list of >> 20-byte sha1 hashes to be considered additional in-use references for >> the purpose of garbage collection. > I believe we partially considered that in the past and discarded it > as far too complex implementation-wise for the benefit it gives us. It doesn't seem all that complex, and I'd say that fundamentally it is the _correct_ way to do things. Being sloppy is always easier in the short run, but then either means the system is permanently broken or results in a lot of "fixing up" work later. I think almost all of the work of handling these log files could be done without impacting a lot of code that calls the relevant APIs that would actually use the log files. I think the biggest impact would be on non-C code, but even for that code, appropriate wrapper could be used to avoid having to make many changes. > The current approach of leaving unreachable loose objects around > for 2 weeks is good enough. Any Git process that has been running > for 2 weeks while still not linking everything it needs into the > reachable refs of that repository is already braindamaged and > shouldn't be running anymore. This sort of reasoning just leads to an inherently unreliable system. Sure, two weeks might seem good enough for nearly all cases, but why _shouldn't_ I be able to leave my editor open for two weeks before typing in my commit message and finishing the commit, or wait for two weeks in the middle of a rebase (it seems that in the new implementation, temporary refs are created basically to do the same thing as the log file I described.) I could easily be typing up my commit message, then switch to something else, and happen not to come back to it for two weeks. Because such a "timeout" based solution isn't really the "correct solution" but will work most of the time, potential problems won't be noticed while testing. Another significant issue is that this timeout means that unreferenced junk has to stay around in the repository for two weeks for no (good) reason. > If we are dealing with a pack file, those are protected by .keep > "lock files" between the time they are created on disk and the > time that the git-fetch or git-receive-pack process has finished > updating the refs to anchor the pack's contents as reachable. > Every once in a while a stale .keep file gets left behind when a > process gets killed by the OS, and its damn annoying to clean up. > I'd hate to clean up logs from every little git-add or git-commit > that aborted in the middle uncleanly. First of all, merely exiting due to an error should not cause log files to be left around. The only thing that should cause log files to be left around is kill -9 or a system crash. Second, by storing the process id and a timestamp of when the log file was created, it is possible to reliably determine if a log file is stale. 
--
Jeremy Maitin-Shepard