On Wed, Mar 18, 2015 at 07:27:46PM -0700, Junio C Hamano wrote:

> > I guess leaving a bunch of loose objects around longer than necessary
> > isn't the end of the world. It wastes space, but it does not actively
> > make the rest of git slower (whereas having a large number of packs
> > does impact performance). So you could probably make "recent enough"
> > be "T < now - gc.pruneExpire / 4" or something. At most we would try
> > to gc 4 times before dropping unreachable objects, and for the
> > default period, that's only once every couple days.
>
> We could simply prune unreachables more aggressively, and it would
> solve this issue at the root cause, no?

Yes, but not too aggressively. You mentioned object archaeology, but my
main interest is avoiding corruption. The mtime check is the thing that
prevents us from pruning objects being used for an operation-in-progress
that has not yet updated a ref.

For some long-running operations, like adding files to a commit, we take
into account references like a blob being mentioned in the index. But I
do not know offhand if there are other long-running operations that
would run into problems if we shortened the expiration time drastically.
Anything building a temporary index is potentially problematic. But if
we assume that operations like that tend to create and reference their
objects within a reasonable time period (say, seconds to minutes), then
the current default of 2 weeks is absurd for this purpose. For raciness
within a single operation, a few seconds is probably enough (e.g., we
may write out a commit object and then update the ref a few milliseconds
later).

The potential for problems is exacerbated by the fact that object `X`
may exist in the filesystem with an old mtime, and then a new operation
wants to reference it. That's made somewhat better by 33d4221
(write_sha1_file: freshen existing objects, 2014-10-15), as before we
could silently turn a file write into a noop. But it's still racy to do:

  git cat-file -e $commit
  git update-ref refs/heads/foo $commit

as we do not update the mtime for a read-only operation like cat-file
(and even if we did, it's still somewhat racy, as prune does not
atomically check the mtime and remove the file).

So I think there's definitely some possible danger with dropping the
default prune expiration time. For a long time GitHub ran with it as
1.hour.ago. We definitely saw some oddities and corruption over the
years that were apparently caused by over-aggressive pruning and/or
raciness. I've fixed a number of bugs, and things did get better as a
result. But I could not say whether all such problems are gone. These
days we do our regular repacks with "--keep-unreachable" and almost
never prune anything.

It's also not clear whether GitHub represents anything close to "normal"
use. We have a much smaller array of operations that we perform (most
objects are either from a push, or from a test-merge between a topic
branch and HEAD). But we also have busy repos that are frequently doing
gc in the background (especially because we share object storage, so
activity on another fork can trigger a gc job that affects a whole
repository network). On workstations, I'd guess most git-gc jobs run
during a fairly quiescent period.

All of which is to say that I don't really know the answer, and there
may be dragons. I'd imagine that dropping the default expiration time
from 2 weeks to 1 day would probably be fine.
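For anyone who wants to poke at this locally, here is a minimal sketch
in a throwaway repository (the path and cutoff values are only examples,
and the blob hash will differ) showing that prune's decision is purely a
comparison of the loose object's mtime against the cutoff, and where the
knob lives:

  git init /tmp/prune-test && cd /tmp/prune-test
  blob=$(echo hello | git hash-object -w --stdin)     # loose, unreachable object

  # prune compares each unreachable loose object's mtime to the cutoff
  git prune --dry-run --verbose --expire=2.weeks.ago  # silent: object too young
  git prune --dry-run --verbose --expire=now          # lists $blob as removable

  # gc hands this config value to prune as its --expire cutoff
  git config gc.pruneExpire 1.day.ago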
A good way to experiment would be for some brave souls to set
gc.pruneexpire themselves, run with it for a few weeks or months, and
see if anything goes wrong.

-Peff