On Thu, Dec 19, 2024 at 03:17:01AM -0800, Junio C Hamano wrote: > Boomman <boomman37@xxxxxxxxx> writes: > > > Yes, if the behavior in case of running out of disk space is to just > > leave the malformed file there, it stands to reason that cleaning up > > those malformed files should be the first operation to do for gc. > > It is misleading to call them malformed, isn't it? When a Git > process creates a packfile (or loose object file for that matter), > they are written under these tmp_* names. When the processes die > without finalizing these (either removing or renaming into their > final names), they are left behind, and it would be better if we can > remove it _before_ another process wants to consume more disk space. We usually automatically clean up our tempfiles if we encounter an error, but don't do so for partially written packs. I think this is mostly historical, though occasionally it can be useful for debugging (e.g., indexing a pack coming over the network). It might make sense to register them as tempfiles in the usual way, possibly with an environment variable option to ask for them to be kept (for debugging). That's not foolproof, since a process can die without cleaning up after itself (e.g., on system crash). But it would mean that a repeatedly failing "git repack -ad" does not fill up the disk. And the decision of when to clean up tempfiles in git-gc is less important. > But the issue is how you tell which one of these "malformed" files > are still being written and will be finalized, and which ones are > leftover ones. You want to remove the latter without molesting the > former. And you want to do so in a portable way, possibly even > across the network file systems. Yeah, I think there are two issues being discussed in this thread: - when to clean up leftover tempfiles - how to decide which tempfiles are leftover The second one is what the OP mentioned for locking. But not only does that have portability questions, I'm not sure it is sufficient. Would we ever write tmp_pack_*, complete our process, and then expect our caller to do something with it (meaning there's a race where no process is holding the lock)? I'm not sure. We definitely write "tmp" packfiles via pack-objects and expect git-repack to move them to their final names. I think we use a slightly different name ("tmp-<pid>-pack-*"), but arguably we should consider cleaning up stale versions of those, too. -Peff