Re: git gc does not clean tmp_pack* files

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Dec 19, 2024 at 03:17:01AM -0800, Junio C Hamano wrote:

> Boomman <boomman37@xxxxxxxxx> writes:
> 
> > Yes, if the behavior in case of running out of disk space is to just
> > leave the malformed file there, it stands to reason that cleaning up
> > those malformed files should be the first operation to do for gc.
> 
> It is misleading to call them malformed, isn't it?  When a Git
> process creates a packfile (or loose object file for that matter),
> they are written under these tmp_* names.  When the processes die
> without finalizing these (either removing or renaming into their
> final names), they are left behind, and it would be better if we can
> remove it _before_ another process wants to consume more disk space.

We usually automatically clean up our tempfiles if we encounter an
error, but don't do so for partially written packs. I think this is
mostly historical, though occasionally it can be useful for debugging
(e.g., indexing a pack coming over the network).

It might make sense to register them as tempfiles in the usual way,
possibly with an environment variable option to ask for them to be kept
(for debugging).

That's not foolproof, since a process can die without cleaning up after
itself (e.g., on system crash). But it would mean that a repeatedly
failing "git repack -ad" does not fill up the disk. And the decision of
when to clean up tempfiles in git-gc is less important.

> But the issue is how you tell which one of these "malformed" files
> are still being written and will be finalized, and which ones are
> leftover ones.  You want to remove the latter without molesting the
> former.  And you want to do so in a portable way, possibly even
> across the network file systems.

Yeah, I think there are two issues being discussed in this thread:

  - when to clean up leftover tempfiles

  - how to decide which tempfiles are leftover

The second one is what the OP mentioned for locking. But not only does
that have portability questions, I'm not sure it is sufficient. Would we
ever write tmp_pack_*, complete our process, and then expect our caller
to do something with it (meaning there's a race where no process is
holding the lock)?

I'm not sure. We definitely write "tmp" packfiles via pack-objects and
expect git-repack to move them to their final names. I think we use a
slightly different name ("tmp-<pid>-pack-*"), but arguably we should
consider cleaning up stale versions of those, too.

-Peff




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux