Re: [PATCH] builtin/fetch.c: clean tmp pack after receive signal

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Mar 16, 2021 at 02:53:36AM +0000, SURA via GitGitGadget wrote:

> In Gitee.com, I often use scripts to start a time-limited

Not related to your patch, but I think this name falls afoul of Git's
trademark policy. See:

  https://git-scm.com/trademark

There's also some discussion in this thread:

  https://lore.kernel.org/git/20170202022655.2jwvudhvo4hmueaw@xxxxxxxxxxxxxxxxxxxxx/

> "git fetch". When the execution time of'git fetch' is too
> long, send a signal (such as SIGINT) to the'git fetch' process
> 
> But I found that after the process exits, there are some
> files in the format of 'objects/pack/tmp_pack_XXXXXX' on the disk.
> They are usually very large (some of them exceed 5G), and'git gc'
> has no effect on them.

This isn't quite true. "git gc" will clean up the temporary files, but
only if the mtime is sufficiently old. The purpose here is to give a
grace period to avoid deleting a file that is actively being written to.
However, we use the same grace period that we use for deleting
unreachable objects, which is absurdly long for this purpose: 2 weeks.
Probably something like an hour would be more appropriate (since the
mtime is updated on each write, this would imply a process not making
forward progress).

That said...

> Obviously this is only a temporary solution, I think it should be fixed from git
> 
> I found fetch will start a 'index-pack' sub-process, this sub-process
> create the tmp_pack
>   1. make 'index-pack' unlink tmp_pack when get signal
>   2. make 'fetch' send signal to 'index-pack' when get signal

I do think this is worth doing. One of the reasons we haven't
traditionally cleaned up temporary packfiles is that the failed state is
sometimes useful for analyzing what went wrong, or even recovering
partial data. But that probably should not be the default mode of
operation (i.e., a config option or environment variable should probably
be able to turn it on for debugging).

Looking at the implementation itself...

> diff --git a/builtin/fetch.c b/builtin/fetch.c
> index 0b90de87c7a2..a87efa23ceb5 100644
> --- a/builtin/fetch.c
> +++ b/builtin/fetch.c
> @@ -224,8 +224,18 @@ static void unlock_pack(void)
>  		transport_unlock_pack(gsecondary);
>  }
>  
> +static void send_signo_to_index_pack(int signo)
> +{
> +	if (gtransport && gtransport->index_pack_pid > 0)
> +		kill(gtransport->index_pack_pid, signo);
> +
> +	if (gsecondary && gsecondary->index_pack_pid > 0)
> +		kill(gsecondary->index_pack_pid, signo);
> +}
> +
>  static void unlock_pack_on_signal(int signo)
>  {
> +	send_signo_to_index_pack(signo);
>  	unlock_pack();
>  	sigchain_pop(signo);
>  	raise(signo);

We'd probably want to also trigger this if we just hit die(), I'd think.
We have a system for killing process children when we exit unexpectedly.
I think just setting the "clean_on_exit" flag on the index-pack
child_process struct would turn this into a one-liner.

Likewise, we'd probably want to do this from receive-pack, too (so that
pushes don't accumulate temporary packfiles on the receiving side).

> diff --git a/builtin/index-pack.c b/builtin/index-pack.c
> index bad57488079c..27d1ebba746e 100644
> --- a/builtin/index-pack.c
> +++ b/builtin/index-pack.c
> [...]
> +static void remove_tmp_pack(void)
> +{
> +	if (tmp_pack_name) 
> +		unlink_or_warn(tmp_pack_name);
> +}
> +
> +static void remove_tmp_pack_on_signal(int signo)
> +{
> +	remove_tmp_pack();
> +	sigchain_pop(signo);
> +	raise(signo);
> +}

Likewise, we have a tempfile cleanup system already.

I think this hunk:

> @@ -336,6 +339,7 @@ static const char *open_pack_file(const char *pack_name)
>  			output_fd = odb_mkstemp(&tmp_file,
>  						"pack/tmp_pack_XXXXXX");
>  			pack_name = strbuf_detach(&tmp_file, NULL);
> +			tmp_pack_name = pack_name;

...can just call register_tempfile(). It should also record the result
so that we don't try to unlink() it after we've already moved it away
from its temporary name (though it's fairly unlikely for somebody else
to have used the name in the interim).

I think you'd want to do the same for the tmp_idx_* files, too. Likewise
for ".rev" files we create starting in v2.31.

I think it would also make sense in create_tmp_packfile(), which is used
during repacking (a different problem space, but really the same thing:
if repacking fails for some reason, we probably shouldn't leave a
useless gigantic half-finished packfile on disk).

We should possibly also do so for tmp_obj_* files. Those can be written
for a fetch or push via unpack-objects (as well as normal local
commands). They're not usually as big as a pack, obviously, but I think
the same principle applies.

> [...]

It would be nice to see some tests covering this functionality, too.
Reproducing it with signals is likely to be racy and not worth it. But I
think that right now index-pack reading a bogus pack (say, one that
fails fsck checks) will leave the tmp_pack_* on disk. And it would not
if we cleanup tempfiles (again, this would be on any exit, not just
signal death, but I think that is what we'd want, and also what
register_tempfile() will do).

-Peff



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux