Hey Peff, On 27 Sep 2022, at 7:44, Jeff King wrote: > On Mon, Sep 26, 2022 at 11:53:03PM +0000, John Cai via GitGitGadget wrote: > >> One place we call tmp_objdir_create() is in git-receive-pack, where >> we create a temporary quarantine directory "incoming". Incoming >> objects will be written to this directory before they get moved to >> the object directory. >> >> We have observed this code leading to a deadlock: >> >> Thread 1 (Thread 0x7f621ba0b200 (LWP 326305)): >> #0 __lll_lock_wait_private (futex=futex@entry=0x7f621bbf8b80 >> <main_arena>) at ./lowlevellock.c:35 >> #1 0x00007f621baa635b in __GI___libc_malloc >> (bytes=bytes@entry=32816) at malloc.c:3064 >> #2 0x00007f621bae9f49 in __alloc_dir (statp=0x7fff2ea7ed60, >> flags=0, close_fd=true, fd=5) >> at ../sysdeps/posix/opendir.c:118 >> #3 opendir_tail (fd=5) at ../sysdeps/posix/opendir.c:69 >> #4 __opendir (name=<optimized out>) >> at ../sysdeps/posix/opendir.c:92 >> #5 0x0000557c19c77de1 in remove_dir_recurse () >> #6 0x0000557c19d81a4f in remove_tmp_objdir_on_signal () >> #7 <signal handler called> > > Yuck. Your analysis looks right, and certainly opendir() can't really > work without allocating memory for the pointer-to-DIR. > >> To prevent this, add a flag REMOVE_DIR_SIGNAL that allows >> remove_dir_recurse() to know that a signal is being handled and avoid >> calling opendir(3). One implication of this change is that when >> signal handling, the temporary directory may not get cleaned up >> properly. > > It's not really "may not" here, is it? It will never get cleaned up on a > signal now. I don't think remove_dir_recursively() will try to rmdir() > in this case. But even if it did, we'll always have a "pack" > subdirectory (minus the small race before we've created it). > > That's unfortunate, but I don't think we really have a portable > alternative. We can't keep an exact list of files to be deleted, because > some of them will be created by sub-processes. We could perhaps exec a > helper that does the deletion, but that seems like a race and > portability nightmare. On Linux, we could probably use open() and > getdents64() to traverse, but obviously that won't work everywhere. It > _might_ be worth having some kind of compat/ wrapper here, to let > supported systems do as good a job as they can. But it's not like there > aren't already cases where we might leave the tmp-objdir directory > around (say, SIGKILL), so this is really just extending the existing > problem to more signals. > > I was going to suggest we should do a better job of cleaning up these > directories via git-gc. But it looks like b3cecf49ea (tmp-objdir: new > API for creating temporary writable databases, 2021-12-06) changed the > default name such that a regular git-gc should do so. So I think we're > covered there. I was wondering about this as well. Thanks for checking on this--that's reassuring. > > The main case we care about is normal exit when index-pack or a hook > sees an error, in which case we should still be cleaning up via the > atexit() handler. > > So I think your patch is going in the right direction, but... > >> static int remove_dir_recurse(struct strbuf *path, int flag, int *kept_up) >> { >> - DIR *dir; >> + DIR *dir = NULL; >> struct dirent *e; >> int ret = 0, original_len = path->len, len, kept_down = 0; >> int only_empty = (flag & REMOVE_DIR_EMPTY_ONLY); >> @@ -3261,7 +3261,10 @@ static int remove_dir_recurse(struct strbuf *path, int flag, int *kept_up) >> } >> >> flag &= ~REMOVE_DIR_KEEP_TOPLEVEL; >> - dir = opendir(path->buf); >> + >> + if ((flag & REMOVE_DIR_SIGNAL) == 0) >> + dir = opendir(path->buf); >> + >> if (!dir) { >> if (errno == ENOENT) >> return keep_toplevel ? -1 : 0; > > We skip calling opendir() entirely, so "dir" will still be NULL. But we > immediately start looking at errno, which will have some undefined value > (based on some previous syscall). > > If we set "errno" to "EACCES" in this case, then we'd at least hit the > rmdir() below: > > if (!dir) { > if (errno == ENOENT) > return keep_toplevel ? -1 : 0; > else if (errno == EACCES && !keep_toplevel) > /* > * An empty dir could be removable even if it > * is unreadable: > */ > return rmdir(path->buf); > else > return -1; > } > > but we know it won't really do anything for our proposed caller, since > it will have files inside the directory that need to be removed before > rmdir() can work. yeah, I suppose the only case it would help is if the directory is already empty. > > Moreover, if you were to combine REMOVE_DIR_SIGNAL with > REMOVE_DIR_KEEP_NESTED_GIT, I suspect that the call to > resolve_gitlink_ref() would end up with similar deadlocks. Obviously > nobody is proposing to do that, but it is a pitfall in the API. > > So all of that makes me think we should not add a new flag here, but > instead just avoid calling the function entirely from > tmp_objdir_destroy_1(). > > But then we can observe that tmp_objdir_destroy_1() is basically doing > nothing if on_signal is set. So there is really no point in setting up > the signal handler at all. We should just set up the atexit() handler. > I.e., something like: > > diff --git a/tmp-objdir.c b/tmp-objdir.c > index a8be92bca1..10549e95db 100644 > --- a/tmp-objdir.c > +++ b/tmp-objdir.c > @@ -169,7 +169,6 @@ struct tmp_objdir *tmp_objdir_create(const char *prefix) > the_tmp_objdir = t; > if (!installed_handlers) { > atexit(remove_tmp_objdir); > - sigchain_push_common(remove_tmp_objdir_on_signal); > installed_handlers++; > } This makes sense and is a clean solution. I guess the only case where we would benefit in calling into remove_tmp_objdir_on_signal() is if the temp dir exists but is empty. I'm not sure how often this would happen to make it worth it? Probably not... > > > with the commit message explaining that we can't do the cleanup in a > portable and signal-safe way, so we just punt on the whole concept. > > There's also some minor cleanup we could do elsewhere to drop the > "on_signal" argument (which can come as part of the same patch, or on > top). > > -Peff