Hi, On Tue, Jun 22, 2021 at 2:25 AM Tao Klerks <tao@xxxxxxxxxx> wrote: > > Hi folks, > > I'm hoping for a little help understanding *intent* around a > particular code comment in dir.c, and reaching out to the whole list Oh, boy, dir.c. We should probably put a simple warning at the top of the file, something like "Abandon all hope, all ye who enter here". > because someone (Junio?) said they consider any mail that *doesn't* > copy the list to be spam anyway, and the original author of the > comment in question (Duy Nguyen) is no longer active on git. > > Context: > I am trying to explore how to get "--untracked-files=all" to play nice > with the untracked cache, so that windows users using tooling that > sets "--untracked-files=all" can benefit from the same > orders-of-magnitude git status performance improvements as commandline > users. > > There is a "naive" approach to this (store the untracked cache in the > index file with whatever dir flags were specified/used in the > recursive walk, and ignore/rewrite that cached data every time the > flags change), and there is presumably a more "comprehensive" approach > (store all the information required in the untracked cache to be able > to satisfy requests with either set of flags - even if this is a > little more expensive on first run). > > The main disadvantage of the "naive" approach is that is every time > you flip-flop between "git status" and "git status -u", the untracked > cache is ignored, a recursive directory walk ensues, and the untracked > cache is rewritten to the index file for the next time you rerun, > hopefully with the same flags. However, I would think in most > situations flip-flopping will be less common - more commonly you're > using a tool or workflow that ends up running the same command(s) > repeatedly... At least, that's my thesis. I would put this "store the > untracked cache every time dir flags change" behavior behind a config > switch, anyway. > > This "naive" approach *is* rather easy to achieve - you just need to > recreate a new "untracked" structure inside > dir.c#validate_untracked_cache() if you find a mismatch of flags (and > make other small fixes to store the correct flags in that new > "untracked" structure). > > The one thing that sticks out after making these changes is a code > comment first introduced in 2015 by Duy Nguyen in ccad261f, explaining > *why* we refuse to use the untracked cache with "-uall": > > * See treat_directory(), case index_nonexistent. Without > > * this flag, we may need to also cache .git file content > > * for the resolve_gitlink_ref() call, which we don't. > > I've seen this comment many times over the past few months, and I've > previously always interpreted it as a "correctness" concern. > > Looking at the original code in question, however (as of ccad261f, > dir.c#treat_directory(), "resolve_gitlink_ref" call), I don't > understand how correctness could be impacted. > > Fast-forward 6 years and all this code has been substantially > overhauled by several folks over the years (most recently and majorly > Elijah Newren), and the "resolve_gitlink_ref()" call is long-gone. I didn't remember removing any resolve_gitlink_ref() calls, but a quick search (`git log -Sresolve_gitlink_ref -p -- dir.c`) suggests it was removed by b22827045e0b (dir: do not traverse repositories with no commits, 2019-04-09). A brief look at that commit message certainly suggests things have changed in a way that _might_ make the original comment irrelevant. > Does this look familiar to anyone? Is there any remaining obvious > reason to be wary of storing the untracked cache structure produced > with '-uall'? Untracked cache is its own beast. I fought with it a bunch, but it's been a while. (See e.g. https://lore.kernel.org/git/CABPp-BEWhavmtDi-ahT+QMWtD6Fe-Ey7cn_82nbetWEQJL=hRA@xxxxxxxxxxxxxx/) I can say that untracked traversal had all sorts of weirdness and breakage, which could have made untracked cache harder to make right and consistent particularly with -uall. See for example 8d92fb292706 (dir: replace exponential algorithm with a linear one, 2020-04-01) and perhaps aa6e1b21e5de (dir: avoid unnecessary traversal into ignored directory, 2021-05-12). In particular, the exponential traversal (paths at depth N might be traversed 2^N times) from the former commit combined with the fact that later traversals often returned a different status for a path than the first traversal certainly could have caused some additional weirdness for the untracked cache. It may be that some combination of Kyle's removal of the resolve_gitlink_ref() and my removal of the exponential traversal of untracked paths make your idea safe...but this is dir.c and untracked-cache, so who really knows? Not sure if that helps, but I hope one or two pointers I provided are somehow meaningful to you. Elijah