Re: [PATCH v5 2/2] untracked-cache: support '--untracked-files=all' if configured

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



 the

On Tue, Mar 29, 2022 at 7:43 PM Junio C Hamano <gitster@xxxxxxxxx> wrote:
>
> "Tao Klerks via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:
>
> > From: Tao Klerks <tao@xxxxxxxxxx>
> >
> > Untracked cache was originally designed to only work with
> > '--untracked-files=normal', but this causes performance issues for UI
> > tooling that wants to see "all" on a frequent basis. On the other hand,
> > the conditions that prevented applicability to the "all" mode no
> > longer seem to apply.
>
> There is a slight leap in logic flow before ", but this causes" that
> forces readers read the above twice and guess what is missing.  I am
> guessing that
>
>     ... was designed to only work with "--untracked-files=normal",
>     and is bypassed when "--untracked-files=all" request, but this
>     causes ...
>
> is what you meant to say.

Yep, will fix.

>
> The claim in the last sentence "... no longer seem to apply" is
> thrown at the readers without much rationale.  Either justify it
> more solidly or discard the claim?
>

I can add references to previous conversation threads around the
topic, and add the word "experimentally".

> > The disqualification of untracked cache has a particularly significant
> > impact on Windows with the defaulted fscache, where the introduction
> > of fsmonitor can make the first and costly directory-iteration happen
> > in untracked file detection, single-threaded, rather than in
> > preload-index on multiple threads. Improving the performance of a
> > "normal" 'git status' run with fsmonitor can make
> > 'git status --untracked-files=all' perform much worse.
>
> The last sentence is a bit hard to parse and very hard to reason
> about.  Do you mean to say "--untracked-files=all is slower when the
> untracked cache is in use, so the performance of 'git status' may be
> improved for '--untracked-files=normal' when the untracked cache is
> used, it hurts when 'git status --untracked-files=all' is run"?

Something like that. The case where untracked cache bypassing (or
disabling) has a huge impact is on windows, and involves fscache and
fsmonitor:
* Untracked cache saves some work under qualifying conditions
* If qualifying conditions are not met (eg when -uall), we do the work
anyway as though untracked cache were disabled.
* fscache optimises directory-iteration on windows, with an N-thread
preload strategy. This is enabled by default, and improves windows
"git status" performance transparently from "abominably slow" to just
"slow".
* fsmonitor, when combined with untracked cache, eliminates recursive
directory-iteration in favor of using a persistent in-memory model of
what's changed on disk (to compare to the index and untracked cache)
* When fsmonitor is enabled and used, the fscache preload strategy is
(obviously) disabled
* When fsmonitor is enabled and untracked cache is disabled or
bypassed, the otherwise-ubiquitous "fscache preload" does not happen,
and the untracked file search (directory iteration) takes... forever.

In other words: In windows, where fscache behavior is critical for
large repos and defaulted-on, when fsmonitor is also enabled which
generally further improves things but disqualifies fscache preload,
the bypassing of untracked cache causes a huge performance impact. A
"-uall" run can take a minute, where a "-unormal" run takes only a
second, because the enumeration of untracked files without using the
untracked cache data means a full directory iteration, not optimized
by fscache. On the other hand, removing fsmonitor and untracked cache
altogether, and leaving just the fscache optimization to do its job
consistently, results in consistent middle-of-the-road performance in
the order of 10 seconds in this example.

Enabling fsmonitor in this situation causes -uall to perform
pathologically badly, because of untracked cache bypassing
(interacting with the fscache-preload-bypassing of fsmonitor).
However, the benefits of dropping status time from 10 seconds to 1
second make fsmonitor very attractive. On the other hand, the
pathological performance with -uall is a problem if you use GUIs that
*require* -uall. Therefore, fixing untracked cache to be compatible
with -uall is critical to large repos on windows where the ideal
experience would be fscache working, fsmonitor working, untracked
cache working, and GUIs able to use -uall without the user waiting a
minute or more and sometimes timing out.

This problem sounds like an edge-case when I describe the chain of
interactions, but is actually very common among large-repo windows
users. Windows users are typically GUI users. GUIs typically use
-uall, as git's default "hide contents of untracked folders" behavior
is obviously limiting.

This patch doesn't actually change anything I've described by default
- but it makes it *possible* to get good/consistent performance with
-uall (with fscache, with fsmonitor, with untracked cache), *if* you
configure status.showuntrackedfiles=all.

Now, coming back to the paragraph in question; maybe, instead of
trying to summarize this windows-fscache-fsmonitor-untrackedcache
interaction, I should be a little more vague. Would this work?

--
When 'git status' runs without using the untracked cache, on a large
repo, on windows, with fsmonitor, it can run very slowly. This can
make GUIs that need to use "-uall" (and therefore currently bypass
untracked cache) unusable when fsmonitor is enabled, on such large
repos.
--


> Let me follow what actually happens in the patch aloud.
>
[...]
> This _guesses_ the user preference from the configuration.
>

Yes - the proposal, in this patch, is that the untracked cache
usability/compatibility be aligned with the configuration. We are
adding an extra meaning to the "status.showuntrackedfiles"
configuration, such that it not only indicates what you want to have
happen when you don't specify "-u" to git status, but it *also*
indicates what kind of status output the untracked cache should be
targeting/supporting.

> > +static void new_untracked_cache(struct index_state *istate, int flags)
> >  {
> >       struct untracked_cache *uc = xcalloc(1, sizeof(*uc));
> >       strbuf_init(&uc->ident, 100);
> >       uc->exclude_per_dir = ".gitignore";
> > -     /* should be the same flags used by git-status */
> > -     uc->dir_flags = DIR_SHOW_OTHER_DIRECTORIES | DIR_HIDE_EMPTY_DIRECTORIES;
> > +     uc->dir_flags = flags >= 0 ? flags : new_untracked_cache_flags(istate);
>
> We use unsigned for the flag word unless there is a reason not to,
> but this function wants to use top-bit as a signal to "guess from
> config".  The caller either dictates what bits to set, or we guess.
>
> And the created uc records the flags.
>

Yep - although the word "guess" may be slightly misleading here. The
proposed semantics are such that if the existing untracked cache flags
are inconsistent with the config, we *discard* the
no-longer-consistent-with-config untracked cache; so it's not so much
a guess as a mandate.

> > @@ -2762,11 +2782,11 @@ static void new_untracked_cache(struct index_state *istate)
> >  void add_untracked_cache(struct index_state *istate)
> >  {
> >       if (!istate->untracked) {
> > -             new_untracked_cache(istate);
> > +             new_untracked_cache(istate, -1);
>
> We are newly creating, so "-1" (guess from config) may be the most
> appropriate (unless the caller of add_untracked_cache() already
> knows what operation it is using for, that is).
>
> >       } else {
> >               if (!ident_in_untracked(istate->untracked)) {
>
> Found an existing one but need to recreate.
>
> >                       free_untracked_cache(istate->untracked);
> > -                     new_untracked_cache(istate);
> > +                     new_untracked_cache(istate, -1);
>
> In this case, is it likely to give us a better guess to read the
> configuration, or copy from the existing untracked file?  My gut
> feeling says it would be the latter, and if that is correct, then
> we might want
>
> +                       int old_flags = istate->untracked->dir_flags;
>                         free_untracked_cache(istate->untracked);
> -                       new_untracked_cache(istate);
> +                       new_untracked_cache(istate, old_flags);
>
> instead?  I dunno.
>

As I noted above, we later consider that an untracked cache with flags
that are inconsistent with the current config is an invalid untracked
cache, and discard it; so setting the flags based on config is the
right thing (the only useful thing) to do.


> > @@ -2781,9 +2801,9 @@ void remove_untracked_cache(struct index_state *istate)
> >  }
> >
> >  static struct untracked_cache_dir *validate_untracked_cache(struct dir_struct *dir,
> > -                                                   int base_len,
> > -                                                   const struct pathspec *pathspec,
> > -                                                   struct index_state *istate)
> > +                                                         int base_len,
> > +                                                         const struct pathspec *pathspec,
> > +                                                         struct index_state *istate)
> >  {
> >       struct untracked_cache_dir *root;
> >       static int untracked_cache_disabled = -1;
>
> Is this just re-indenting?  Not very welcome, but OK.
>

Will fix; I think there was a previous version in which these lines
actually changed.

> > @@ -2814,17 +2834,9 @@ static struct untracked_cache_dir *validate_untracked_cache(struct dir_struct *d
> >       if (base_len || (pathspec && pathspec->nr))
> >               return NULL;
> >
> > -     /* Different set of flags may produce different results */
> > -     if (dir->flags != dir->untracked->dir_flags ||
> > -         /*
> > -          * See treat_directory(), case index_nonexistent. Without
> > -          * this flag, we may need to also cache .git file content
> > -          * for the resolve_gitlink_ref() call, which we don't.
> > -          */
> > -         !(dir->flags & DIR_SHOW_OTHER_DIRECTORIES) ||
> > -         /* We don't support collecting ignore files */
> > -         (dir->flags & (DIR_SHOW_IGNORED | DIR_SHOW_IGNORED_TOO |
> > -                        DIR_COLLECT_IGNORED)))
> > +     /* We don't support collecting ignore files */
> > +     if (dir->flags & (DIR_SHOW_IGNORED | DIR_SHOW_IGNORED_TOO |
> > +                     DIR_COLLECT_IGNORED))
> >               return NULL;
> >
> >       /*
> > @@ -2847,6 +2859,41 @@ static struct untracked_cache_dir *validate_untracked_cache(struct dir_struct *d
> >               return NULL;
> >       }
> >
> > +     /*
> > +      * We don't support using or preparing the untracked cache if
> > +      * the current effective flags don't match the configured
> > +      * flags.
> > +      */
>
> Is that what we want?  What happens when your user does this:
>
>     - set showuntrackedfiles to "normal"
>     - generate the untracked cache
>     - set showuntrackedfiles to "all"
>
> And now the user wants to make a request that wants to see flags
> that are suitable for "normal".
>
> The best case would be for the untracked cache to know what flags
> were in use when it was generated, so that in the above sequence,
> even though the current value of configuration variable and the
> current request from the command line contradict each other, we
> can notice that the untracked cache data is suitable for the current
> request and the right thing happens.
>
> > +     if (dir->flags != new_untracked_cache_flags(istate))
> > +             return NULL;
>
> But this only pays attention to what is in the configuration?  It
> seems to me that it is being too pessimistic, but perhaps it is
> necessary for correctness somehow?

The logic in the current patch is:
* If the configuration doesn't match the request, ignore untracked
cache completely (exit the UC logic)
* If the configuration doesn't match the current untracked cache data,
discard the existing untracked cache data (and later recreate it if
everything else works out)

I think your proposal is something like:
* If the current untracked cache data doesn't match the request
** If the current untracked cache data doesn't match the
configuration, then discard the existing untracked cache (and later
recreate it if everything else works out)
** Otherwise, ignore untracked cache completely (exit the UC logic)
* (otherwise, if current untracked cache data *does* match the
request, assume it's good enough to keep using for now - this is the
new behavior)

I find the latter slightly harder to understand and explain, but I'm
reasonably certain it addresses the case you identified above without
compromising correctness in any other cases - it can only have a
positive impact on performance. I'll have a go at making that change
today.



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux