On Mon, Jan 27, 2020 at 09:06:01PM -0800, Elijah Newren wrote:
> > the runtime of 'git status --ignored' grows quickly with the depth
> > of the untracked directory.  Running this shell loop produces the
> > numbers below:
> >
> >   for depth in $(seq 10 30)
> >   do
> >     dirs=$(for i in $(seq 1 $depth) ; do printf 'dir/' ; done)
> >     rm -rf dir
> >     mkdir -p $dirs
> >     >$dirs/untracked-file
> >     /usr/bin/time --format="$depth: %e" git status --ignored >/dev/null
> >   done
> >
> >   10: 0.01
> >   11: 0.03
> >   12: 0.05
> >   13: 0.11
> >   14: 0.23
> >   15: 0.47
> >   16: 0.97
> >   17: 1.97
> >   18: 3.88
> >   19: 7.85
> >   20: 16.29
> >   21: 32.92
> >   22: 76.24
>
> Wow.
>
> Really nice testcase, though, thanks.
>
> > Beautifully quadratic, isn't it? :)
>
> I think you mean exponential (in particular 2^n rather than n^2).

Ouch, yeah, indeed.

> > Unless I messed up my numbers, with a depth of 120 directories it
> > would take over 6*10^23 years to complete... so yeah, it does
> > qualify as indefinitely.
>
> No comment about how people today are spoiled and addicted to instant
> gratification?  I mean, can't folks just be a little patient? ;-)

Nope.  Notice how my shell loop above goes to 30, but the results only
to 22 :)

> > This slowdown was caused by commit df5bcdf83a (dir: recurse into
> > untracked dirs for ignored files, 2017-05-18), which was part of a
> > patch series to fix 'git clean -d' deleting untracked directories
> > even if they contained ignored files.
> >
> > Cc'ing Samuel, author of that commit, and Elijah, who had quite
> > some fun with 'dir.c' recently.
>
> Heh, yes, what "fun" it was.
>
> Anyway, after digging around for quite a bit today... that commit
> added calling read_directory_recursive() directly from itself for
> certain untracked paths.
> This means that read_directory_recursive() (which I'll abbreviate to
> r_d_r()), when we're dealing with certain untracked paths:
>
>   * Calls treat_path() -> treat_one_path() -> treat_directory() -> r_d_r()
>   * Calls r_d_r() directly as well
>
> So, from the toplevel directory, r_d_r() will call itself twice on
> the next directory down.  For each of those, it'll call r_d_r()
> twice on the second directory down.  From each of those, it'll call
> r_d_r() twice on the third directory in the hierarchy, and so on,
> until we have 2^n calls to r_d_r() for the nth level deep directory.

Got it, thanks.

> Trying to back out the underlying problem, I _think_ the cause
> behind all this is that r_d_r() and friends all use the
> path_treatment enum, which says that the treatment of any path has
> to be one of those four types.  The dichotomy between path_untracked
> and path_recurse in particular means the system has no way to mark
> that something should be both marked as untracked and recursed into,
> yet we definitely need to have some directories marked as untracked
> and we also need to recurse into them.  This, I think, led to
> Samuel's attempt to work around that dichotomy by having the code in
> r_d_r() check for certain path_untracked cases which should also be
> recursed into.  I think adding another type to the enum and shifting
> the logic elsewhere might enable us to both simplify the logic and
> avoid this expensive exponential behavior, but I haven't gotten it
> working yet.  We'll see if my patch goes anywhere, or if it's just
> another dead-end among many.

I was wondering whether it would make sense to give the enum constants
power-of-two values, so we could say 'path_recurse | path_untracked'.
But while this particular combination makes sense, others, at least to
my superficial understanding, don't make sense at all (e.g.
'path_recurse | path_exclude').