On Mon, Jan 27, 2020 at 09:06:01PM -0800, Elijah Newren wrote:
> > the runtime of 'git status --ignored' grows quickly with the depth
> > of the untracked directory.  Running this shell loop produces the
> > numbers below:
> >
> >   for depth in $(seq 10 30)
> >   do
> >     dirs=$(for i in $(seq 1 $depth) ; do printf 'dir/' ; done)
> >     rm -rf dir
> >     mkdir -p $dirs
> >     >$dirs/untracked-file
> >     /usr/bin/time --format="$depth: %e" git status --ignored >/dev/null
> >   done
> >
> >   10: 0.01
> >   11: 0.03
> >   12: 0.05
> >   13: 0.11
> >   14: 0.23
> >   15: 0.47
> >   16: 0.97
> >   17: 1.97
> >   18: 3.88
> >   19: 7.85
> >   20: 16.29
> >   21: 32.92
> >   22: 76.24
>
> Wow.
>
> Really nice testcase, though, thanks.
>
> > Beautifully quadratic, isn't it? :)
>
> I think you mean exponential (in particular 2^n rather than n^2).

Ouch, yeah, indeed.

> > Unless I messed up my numbers, with a depth of 120 directories it
> > would take over 6*10^23 years to complete... so yeah, it does
> > qualify as indefinitely.
>
> No comment about how people today are spoiled and addicted to instant
> gratification?  I mean, can't folks just be a little patient? ;-)

Nope.  Notice how my shell loop above goes to 30, but the results only
to 22 :)

> > This slowdown was caused by commit df5bcdf83a (dir: recurse into
> > untracked dirs for ignored files, 2017-05-18), which was part of a
> > patch series to fix 'git clean -d' deleting untracked directories
> > even if they contained ignored files.
> >
> > Cc'ing Samuel, author of that commit, and Elijah, who had quite
> > some fun with 'dir.c' recently.
>
> Heh, yes, what "fun" it was.
>
> Anyway, after digging around for quite a bit today... that commit
> added calling read_directory_recursive() directly from itself for
> certain untracked paths.
> This means that read_directory_recursive() (which I'll abbreviate to
> r_d_r()), when we're dealing with certain untracked paths:
>
>   * Calls treat_path() -> treat_one_path() -> treat_directory() -> r_d_r()
>   * Calls r_d_r() directly as well
>
> So, from the toplevel directory, r_d_r() will call itself twice on
> the next directory down.  For each of those, it'll call r_d_r()
> twice on the second directory down.  From each of those, it'll call
> r_d_r() twice on the third directory in the hierarchy, and so on,
> until we have 2^n calls to r_d_r() for the nth level deep directory.

Got it, thanks.

> Trying to back out the underlying problem, I _think_ the cause
> behind all this is that r_d_r() and friends all use the
> path_treatment enum, which says that the treatment of any path has
> to be one of those four types.  The dichotomy between path_untracked
> and path_recurse in particular means the system has no way to mark
> that something should be both marked as untracked and recursed into,
> yet we definitely need to have some directories marked as untracked
> and we also need to recurse into them.  This, I think, led to
> Samuel's attempt to work around that dichotomy by having the code in
> r_d_r() check for certain path_untracked cases which should also be
> recursed into.  I think adding another type to the enum and shifting
> the logic elsewhere might enable us to both simplify the logic and
> avoid this expensive exponential behavior, but I haven't gotten it
> working yet.  We'll see if my patch goes anywhere, or if it's just
> another dead-end among many.

I was wondering whether it would make sense to give the enum constants
power-of-two values, so we could say 'path_recurse | path_untracked'.
But while this particular combination makes sense, others, at least to
my superficial understanding, don't make sense at all (e.g.
'path_recurse | path_exclude').