Re: mt/dir-iterator-updates, was Re: What's cooking in git.git (Jul 2019, #01; Wed, 3)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Jul 4, 2019 at 6:30 PM Johannes Schindelin
<Johannes.Schindelin@xxxxxx> wrote:
>
> Hi Matheus,
>
> On Thu, 4 Jul 2019, Matheus Tavares Bernardino wrote:
> >
> > I wanted to take a look at the failures to see if I could help, [...]
> > Could you point me to the right place, please?
[...]
>
> I usually click on the "Tests" tab in that page:
> https://dev.azure.com/gitgitgadget/git/_build/results?buildId=11495&view=ms.vss-test-web.build-test-results-tab
>
> You can click on any of the 1,384 (!) failing test cases, it will pop up a
> pane on the right-hand side that shows the trace of that failing test
> case. For the full trace of that test script, go to "Attachments" and
> download the `Standard_Error_Output.log` (via the horizontal bread-crumbs
> menu you can see when hovering over the file name).

Thanks for the explanation! I inspected some of the
`Standard_Error_Output.log`'s and it seems the problem is always with
local clone (which started to use dir-iterator in this series). It
seems all .git/objects/ dirs are being ignored. That makes sense,
since st_ino will always be 0 on Windows. But your fixup patch should
solve this. Is there any azure build for it?

[...]
> >
> > Hm, I think `stat()` itself uses this strategy of an arbitrary cut-off
> > when resolving a path. So we may also "ignore" circular symlinks and
> > let the iteration continue until the point where `stat()` will return
> > an ELOOP. (But it may be expensive...)
>
> This would not be equivalent, though, as your code also tried to address
> circular references when two or more symlinks are involved, e.g. when
> one symlink points to a directory that has another symlink that points to
> the directory containing the first symlink.

Hm, `stat()` also addresses this case doesn't it? For example:

$ mkdir a b
$ ln -s ../a b/s2a
$ ln -s ../b a/s2b
$ stat b/s2a/s2b/s2a/.../s2b

Gives me:
"too many levels of symbolic links"

> > > Do we _have_ to, though? At some stage the path we come up with is beyond
> > > `PATH_MAX` and we can stop right then and there.
> > >
> > > Besides, the way `find_recursive_symlinks()` is implemented adds quadratic
> > > behavior.
> >
> > Yes, indeed. But it only happens when we have a path like this:
> > `symlink_to_dir1/symlink_to_dir2/symlink_to_dir3/symlink_to_dir4/...`,
> > right? I think this case won't be very usual on actual filesystems,
> > thought.
>
> No, the check is performed in a loop, and that loop is executed whether
> you have symlinks or not. That loop is executed for every item in the
> iteration, and as we cannot assume a flat directory in general (in fact,
> we should assume a directory depth proportional to the total number of
> files), that adds that quadratic behavior.

Oh, you're right. Sorry for the noise, I forgot this function was not
only called for symlinks but for every directory entry :(

An alternative solution would be to use `lstat()` together with
`stat()` to only feed symlinked dirs to this function. This should
reduce a lot the number of calls. Although it'd still be quadratic in
the worst case, would that be any good?

[...]
> > > But I also think there are enough
> > > reasons to do away with this function in the first place.
> >
> > We can delegate the circular symlinks problem to `stat()'s ELOOP`
>
> Not really. I mean, we _already_ delegate to the `ELOOP` condition, we
> cannot even avoid it as long as we keep using `stat()` instead of
> `lstat()`

Yes, indeed. What I meant is that we may chose to _fully_ delegate to
ELOOP. The way it is now, we should detect circular symlinks way
before stat() fails. For example, with the case I showed above, we
would stop at "b/s2a/s2b" whereas stat() would only fail at a much
longer "b/s2a/s2b/s2a/s2b/...", far beyond in the iteration.

> but as I demonstrated above, that only catches part of the
> circular symlinks.
>
> > or `path_len > PATH_MAX`.
>
> This would have the advantage of _not_ adding quadratic behavior.
>
> And just in case you think quadratic behavior would not matter much: Git
> is used to manage the largest repository on this planet, which has over 3
> million index entries when checked out.
>
> Quadratic behavior matters.
>
> I don't know where the dir-iterator is used, but we simply should try our
> best to aim for the optimal time complexity in the first place.

Currently, with the follow symlinks option, dir-iterator is only being
used to iterate over .git/objects. As it's rather shallow, perhaps the
quadratic complexity wouldn't be a huge deal in this case. But I agree
with you that we should take care of performance so that this API may,
as well, be used in other places, in the future.

> > The only downside is the overhead of iterating through directories which
> > will be latter discarded for being in circular symlinks' chains. I mean,
> > the overhead at dir-iterator shouldn't be much, but the one on the code
> > using this API to do something for each of these directories (and its
> > contents), may be. Also we would need to "undo" the work done for each
> > of these directories if we want to ignore circular symlinks and continue
> > the iteration, whereas if we try to detect it a priori, we can skip it
> > from the beginning.
>
> Given that the intent of this patch series is a mere refactoring, I wonder
> why the new, contentious circular symlink detection is slipped into it
> anyway. It complicates the task, makes reviewing a lot harder, and it
> holds up the refactoring.

Yes, I apologize for that. I should have split this into 2 or maybe 3
patchsets... This series started really simple just trying to apply
the dir-iterator API into local clone. But then, other things became
necessary and it got more complex.

So, should I send a fixup patch removing find_recursive_symlinks() or
reroll the series? There's also the option to use stat()+lstat() to
reduce calls to this function, but it doesn't solve the problem on
Windows, anyway.



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux