Re: mt/dir-iterator-updates, was Re: What's cooking in git.git (Jul 2019, #01; Wed, 3)

Hi Matheus,

On Thu, 4 Jul 2019, Matheus Tavares Bernardino wrote:

> On Thu, Jul 4, 2019 at 7:02 AM Johannes Schindelin
> <Johannes.Schindelin@xxxxxx> wrote:
> >
> > On Wed, 3 Jul 2019, Junio C Hamano wrote:
> >
> > > * mt/dir-iterator-updates (2019-06-25) 10 commits
> [...]
> > >  Is this ready for 'next'?
> >
> > No. It still breaks many dozens of test cases on Windows (if not more)
> > because it thinks that it can rely on `st_ino` to detect circular
> > symlinks.
>
> I wanted to take a look at the failures to see if I could help, but
> I'm not very familiar with Azure (and Travis CI doesn't run the
> Windows tests). So the only build I could find containing the code
> from this series is this one[1]. But it only shows 4 failures, and
> they don't seem to relate to dir-iterator... Could you point me to
> the right place, please?

Certainly. In
https://github.com/gitgitgadget/git/branches/all?utf8=%E2%9C%93&query=mt%2F,
you will find the red X next to the branch name `mt/dir-iterator-updates`,
and clicking on it will pop up the list of jobs (including the failing
ones).

If you click on any item, it will first get you to the "Checks" page where
you can find the link "View more details on Azure Pipelines" on the bottom
of the right-hand side pane. This will lead you to
https://dev.azure.com/gitgitgadget/git/_build/results?buildId=11495

I usually click on the "Tests" tab in that page:
https://dev.azure.com/gitgitgadget/git/_build/results?buildId=11495&view=ms.vss-test-web.build-test-results-tab

You can click on any of the 1,384 (!) failing test cases; this will pop
up a pane on the right-hand side that shows the trace of that failing
test case. For the full trace of the test script, go to "Attachments"
and download `Standard_Error_Output.log` (via the horizontal breadcrumbs
menu that appears when hovering over the file name).

> > In
> > https://public-inbox.org/git/nycvar.QRO.7.76.6.1906272046180.44@xxxxxxxxxxxxxxxxx/
> > I had suggested to do something like this:
> >
> [...]
> >
> > However, in the meantime I thought about this a bit more and I remembered
> > how this is done elsewhere: I saw many recursive symlink resolvers that
> > just have an arbitrary cut-off after following, say, 32 links.
> >
> > In fact, Git itself already has this in abspath.c:
> >
> >         /* We allow "recursive" symbolic links. Only within reason, though. */
> >         #ifndef MAXSYMLINKS
> >         #define MAXSYMLINKS 32
> >         #endif
> >
> > But then, the patch in question uses `stat()` instead of `lstat()`, so we
> > would not have any way to count the number of symbolic links we followed.
>
> Hm, I think `stat()` itself uses this strategy of an arbitrary cut-off
> when resolving a path. So we may also "ignore" circular symlinks and
> let the iteration continue until the point where `stat()` will return
> an ELOOP. (But it may be expensive...)

This would not be equivalent, though, as your code also tried to address
circular references when two or more symlinks are involved, e.g. when
one symlink points to a directory that has another symlink that points to
the directory containing the first symlink.

> > Do we _have_ to, though? At some stage the path we come up with is beyond
> > `PATH_MAX` and we can stop right then and there.
> >
> > Besides, the way `find_recursive_symlinks()` is implemented adds quadratic
> > behavior.
>
> Yes, indeed. But it only happens when we have a path like this:
> `symlink_to_dir1/symlink_to_dir2/symlink_to_dir3/symlink_to_dir4/...`,
> right? I think this case won't be very common on actual filesystems,
> though.

No, the check is performed in a loop, and that loop is executed whether
you have symlinks or not. It runs for every item in the iteration, and
since we cannot assume a flat directory in general (in fact, we should
assume a directory depth proportional to the total number of files),
the check adds quadratic behavior.

> > So I would like to submit the idea of simplifying the code drastically,
> > by skipping the `find_recursive_symlinks()` function altogether.
> >
> > This would solve another issue I have with that function, anyway: The name
> > suggests, to me at least, that we follow symlinks recursively. It does
> > not. I think that could have been addressed by using the adjective
> > "circular" instead of "recursive".
>
> Indeed, "circular" sounds much better than "recursive".
>
> > But I also think there are enough
> > reasons to do away with this function in the first place.
>
> We can delegate the circular symlinks problem to `stat()'s ELOOP`

Not really. I mean, we _already_ delegate to the `ELOOP` condition; we
cannot even avoid it as long as we keep using `stat()` instead of
`lstat()`. But as I demonstrated above, that catches only a subset of
the circular symlinks.

> or `path_len > PATH_MAX`.

This would have the advantage of _not_ adding quadratic behavior.

And just in case you think quadratic behavior would not matter much: Git
is used to manage the largest repository on this planet, which has over 3
million index entries when checked out.

Quadratic behavior matters.

I don't know where the dir-iterator is used, but we should simply try
our best to aim for optimal time complexity in the first place.

> The only downside is the overhead of iterating through directories
> which will later be discarded for being in circular symlinks' chains.
> I mean, the overhead in the dir-iterator itself shouldn't be much, but
> the overhead in the code using this API to do something for each of
> these directories (and their contents) may be. Also, we would need to
> "undo" the work done for each of these directories if we want to
> ignore circular symlinks and continue the iteration, whereas if we try
> to detect them a priori, we can skip them from the beginning.

Given that the intent of this patch series is a mere refactoring, I
wonder why the new, contentious circular symlink detection was slipped
into it at all. It complicates the task, makes reviewing a lot harder,
and holds up the refactoring.

Ciao,
Dscho



