Re: [PATCH v2 00/13] name-rev: eliminate recursion

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Johannes,

The patch works very well on Windows.

Tested it on a repository with 146823 commits and for every commit a note.

Groeten,
Sebastiaan Dammann

On Tue, 12 Nov 2019 at 20:18, Johannes Schindelin
<Johannes.Schindelin@xxxxxx> wrote:
>
> Hi,
>
> [Cc:ing Sebastian, as they indicated in
> https://public-inbox.org/git/CAE7Eq9hEiVf1rMNdWx55_nQsz2gVv0N%2Bs1KckK1evtmruqcHyA@xxxxxxxxxxxxxx/t/#u
> that they would be interested in testing this]
>
> Sebastian, could you test this patch series? Since you are on Windows,
> you should be able to do so by
>
> - installing Git for Windows' SDK
>   (https://gitforwindows.org/#download-sdk)
> - `sdk cd git` (possibly `sdk init git`, although that should be
>   implied)
> - `sdk init build-extra` followed by
>   `/usr/src/build-extra/apply-from-public-inbox.sh
>   https://public-inbox.org/git/20191112103821.30265-1-szeder.dev@xxxxxxxxx/`
> - `sdk build`
>
> The result should include a `git.exe` in `/usr/src/git/` that you can
> copy to your server and test via `/path/to/git.exe name-rev ...`.
>
> Ciao,
> Johannes
>
> On Tue, 12 Nov 2019, SZEDER Gábor wrote:
>
> > 'git name-rev' is implemented using a recursive algorithm, and,
> > consequently, it can segfault in deep histories (e.g. WebKit), and
> > thanks to a test case demonstrating this limitation every test run
> > results in a dmesg entry logging the segfaulting git process.
> >
> > This patch series eliminates the recursion.
> >
> > Patches 1-5 are while-at-it cleanups I noticed on the way, and patch 6
> > improves test coverage.  Patches 7-11 are preparatory refactorings
> > that are supposed to make this series easier to follow, and make patch
> > 12, the one finally eliminating the recursion, somewhat shorter, and
> > even much shorter when viewed with '--ignore-all-space'.  Patch 13
> > cleans up after those preparatory steps.
> >
> > Changes since v1:
> >
> >   - Patch 12 now eliminates the recursion using a LIFO 'prio_queue'
> >     instead of a 'commit_list' to avoid any performance penalty.
> >
> >   - Commit message updates, clarifications, typofixes, missing
> >     signoffs, etc., most notably in patches 6 and 12.
> >
> >   - Updated ASCII art history graphs.
> >
> >   - Replaced the strbuf_suffix() cleanup in patch 3 with René's
> >     suggestion; now that patch needs his signoff.
> >
> >   - Dropped the last two patches plugging memory leaks; René's plan
> >     to clean up memory ownership looked more promising, and that
> >     would make these two dropped patches moot anyway.
> >
> > v1: https://public-inbox.org/git/20190919214712.7348-1-szeder.dev@xxxxxxxxx/T/#u
> >
> > René Scharfe (1):
> >   name-rev: use strbuf_strip_suffix() in get_rev_name()
> >
> > SZEDER Gábor (12):
> >   t6120-describe: correct test repo history graph in comment
> >   t6120-describe: modernize the 'check_describe' helper
> >   name-rev: avoid unnecessary cast in name_ref()
> >   name-rev: use sizeof(*ptr) instead of sizeof(type) in allocation
> >   t6120: add a test to cover inner conditions in 'git name-rev's
> >     name_rev()
> >   name-rev: extract creating/updating a 'struct name_rev' into a helper
> >   name-rev: pull out deref handling from the recursion
> >   name-rev: restructure parsing commits and applying date cutoff
> >   name-rev: restructure creating/updating 'struct rev_name' instances
> >   name-rev: drop name_rev()'s 'generation' and 'distance' parameters
> >   name-rev: eliminate recursion in name_rev()
> >   name-rev: cleanup name_ref()
> >
> >  builtin/name-rev.c  | 147 +++++++++++++++++++++++++++++---------------
> >  t/t6120-describe.sh |  72 +++++++++++++++++-----
> >  2 files changed, 153 insertions(+), 66 deletions(-)
> >
> > Range-diff:
> >  1:  673da20e3d !  1:  8d70ed050d t6120-describe: correct test repo history graph in comment
> >     @@ t/t6120-describe.sh
> >      -test_description='test describe
> >      +test_description='test describe'
> >      +
> >     -+#       ,---o----o----o-----.
> >     -+#      /   D,R   e           \
> >     -+#  o--o-----o-------------o---o----x
> >     -+#      \    B            /
> >     -+#       `---o----o----o-'
> >     -+#                A    c
> >     ++#  o---o-----o----o----o-------o----x
> >     ++#       \   D,R   e           /
> >     ++#        \---o-------------o-'
> >     ++#         \  B            /
> >     ++#          `-o----o----o-'
> >     ++#                 A    c
> >     ++#
> >     ++# First parent of a merge commit is on the same line, second parent below.
> >
> >      -                       B
> >      -        .--------------o----o----o----x
> >  2:  05df899693 =  2:  3720b6859d t6120-describe: modernize the 'check_describe' helper
> >  3:  7b0227cfea !  3:  ad2f2eee68 name-rev: use strip_suffix() in get_rev_name()
> >     @@
> >       ## Metadata ##
> >     -Author: SZEDER Gábor <szeder.dev@xxxxxxxxx>
> >     +Author: René Scharfe <l.s.r@xxxxxx>
> >
> >       ## Commit message ##
> >     -    name-rev: use strip_suffix() in get_rev_name()
> >     +    name-rev: use strbuf_strip_suffix() in get_rev_name()
> >
> >     -    Use strip_suffix() instead of open-coding it, making the code more
> >     -    idiomatic.
> >     +    get_name_rev() basically open-codes strip_suffix() before adding a
> >     +    string to a strbuf.
> >
> >     +    Let's use the strbuf right from the beginning, i.e. add the whole
> >     +    string to the strbuf and then use strbuf_strip_suffix(), making the
> >     +    code more idiomatic.
> >     +
> >     +    [TODO: René's signoff!]
> >          Signed-off-by: SZEDER Gábor <szeder.dev@xxxxxxxxx>
> >
> >       ## builtin/name-rev.c ##
> >     @@ builtin/name-rev.c: static const char *get_rev_name(const struct object *o, stru
> >      -                int len = strlen(n->tip_name);
> >      -                if (len > 2 && !strcmp(n->tip_name + len - 2, "^0"))
> >      -                        len -= 2;
> >     -+                size_t len;
> >     -+                strip_suffix(n->tip_name, "^0", &len);
> >                       strbuf_reset(buf);
> >      -                strbuf_addf(buf, "%.*s~%d", len, n->tip_name, n->generation);
> >     -+                strbuf_addf(buf, "%.*s~%d", (int) len, n->tip_name,
> >     -+                            n->generation);
> >     ++                strbuf_addstr(buf, n->tip_name);
> >     ++                strbuf_strip_suffix(buf, "^0");
> >     ++                strbuf_addf(buf, "~%d", n->generation);
> >                       return buf->buf;
> >               }
> >       }
> >  4:  40faecdc2a =  4:  c86a2ae2d0 name-rev: avoid unnecessary cast in name_ref()
> >  5:  c71df3dadf =  5:  4fc960cc05 name-rev: use sizeof(*ptr) instead of sizeof(type) in allocation
> >  6:  1dcb76072f !  6:  1493cb4484 t6120: add a test to cover inner conditions in 'git name-rev's name_rev()
> >     @@ Commit message
> >          looks like this:
> >
> >            if (parent_number > 1) {
> >     -        if (generation > 0)
> >     -          // do stuff #1
> >     -        else
> >     -          // do stuff #2
> >     +          if (generation > 0)
> >     +              // branch #1
> >     +              new_name = ...
> >     +          else
> >     +              // branch #2
> >     +              new_name = ...
> >     +          name_rev(parent, new_name, ...);
> >            } else {
> >     -         // do stuff #3
> >     +          // branch #3
> >     +          name_rev(...);
> >            }
> >
> >          These conditions are not covered properly in the test suite.  As far
> >     @@ Commit message
> >          command's output, because the repository used in that test script
> >          contains several branches and tags pointing somewhere into the middle
> >          of the commit DAG, and thus result in a better name for the
> >     -    to-be-named commit.  In an early version of this patch series I
> >     -    managed to mess up those conditions (every single one of them at
> >     -    once!), but the whole test suite still passed successfully.
> >     +    to-be-named commit.  This can hide bugs: e.g. by replacing the
> >     +    'new_name' parameter of the first recursive name_rev() call with
> >     +    'tip_name' (effectively making both branch #1 and #2 a noop) 'git
> >     +    name-rev --all' shows thousands of bogus names in the Git repository,
> >     +    but the whole test suite still passes successfully.  In an early
> >     +    version of a later patch in this series I managed to mess up all three
> >     +    branches (at once!), but the test suite still passed.
> >
> >          So add a new test case that operates on the following history:
> >
> >     -        -----------master
> >     -       /          /
> >     -      A----------M2
> >     -       \        /
> >     -        \---M1-C
> >     -         \ /
> >     -          B
> >     +      A--------------master
> >     +       \            /
> >     +        \----------M2
> >     +         \        /
> >     +          \---M1-C
> >     +           \ /
> >     +            B
> >
> >     -    and names the commit 'B', where:
> >     +    and names the commit 'B' to make sure that all three branches are
> >     +    crucial to determine 'B's name:
> >
> >     -      - The merge commit at master makes sure that the 'do stuff #3'
> >     -        affects the final name.
> >     +      - There is only a single ref, so all names are based on 'master',
> >     +        without any undesired interference from other refs.
> >
> >     -      - The merge commit M2 make sure that the 'do stuff #1' part
> >     -        affects the final name.
> >     +      - Each time name_rev() follows the second parent of a merge commit,
> >     +        it appends "^2" to the name.  Following 'master's second parent
> >     +        right at the start ensures that all commits on the ancestry path
> >     +        from 'master' to 'B' have a different base name from the original
> >     +        'tip_name' of the very first name_rev() invocation.  Currently,
> >     +        while name_rev() is recursive, it doesn't matter, but it will be
> >     +        necessary to properly cover all three branches after the recursion
> >     +        is eliminated later in this series.
> >
> >     -      - And M1 makes sure that the 'do stuff #2' part affects the final
> >     -        name.
> >     +      - Following 'M2's second parent makes sure that branch #2 (i.e. when
> >     +        'generation = 0') affects 'B's name.
> >     +
> >     +      - Following the only parent of the non-merge commit 'C' ensures that
> >     +        branch #3 affects 'B's name, and that it increments 'generation'.
> >     +
> >     +      - Coming from 'C' 'generation' is 1, thus following 'M1's second
> >     +        parent makes sure that branch #1 affects 'B's name.
> >
> >          Signed-off-by: SZEDER Gábor <szeder.dev@xxxxxxxxx>
> >
> >       ## t/t6120-describe.sh ##
> >     -@@ t/t6120-describe.sh: test_expect_success 'describe complains about missing object' '
> >     -         test_must_fail git describe $ZERO_OID
> >     +@@ t/t6120-describe.sh: test_expect_success 'name-rev a rev shortly after epoch' '
> >     +         test_cmp expect actual
> >       '
> >
> >     -+#   -----------master
> >     -+#  /          /
> >     -+# A----------M2
> >     -+#  \        /
> >     -+#   \---M1-C
> >     -+#    \ /
> >     -+#     B
> >     -+test_expect_success 'test' '
> >     ++# A--------------master
> >     ++#  \            /
> >     ++#   \----------M2
> >     ++#    \        /
> >     ++#     \---M1-C
> >     ++#      \ /
> >     ++#       B
> >     ++test_expect_success 'name-rev covers all conditions while looking at parents' '
> >      +        git init repo &&
> >      +        (
> >      +                cd repo &&
> >     @@ t/t6120-describe.sh: test_expect_success 'describe complains about missing objec
> >      +                git checkout master &&
> >      +                git merge --no-ff HEAD@{1} &&
> >      +
> >     -+                git log --graph --oneline &&
> >     -+
> >      +                echo "$B master^2^2~1^2" >expect &&
> >      +                git name-rev $B >actual &&
> >      +
> >  7:  bdd8378b06 =  7:  fc842e578b name-rev: extract creating/updating a 'struct name_rev' into a helper
> >  8:  ce21c351f9 !  8:  7f182503e2 name-rev: pull out deref handling from the recursion
> >     @@ Commit message
> >          Extract this condition from the recursion into name_rev()'s caller and
> >          drop the function's 'deref' parameter.  This makes eliminating the
> >          recursion a bit easier to follow, and it will be moved back into
> >     -    name_rev() after the recursion is elminated.
> >     +    name_rev() after the recursion is eliminated.
> >
> >          Furthermore, drop the condition that die()s when both 'deref' and
> >          'generation' are non-null (which should have been a BUG() to begin
> >     @@ Commit message
> >
> >          Note that this change reintroduces the memory leak that was plugged in
> >          in commit 5308224633 (name-rev: avoid leaking memory in the `deref`
> >     -    case, 2017-05-04), but a later patch in this series will plug it in
> >     -    again.
> >     +    case, 2017-05-04), but a later patch (name-rev: restructure
> >     +    creating/updating 'struct rev_name' instances) in this series will
> >     +    plug it in again.
> >
> >          Signed-off-by: SZEDER Gábor <szeder.dev@xxxxxxxxx>
> >
> >  9:  c8acc6b597 !  9:  0cdd40b75b name-rev: restructure parsing commits and applying date cutoff
> >     @@ Commit message
> >          name_rev() and name_rev() itself as it iterates over the parent
> >          commits.
> >
> >     -    This makes eliminating the recursion a bit easier to follow, and it
> >     -    will be moved back to name_rev() after the recursion is eliminated.
> >     +    This makes eliminating the recursion a bit easier to follow, and the
> >     +    condition moved to name_ref() will be moved back to name_rev() after
> >     +    the recursion is eliminated.
> >
> >          Signed-off-by: SZEDER Gábor <szeder.dev@xxxxxxxxx>
> >
> > 10:  c731f27158 ! 10:  e1733e3c56 name-rev: restructure creating/updating 'struct rev_name' instances
> >     @@ Commit message
> >          At the beginning of the recursive name_rev() function it creates a new
> >          'struct rev_name' instance for each previously unvisited commit or, if
> >          this visit results in better name for an already visited commit, then
> >     -    updates the 'struct rev_name' instance attached to to the commit, or
> >     +    updates the 'struct rev_name' instance attached to the commit, or
> >          returns early.
> >
> >          Restructure this so it's caller creates or updates the 'struct
> >     @@ Commit message
> >          parameter, i.e. both name_ref() before calling name_rev() and
> >          name_rev() itself as it iterates over the parent commits.
> >
> >     -    This makes eliminating the recursion a bit easier to follow, and it
> >     -    will be moved back to name_rev() after the recursion is eliminated.
> >     +    This makes eliminating the recursion a bit easier to follow, and the
> >     +    condition moved to name_ref() will be moved back to name_rev() after
> >     +    the recursion is eliminated.
> >
> >          This change also plugs the memory leak that was temporarily unplugged
> >          in the earlier "name-rev: pull out deref handling from the recursion"
> > 11:  ba14bde230 ! 11:  bd6e2e6d87 name-rev: drop name_rev()'s 'generation' and 'distance' parameters
> >     @@ Commit message
> >          'taggerdate' and 'from_tag' parameters as well, but those parameters
> >          will be necessary later, after the recursion is eliminated.
> >
> >     -    Drop name_rev()'s 'generation' and 'distance' parameters.
> >     +    Signed-off-by: SZEDER Gábor <szeder.dev@xxxxxxxxx>
> >
> >       ## builtin/name-rev.c ##
> >      @@ builtin/name-rev.c: static struct rev_name *create_or_update_name(struct commit *commit,
> > 12:  2d03ac11f3 ! 12:  0cf63c6d64 name-rev: eliminate recursion in name_rev()
> >     @@ Commit message
> >          segfault when processing a deep history if it exhausts the available
> >          stack space.  E.g. running 'git name-rev --all' and 'git name-rev
> >          HEAD~100000' in the gcc, gecko-dev, llvm, and WebKit repositories
> >     -    results in segfaults on my machine.
> >     +    results in segfaults on my machine ('ulimit -s' reports 8192kB of
> >     +    stack size limit), and nowadays the former segfaults in the Linux repo
> >     +    as well (it reached the necessasry depth sometime between v5.3-rc4 and
> >     +    -rc5).
> >
> >          Eliminate the recursion by inserting the interesting parents into a
> >     -    'commit_list' and iteratating until the list becomes empty.
> >     +    LIFO 'prio_queue' [1] and iterating until the queue becomes empty.
> >
> >     -    Note that the order in which the parent commits are added to that list
> >     -    is important: they must be inserted at the beginning of the list, and
> >     -    their relative order must be kept as well, because otherwise
> >     -    performance suffers.
> >     +    Note that the parent commits must be added in reverse order to the
> >     +    LIFO 'prio_queue', so their relative order is preserved during
> >     +    processing, i.e. the first parent should come out first from the
> >     +    queue, because otherwise performance greatly suffers on mergy
> >     +    histories [2].
> >
> >          The stacksize-limited test 'name-rev works in a deep repo' in
> >          't6120-describe.sh' demonstrated this issue and expected failure.  Now
> >     -    the recursion is gone, so flip it to expect success.
> >     -
> >     -    Also gone are the dmesg entries logging the segfault of the git
> >     -    process on every execution of the test suite.
> >     -
> >     -    Unfortunately, eliminating the recursion comes with a performance
> >     -    penaly: 'git name-rev --all' tends to be between 15-20% slower than
> >     -    before.
> >     +    the recursion is gone, so flip it to expect success.  Also gone are
> >     +    the dmesg entries logging the segfault of that segfaulting 'git
> >     +    name-rev' process on every execution of the test suite.
> >
> >          Note that this slightly changes the order of lines in the output of
> >          'git name-rev --all', usually swapping two lines every 35 lines in
> >     @@ Commit message
> >
> >          This patch is best viewed with '--ignore-all-space'.
> >
> >     +    [1] Early versions of this patch used a 'commit_list', resulting in
> >     +        ~15% performance penalty for 'git name-rev --all' in 'linux.git',
> >     +        presumably because of the memory allocation and release for each
> >     +        insertion and removal. Using a LIFO 'prio_queue' has basically no
> >     +        effect on performance.
> >     +
> >     +    [2] We prefer shorter names, i.e. 'v0.1~234' is preferred over
> >     +        'v0.1^2~5', meaning that usually following the first parent of a
> >     +        merge results in the best name for its ancestors.  So when later
> >     +        we follow the remaining parent(s) of a merge, and reach an already
> >     +        named commit, then we usually find that we can't give that commit
> >     +        a better name, and thus we don't have to visit any of its
> >     +        ancestors again.
> >     +
> >     +        OTOH, if we were to follow the Nth parent of the merge first, then
> >     +        the name of all its ancestors would include a corresponding '^N'.
> >     +        Those are not the best names for those commits, so when later we
> >     +        reach an already named commit following the first parent of that
> >     +        merge, then we would have to update the name of that commit and
> >     +        the names of all of its ancestors as well.  Consequently, we would
> >     +        have to visit many commits several times, resulting in a
> >     +        significant slowdown.
> >     +
> >          Signed-off-by: SZEDER Gábor <szeder.dev@xxxxxxxxx>
> >
> >       ## builtin/name-rev.c ##
> >     +@@
> >     + #include "tag.h"
> >     + #include "refs.h"
> >     + #include "parse-options.h"
> >     ++#include "prio-queue.h"
> >     + #include "sha1-lookup.h"
> >     + #include "commit-slab.h"
> >     +
> >      @@ builtin/name-rev.c: static struct rev_name *create_or_update_name(struct commit *commit,
> >                       return NULL;
> >       }
> >     @@ builtin/name-rev.c: static struct rev_name *create_or_update_name(struct commit
> >      -                parse_commit(parent);
> >      -                if (parent->date < cutoff)
> >      -                        continue;
> >     -+        struct commit_list *list = NULL;
> >     ++        struct prio_queue queue;
> >     ++        struct commit *commit;
> >     ++        struct commit **parents_to_queue = NULL;
> >     ++        size_t parents_to_queue_nr, parents_to_queue_alloc = 0;
> >      +
> >     -+        commit_list_insert(start_commit, &list);
> >     ++        memset(&queue, 0, sizeof(queue)); /* Use the prio_queue as LIFO */
> >     ++        prio_queue_put(&queue, start_commit);
> >      +
> >     -+        while (list) {
> >     -+                struct commit *commit = pop_commit(&list);
> >     ++        while ((commit = prio_queue_get(&queue))) {
> >      +                struct rev_name *name = get_commit_rev_name(commit);
> >     -+                struct commit_list *parents, *new_parents = NULL;
> >     -+                struct commit_list **last_new_parent = &new_parents;
> >     ++                struct commit_list *parents;
> >      +                int parent_number = 1;
> >      +
> >     ++                parents_to_queue_nr = 0;
> >     ++
> >      +                for (parents = commit->parents;
> >      +                                parents;
> >      +                                parents = parents->next, parent_number++) {
> >     @@ builtin/name-rev.c: static struct rev_name *create_or_update_name(struct commit
> >      -                        distance = name->distance + 1;
> >      +                        if (create_or_update_name(parent, new_name, taggerdate,
> >      +                                                  generation, distance,
> >     -+                                                  from_tag))
> >     -+                                last_new_parent = commit_list_append(parent,
> >     -+                                                  last_new_parent);
> >     ++                                                  from_tag)) {
> >     ++                                ALLOC_GROW(parents_to_queue,
> >     ++                                           parents_to_queue_nr + 1,
> >     ++                                           parents_to_queue_alloc);
> >     ++                                parents_to_queue[parents_to_queue_nr] = parent;
> >     ++                                parents_to_queue_nr++;
> >     ++                        }
> >                       }
> >
> >      -                if (create_or_update_name(parent, new_name, taggerdate,
> >      -                                          generation, distance,
> >      -                                          from_tag))
> >      -                        name_rev(parent, new_name, taggerdate, from_tag);
> >     -+                *last_new_parent = list;
> >     -+                list = new_parents;
> >     ++                /* The first parent must come out first from the prio_queue */
> >     ++                while (parents_to_queue_nr)
> >     ++                        prio_queue_put(&queue,
> >     ++                                       parents_to_queue[--parents_to_queue_nr]);
> >               }
> >     ++
> >     ++        clear_prio_queue(&queue);
> >     ++        free(parents_to_queue);
> >       }
> >
> >     + static int subpath_matches(const char *path, const char *filter)
> >
> >       ## t/t6120-describe.sh ##
> >      @@ t/t6120-describe.sh: test_expect_success 'describe tag object' '
> > 13:  1ef69550ca ! 13:  316f7af43c name-rev: cleanup name_ref()
> >     @@ builtin/name-rev.c: static struct rev_name *create_or_update_name(struct commit
> >      -                int from_tag)
> >      +                int from_tag, int deref)
> >       {
> >     -         struct commit_list *list = NULL;
> >     +         struct prio_queue queue;
> >     +         struct commit *commit;
> >     +         struct commit **parents_to_queue = NULL;
> >     +         size_t parents_to_queue_nr, parents_to_queue_alloc = 0;
> >      +        char *to_free = NULL;
> >      +
> >      +        parse_commit(start_commit);
> >     @@ builtin/name-rev.c: static struct rev_name *create_or_update_name(struct commit
> >      +                return;
> >      +        }
> >
> >     -         commit_list_insert(start_commit, &list);
> >     -
> >     +         memset(&queue, 0, sizeof(queue)); /* Use the prio_queue as LIFO */
> >     +         prio_queue_put(&queue, start_commit);
> >      @@ builtin/name-rev.c: static int name_ref(const char *path, const struct object_id *oid, int flags, vo
> >                       if (taggerdate == TIME_MAX)
> >                               taggerdate = commit->date;
> > 14:  9d513b3092 <  -:  ---------- name-rev: plug a memory leak in name_rev()
> > 15:  8489abb62e <  -:  ---------- name-rev: plug a memory leak in name_rev() in the deref case
> > --
> > 2.24.0.388.gde53c094ea
> >
> >




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux