Re: [PATCH v2 00/13] name-rev: eliminate recursion

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi,

[Cc:ing Sebastian, as they indicated in
https://public-inbox.org/git/CAE7Eq9hEiVf1rMNdWx55_nQsz2gVv0N%2Bs1KckK1evtmruqcHyA@xxxxxxxxxxxxxx/t/#u
that they would be interested in testing this]

Sebastian, could you test this patch series? Since you are on Windows,
you should be able to do so by

- installing Git for Windows' SDK
  (https://gitforwindows.org/#download-sdk)
- `sdk cd git` (possibly `sdk init git`, although that should be
  implied)
- `sdk init build-extra` followed by
  `/usr/src/build-extra/apply-from-public-inbox.sh
  https://public-inbox.org/git/20191112103821.30265-1-szeder.dev@xxxxxxxxx/`
- `sdk build`

The result should include a `git.exe` in `/usr/src/git/` that you can
copy to your server and test via `/path/to/git.exe name-rev ...`.

Ciao,
Johannes

On Tue, 12 Nov 2019, SZEDER Gábor wrote:

> 'git name-rev' is implemented using a recursive algorithm, and,
> consequently, it can segfault in deep histories (e.g. WebKit), and
> thanks to a test case demonstrating this limitation every test run
> results in a dmesg entry logging the segfaulting git process.
>
> This patch series eliminates the recursion.
>
> Patches 1-5 are while-at-it cleanups I noticed on the way, and patch 6
> improves test coverage.  Patches 7-11 are preparatory refactorings
> that are supposed to make this series easier to follow, and make patch
> 12, the one finally eliminating the recursion, somewhat shorter, and
> even much shorter when viewed with '--ignore-all-space'.  Patch 13
> cleans up after those preparatory steps.
>
> Changes since v1:
>
>   - Patch 12 now eliminates the recursion using a LIFO 'prio_queue'
>     instead of a 'commit_list' to avoid any performance penalty.
>
>   - Commit message updates, clarifications, typofixes, missing
>     signoffs, etc., most notably in patches 6 and 12.
>
>   - Updated ASCII art history graphs.
>
>   - Replaced the strbuf_suffix() cleanup in patch 3 with René's
>     suggestion; now that patch needs his signoff.
>
>   - Dropped the last two patches plugging memory leaks; René's plan
>     to clean up memory ownership looked more promising, and that
>     would make these two dropped patches moot anyway.
>
> v1: https://public-inbox.org/git/20190919214712.7348-1-szeder.dev@xxxxxxxxx/T/#u
>
> René Scharfe (1):
>   name-rev: use strbuf_strip_suffix() in get_rev_name()
>
> SZEDER Gábor (12):
>   t6120-describe: correct test repo history graph in comment
>   t6120-describe: modernize the 'check_describe' helper
>   name-rev: avoid unnecessary cast in name_ref()
>   name-rev: use sizeof(*ptr) instead of sizeof(type) in allocation
>   t6120: add a test to cover inner conditions in 'git name-rev's
>     name_rev()
>   name-rev: extract creating/updating a 'struct name_rev' into a helper
>   name-rev: pull out deref handling from the recursion
>   name-rev: restructure parsing commits and applying date cutoff
>   name-rev: restructure creating/updating 'struct rev_name' instances
>   name-rev: drop name_rev()'s 'generation' and 'distance' parameters
>   name-rev: eliminate recursion in name_rev()
>   name-rev: cleanup name_ref()
>
>  builtin/name-rev.c  | 147 +++++++++++++++++++++++++++++---------------
>  t/t6120-describe.sh |  72 +++++++++++++++++-----
>  2 files changed, 153 insertions(+), 66 deletions(-)
>
> Range-diff:
>  1:  673da20e3d !  1:  8d70ed050d t6120-describe: correct test repo history graph in comment
>     @@ t/t6120-describe.sh
>      -test_description='test describe
>      +test_description='test describe'
>      +
>     -+#       ,---o----o----o-----.
>     -+#      /   D,R   e           \
>     -+#  o--o-----o-------------o---o----x
>     -+#      \    B            /
>     -+#       `---o----o----o-'
>     -+#                A    c
>     ++#  o---o-----o----o----o-------o----x
>     ++#       \   D,R   e           /
>     ++#        \---o-------------o-'
>     ++#         \  B            /
>     ++#          `-o----o----o-'
>     ++#                 A    c
>     ++#
>     ++# First parent of a merge commit is on the same line, second parent below.
>
>      -                       B
>      -        .--------------o----o----o----x
>  2:  05df899693 =  2:  3720b6859d t6120-describe: modernize the 'check_describe' helper
>  3:  7b0227cfea !  3:  ad2f2eee68 name-rev: use strip_suffix() in get_rev_name()
>     @@
>       ## Metadata ##
>     -Author: SZEDER Gábor <szeder.dev@xxxxxxxxx>
>     +Author: René Scharfe <l.s.r@xxxxxx>
>
>       ## Commit message ##
>     -    name-rev: use strip_suffix() in get_rev_name()
>     +    name-rev: use strbuf_strip_suffix() in get_rev_name()
>
>     -    Use strip_suffix() instead of open-coding it, making the code more
>     -    idiomatic.
>     +    get_name_rev() basically open-codes strip_suffix() before adding a
>     +    string to a strbuf.
>
>     +    Let's use the strbuf right from the beginning, i.e. add the whole
>     +    string to the strbuf and then use strbuf_strip_suffix(), making the
>     +    code more idiomatic.
>     +
>     +    [TODO: René's signoff!]
>          Signed-off-by: SZEDER Gábor <szeder.dev@xxxxxxxxx>
>
>       ## builtin/name-rev.c ##
>     @@ builtin/name-rev.c: static const char *get_rev_name(const struct object *o, stru
>      -		int len = strlen(n->tip_name);
>      -		if (len > 2 && !strcmp(n->tip_name + len - 2, "^0"))
>      -			len -= 2;
>     -+		size_t len;
>     -+		strip_suffix(n->tip_name, "^0", &len);
>       		strbuf_reset(buf);
>      -		strbuf_addf(buf, "%.*s~%d", len, n->tip_name, n->generation);
>     -+		strbuf_addf(buf, "%.*s~%d", (int) len, n->tip_name,
>     -+			    n->generation);
>     ++		strbuf_addstr(buf, n->tip_name);
>     ++		strbuf_strip_suffix(buf, "^0");
>     ++		strbuf_addf(buf, "~%d", n->generation);
>       		return buf->buf;
>       	}
>       }
>  4:  40faecdc2a =  4:  c86a2ae2d0 name-rev: avoid unnecessary cast in name_ref()
>  5:  c71df3dadf =  5:  4fc960cc05 name-rev: use sizeof(*ptr) instead of sizeof(type) in allocation
>  6:  1dcb76072f !  6:  1493cb4484 t6120: add a test to cover inner conditions in 'git name-rev's name_rev()
>     @@ Commit message
>          looks like this:
>
>            if (parent_number > 1) {
>     -        if (generation > 0)
>     -          // do stuff #1
>     -        else
>     -          // do stuff #2
>     +          if (generation > 0)
>     +              // branch #1
>     +              new_name = ...
>     +          else
>     +              // branch #2
>     +              new_name = ...
>     +          name_rev(parent, new_name, ...);
>            } else {
>     -         // do stuff #3
>     +          // branch #3
>     +          name_rev(...);
>            }
>
>          These conditions are not covered properly in the test suite.  As far
>     @@ Commit message
>          command's output, because the repository used in that test script
>          contains several branches and tags pointing somewhere into the middle
>          of the commit DAG, and thus result in a better name for the
>     -    to-be-named commit.  In an early version of this patch series I
>     -    managed to mess up those conditions (every single one of them at
>     -    once!), but the whole test suite still passed successfully.
>     +    to-be-named commit.  This can hide bugs: e.g. by replacing the
>     +    'new_name' parameter of the first recursive name_rev() call with
>     +    'tip_name' (effectively making both branch #1 and #2 a noop) 'git
>     +    name-rev --all' shows thousands of bogus names in the Git repository,
>     +    but the whole test suite still passes successfully.  In an early
>     +    version of a later patch in this series I managed to mess up all three
>     +    branches (at once!), but the test suite still passed.
>
>          So add a new test case that operates on the following history:
>
>     -        -----------master
>     -       /          /
>     -      A----------M2
>     -       \        /
>     -        \---M1-C
>     -         \ /
>     -          B
>     +      A--------------master
>     +       \            /
>     +        \----------M2
>     +         \        /
>     +          \---M1-C
>     +           \ /
>     +            B
>
>     -    and names the commit 'B', where:
>     +    and names the commit 'B' to make sure that all three branches are
>     +    crucial to determine 'B's name:
>
>     -      - The merge commit at master makes sure that the 'do stuff #3'
>     -        affects the final name.
>     +      - There is only a single ref, so all names are based on 'master',
>     +        without any undesired interference from other refs.
>
>     -      - The merge commit M2 make sure that the 'do stuff #1' part
>     -        affects the final name.
>     +      - Each time name_rev() follows the second parent of a merge commit,
>     +        it appends "^2" to the name.  Following 'master's second parent
>     +        right at the start ensures that all commits on the ancestry path
>     +        from 'master' to 'B' have a different base name from the original
>     +        'tip_name' of the very first name_rev() invocation.  Currently,
>     +        while name_rev() is recursive, it doesn't matter, but it will be
>     +        necessary to properly cover all three branches after the recursion
>     +        is eliminated later in this series.
>
>     -      - And M1 makes sure that the 'do stuff #2' part affects the final
>     -        name.
>     +      - Following 'M2's second parent makes sure that branch #2 (i.e. when
>     +        'generation = 0') affects 'B's name.
>     +
>     +      - Following the only parent of the non-merge commit 'C' ensures that
>     +        branch #3 affects 'B's name, and that it increments 'generation'.
>     +
>     +      - Coming from 'C' 'generation' is 1, thus following 'M1's second
>     +        parent makes sure that branch #1 affects 'B's name.
>
>          Signed-off-by: SZEDER Gábor <szeder.dev@xxxxxxxxx>
>
>       ## t/t6120-describe.sh ##
>     -@@ t/t6120-describe.sh: test_expect_success 'describe complains about missing object' '
>     - 	test_must_fail git describe $ZERO_OID
>     +@@ t/t6120-describe.sh: test_expect_success 'name-rev a rev shortly after epoch' '
>     + 	test_cmp expect actual
>       '
>
>     -+#   -----------master
>     -+#  /          /
>     -+# A----------M2
>     -+#  \        /
>     -+#   \---M1-C
>     -+#    \ /
>     -+#     B
>     -+test_expect_success 'test' '
>     ++# A--------------master
>     ++#  \            /
>     ++#   \----------M2
>     ++#    \        /
>     ++#     \---M1-C
>     ++#      \ /
>     ++#       B
>     ++test_expect_success 'name-rev covers all conditions while looking at parents' '
>      +	git init repo &&
>      +	(
>      +		cd repo &&
>     @@ t/t6120-describe.sh: test_expect_success 'describe complains about missing objec
>      +		git checkout master &&
>      +		git merge --no-ff HEAD@{1} &&
>      +
>     -+		git log --graph --oneline &&
>     -+
>      +		echo "$B master^2^2~1^2" >expect &&
>      +		git name-rev $B >actual &&
>      +
>  7:  bdd8378b06 =  7:  fc842e578b name-rev: extract creating/updating a 'struct name_rev' into a helper
>  8:  ce21c351f9 !  8:  7f182503e2 name-rev: pull out deref handling from the recursion
>     @@ Commit message
>          Extract this condition from the recursion into name_rev()'s caller and
>          drop the function's 'deref' parameter.  This makes eliminating the
>          recursion a bit easier to follow, and it will be moved back into
>     -    name_rev() after the recursion is elminated.
>     +    name_rev() after the recursion is eliminated.
>
>          Furthermore, drop the condition that die()s when both 'deref' and
>          'generation' are non-null (which should have been a BUG() to begin
>     @@ Commit message
>
>          Note that this change reintroduces the memory leak that was plugged in
>          in commit 5308224633 (name-rev: avoid leaking memory in the `deref`
>     -    case, 2017-05-04), but a later patch in this series will plug it in
>     -    again.
>     +    case, 2017-05-04), but a later patch (name-rev: restructure
>     +    creating/updating 'struct rev_name' instances) in this series will
>     +    plug it in again.
>
>          Signed-off-by: SZEDER Gábor <szeder.dev@xxxxxxxxx>
>
>  9:  c8acc6b597 !  9:  0cdd40b75b name-rev: restructure parsing commits and applying date cutoff
>     @@ Commit message
>          name_rev() and name_rev() itself as it iterates over the parent
>          commits.
>
>     -    This makes eliminating the recursion a bit easier to follow, and it
>     -    will be moved back to name_rev() after the recursion is eliminated.
>     +    This makes eliminating the recursion a bit easier to follow, and the
>     +    condition moved to name_ref() will be moved back to name_rev() after
>     +    the recursion is eliminated.
>
>          Signed-off-by: SZEDER Gábor <szeder.dev@xxxxxxxxx>
>
> 10:  c731f27158 ! 10:  e1733e3c56 name-rev: restructure creating/updating 'struct rev_name' instances
>     @@ Commit message
>          At the beginning of the recursive name_rev() function it creates a new
>          'struct rev_name' instance for each previously unvisited commit or, if
>          this visit results in better name for an already visited commit, then
>     -    updates the 'struct rev_name' instance attached to to the commit, or
>     +    updates the 'struct rev_name' instance attached to the commit, or
>          returns early.
>
>          Restructure this so it's caller creates or updates the 'struct
>     @@ Commit message
>          parameter, i.e. both name_ref() before calling name_rev() and
>          name_rev() itself as it iterates over the parent commits.
>
>     -    This makes eliminating the recursion a bit easier to follow, and it
>     -    will be moved back to name_rev() after the recursion is eliminated.
>     +    This makes eliminating the recursion a bit easier to follow, and the
>     +    condition moved to name_ref() will be moved back to name_rev() after
>     +    the recursion is eliminated.
>
>          This change also plugs the memory leak that was temporarily unplugged
>          in the earlier "name-rev: pull out deref handling from the recursion"
> 11:  ba14bde230 ! 11:  bd6e2e6d87 name-rev: drop name_rev()'s 'generation' and 'distance' parameters
>     @@ Commit message
>          'taggerdate' and 'from_tag' parameters as well, but those parameters
>          will be necessary later, after the recursion is eliminated.
>
>     -    Drop name_rev()'s 'generation' and 'distance' parameters.
>     +    Signed-off-by: SZEDER Gábor <szeder.dev@xxxxxxxxx>
>
>       ## builtin/name-rev.c ##
>      @@ builtin/name-rev.c: static struct rev_name *create_or_update_name(struct commit *commit,
> 12:  2d03ac11f3 ! 12:  0cf63c6d64 name-rev: eliminate recursion in name_rev()
>     @@ Commit message
>          segfault when processing a deep history if it exhausts the available
>          stack space.  E.g. running 'git name-rev --all' and 'git name-rev
>          HEAD~100000' in the gcc, gecko-dev, llvm, and WebKit repositories
>     -    results in segfaults on my machine.
>     +    results in segfaults on my machine ('ulimit -s' reports 8192kB of
>     +    stack size limit), and nowadays the former segfaults in the Linux repo
>     +    as well (it reached the necessasry depth sometime between v5.3-rc4 and
>     +    -rc5).
>
>          Eliminate the recursion by inserting the interesting parents into a
>     -    'commit_list' and iteratating until the list becomes empty.
>     +    LIFO 'prio_queue' [1] and iterating until the queue becomes empty.
>
>     -    Note that the order in which the parent commits are added to that list
>     -    is important: they must be inserted at the beginning of the list, and
>     -    their relative order must be kept as well, because otherwise
>     -    performance suffers.
>     +    Note that the parent commits must be added in reverse order to the
>     +    LIFO 'prio_queue', so their relative order is preserved during
>     +    processing, i.e. the first parent should come out first from the
>     +    queue, because otherwise performance greatly suffers on mergy
>     +    histories [2].
>
>          The stacksize-limited test 'name-rev works in a deep repo' in
>          't6120-describe.sh' demonstrated this issue and expected failure.  Now
>     -    the recursion is gone, so flip it to expect success.
>     -
>     -    Also gone are the dmesg entries logging the segfault of the git
>     -    process on every execution of the test suite.
>     -
>     -    Unfortunately, eliminating the recursion comes with a performance
>     -    penaly: 'git name-rev --all' tends to be between 15-20% slower than
>     -    before.
>     +    the recursion is gone, so flip it to expect success.  Also gone are
>     +    the dmesg entries logging the segfault of that segfaulting 'git
>     +    name-rev' process on every execution of the test suite.
>
>          Note that this slightly changes the order of lines in the output of
>          'git name-rev --all', usually swapping two lines every 35 lines in
>     @@ Commit message
>
>          This patch is best viewed with '--ignore-all-space'.
>
>     +    [1] Early versions of this patch used a 'commit_list', resulting in
>     +        ~15% performance penalty for 'git name-rev --all' in 'linux.git',
>     +        presumably because of the memory allocation and release for each
>     +        insertion and removal. Using a LIFO 'prio_queue' has basically no
>     +        effect on performance.
>     +
>     +    [2] We prefer shorter names, i.e. 'v0.1~234' is preferred over
>     +        'v0.1^2~5', meaning that usually following the first parent of a
>     +        merge results in the best name for its ancestors.  So when later
>     +        we follow the remaining parent(s) of a merge, and reach an already
>     +        named commit, then we usually find that we can't give that commit
>     +        a better name, and thus we don't have to visit any of its
>     +        ancestors again.
>     +
>     +        OTOH, if we were to follow the Nth parent of the merge first, then
>     +        the name of all its ancestors would include a corresponding '^N'.
>     +        Those are not the best names for those commits, so when later we
>     +        reach an already named commit following the first parent of that
>     +        merge, then we would have to update the name of that commit and
>     +        the names of all of its ancestors as well.  Consequently, we would
>     +        have to visit many commits several times, resulting in a
>     +        significant slowdown.
>     +
>          Signed-off-by: SZEDER Gábor <szeder.dev@xxxxxxxxx>
>
>       ## builtin/name-rev.c ##
>     +@@
>     + #include "tag.h"
>     + #include "refs.h"
>     + #include "parse-options.h"
>     ++#include "prio-queue.h"
>     + #include "sha1-lookup.h"
>     + #include "commit-slab.h"
>     +
>      @@ builtin/name-rev.c: static struct rev_name *create_or_update_name(struct commit *commit,
>       		return NULL;
>       }
>     @@ builtin/name-rev.c: static struct rev_name *create_or_update_name(struct commit
>      -		parse_commit(parent);
>      -		if (parent->date < cutoff)
>      -			continue;
>     -+	struct commit_list *list = NULL;
>     ++	struct prio_queue queue;
>     ++	struct commit *commit;
>     ++	struct commit **parents_to_queue = NULL;
>     ++	size_t parents_to_queue_nr, parents_to_queue_alloc = 0;
>      +
>     -+	commit_list_insert(start_commit, &list);
>     ++	memset(&queue, 0, sizeof(queue)); /* Use the prio_queue as LIFO */
>     ++	prio_queue_put(&queue, start_commit);
>      +
>     -+	while (list) {
>     -+		struct commit *commit = pop_commit(&list);
>     ++	while ((commit = prio_queue_get(&queue))) {
>      +		struct rev_name *name = get_commit_rev_name(commit);
>     -+		struct commit_list *parents, *new_parents = NULL;
>     -+		struct commit_list **last_new_parent = &new_parents;
>     ++		struct commit_list *parents;
>      +		int parent_number = 1;
>      +
>     ++		parents_to_queue_nr = 0;
>     ++
>      +		for (parents = commit->parents;
>      +				parents;
>      +				parents = parents->next, parent_number++) {
>     @@ builtin/name-rev.c: static struct rev_name *create_or_update_name(struct commit
>      -			distance = name->distance + 1;
>      +			if (create_or_update_name(parent, new_name, taggerdate,
>      +						  generation, distance,
>     -+						  from_tag))
>     -+				last_new_parent = commit_list_append(parent,
>     -+						  last_new_parent);
>     ++						  from_tag)) {
>     ++				ALLOC_GROW(parents_to_queue,
>     ++					   parents_to_queue_nr + 1,
>     ++					   parents_to_queue_alloc);
>     ++				parents_to_queue[parents_to_queue_nr] = parent;
>     ++				parents_to_queue_nr++;
>     ++			}
>       		}
>
>      -		if (create_or_update_name(parent, new_name, taggerdate,
>      -					  generation, distance,
>      -					  from_tag))
>      -			name_rev(parent, new_name, taggerdate, from_tag);
>     -+		*last_new_parent = list;
>     -+		list = new_parents;
>     ++		/* The first parent must come out first from the prio_queue */
>     ++		while (parents_to_queue_nr)
>     ++			prio_queue_put(&queue,
>     ++				       parents_to_queue[--parents_to_queue_nr]);
>       	}
>     ++
>     ++	clear_prio_queue(&queue);
>     ++	free(parents_to_queue);
>       }
>
>     + static int subpath_matches(const char *path, const char *filter)
>
>       ## t/t6120-describe.sh ##
>      @@ t/t6120-describe.sh: test_expect_success 'describe tag object' '
> 13:  1ef69550ca ! 13:  316f7af43c name-rev: cleanup name_ref()
>     @@ builtin/name-rev.c: static struct rev_name *create_or_update_name(struct commit
>      -		int from_tag)
>      +		int from_tag, int deref)
>       {
>     - 	struct commit_list *list = NULL;
>     + 	struct prio_queue queue;
>     + 	struct commit *commit;
>     + 	struct commit **parents_to_queue = NULL;
>     + 	size_t parents_to_queue_nr, parents_to_queue_alloc = 0;
>      +	char *to_free = NULL;
>      +
>      +	parse_commit(start_commit);
>     @@ builtin/name-rev.c: static struct rev_name *create_or_update_name(struct commit
>      +		return;
>      +	}
>
>     - 	commit_list_insert(start_commit, &list);
>     -
>     + 	memset(&queue, 0, sizeof(queue)); /* Use the prio_queue as LIFO */
>     + 	prio_queue_put(&queue, start_commit);
>      @@ builtin/name-rev.c: static int name_ref(const char *path, const struct object_id *oid, int flags, vo
>       		if (taggerdate == TIME_MAX)
>       			taggerdate = commit->date;
> 14:  9d513b3092 <  -:  ---------- name-rev: plug a memory leak in name_rev()
> 15:  8489abb62e <  -:  ---------- name-rev: plug a memory leak in name_rev() in the deref case
> --
> 2.24.0.388.gde53c094ea
>
>

[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux