Re: [PATCH v4 4/6] revision: implement sparse algorithm

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Sat, 15 Dec 2018 00:32:44 +0100

On Fri, Dec 14 2018, Derrick Stolee via GitGitGadget wrote:

> Despite these potential drawbacks, the benefits of the algorithm
> are clear. By adding a counter to 'add_children_by_path' and
> 'mark_tree_contents_uninteresting', I measured the number of
> parsed trees for the two algorithms in a variety of repos.

Ah, so we take one or the other. I tested this with:

    diff --git a/revision.c b/revision.c
    index 63bf6230dc..e9c67aa550 100644
    --- a/revision.c
    +++ b/revision.c
    @@ -61,6 +61,8 @@ static void mark_tree_contents_uninteresting(struct repository *r,
            struct tree_desc desc;
            struct name_entry entry;

    +       fprintf(stderr, "MTCU\n");
    +
            if (parse_tree_gently(tree, 1) < 0)
                    return;

    @@ -166,6 +168,8 @@ static void add_children_by_path(struct repository *r,
            struct tree_desc desc;
            struct name_entry entry;

    +       fprintf(stderr, "ACBP\n");
    +
            if (!tree)
                    return;

And tried testing my pushing this branch to my git.git:

    $ for v in true false; do git push --delete avar push/sparse; ./git -c pack.usesparse=$v --exec-path=$PWD push avar HEAD 2>&1 | grep -e MTCU -e ACBP | uniq -c; done
    To github.com:avar/git.git
     - [deleted]               push/sparse
         22 ACBP
    To github.com:avar/git.git
     - [deleted]               push/sparse
        184 MTCU

But snipping around a bit from the commit message:

>When enumerating objects to place in a pack-file during 'git
>pack-objects --revs', we discover the "frontier" of commits
>that we care about and the boundary with commit we find
>uninteresting. From that point, we walk trees to discover which
>trees and blobs are uninteresting. Finally, we walk trees from the
>interesting commits to find the interesting objects that are
>placed in the pack.
>[...]
>These improvements will have even larger benefits in the super-
>large Windows repository. In our experiments, we see the
>"Enumerate objects" phase of pack-objects taking 60-80% of the
>end-to-end time of non-trivial pushes, taking longer than the
>network time to send the pack and the server time to verify the
>pack.

If I change my monkeypatch to:

    diff --git a/revision.c b/revision.c
    index 63bf6230dc..9171ca27c5 100644
    --- a/revision.c
    +++ b/revision.c
    @@ -63,0 +64,3 @@ static void mark_tree_contents_uninteresting(struct repository *r,
    +       fprintf(stderr, "MTCU\n");
    +       sleep(1);
    +
    @@ -168,0 +172,3 @@ static void add_children_by_path(struct repository *r,
    +       fprintf(stderr, "ACBP\n");
    +       sleep(1);
    +

We spend a long time printing those out before we ever get to
"Enumerating objects".

Which was where I was trying to test this, i.e. is this a lot of work we
perform before we print out the progress bar, and regardless of this
optimization should have other progress output there, so we can see this
time we're spending on this?