Re: Runaway sed memory use in test on older sed+glibc (was "Re: [PATCH v6 1/3] test: add helper functions for git-bundle")

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Thu, 27 May 2021 14:19:04 +0200

On Thu, May 27 2021, Jiang Xin wrote:

> Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> wrote on Thu, May 27, 2021
> at 2:51 AM:
>>
>>
>> On Mon, Jan 11 2021, Jiang Xin wrote:
>>
>> > From: Jiang Xin <zhiyou.jx@xxxxxxxxxxxxxxx>
>> >
>> > Move git-bundle related functions from t5510 to a library, and this lib
>> > will be shared with a new testcase t6020 which finds a known breakage of
>> > "git-bundle".
>> > [...]
>> > +
>> > +# Format the output of git commands to make a user-friendly and stable
>> > +# text.  We can easily prepare the expect text without having to worry
>> > +# about future changes of the commit ID and spaces of the output.
>> > +make_user_friendly_and_stable_output () {
>> > +     sed \
>> > +             -e "s/${A%${A#???????}}[0-9a-f]*/<COMMIT-A>/g" \
>> > +             -e "s/${B%${B#???????}}[0-9a-f]*/<COMMIT-B>/g" \
>> > +             -e "s/${C%${C#???????}}[0-9a-f]*/<COMMIT-C>/g" \
>> > +             -e "s/${D%${D#???????}}[0-9a-f]*/<COMMIT-D>/g" \
>> > +             -e "s/${E%${E#???????}}[0-9a-f]*/<COMMIT-E>/g" \
>> > +             -e "s/${F%${F#???????}}[0-9a-f]*/<COMMIT-F>/g" \
>> > +             -e "s/${G%${G#???????}}[0-9a-f]*/<COMMIT-G>/g" \
>> > +             -e "s/${H%${H#???????}}[0-9a-f]*/<COMMIT-H>/g" \
>> > +             -e "s/${I%${I#???????}}[0-9a-f]*/<COMMIT-I>/g" \
>> > +             -e "s/${J%${J#???????}}[0-9a-f]*/<COMMIT-J>/g" \
>> > +             -e "s/${K%${K#???????}}[0-9a-f]*/<COMMIT-K>/g" \
>> > +             -e "s/${L%${L#???????}}[0-9a-f]*/<COMMIT-L>/g" \
>> > +             -e "s/${M%${M#???????}}[0-9a-f]*/<COMMIT-M>/g" \
>> > +             -e "s/${N%${N#???????}}[0-9a-f]*/<COMMIT-N>/g" \
>> > +             -e "s/${O%${O#???????}}[0-9a-f]*/<COMMIT-O>/g" \
>> > +             -e "s/${P%${P#???????}}[0-9a-f]*/<COMMIT-P>/g" \
>> > +             -e "s/${TAG1%${TAG1#???????}}[0-9a-f]*/<TAG-1>/g" \
>> > +             -e "s/${TAG2%${TAG2#???????}}[0-9a-f]*/<TAG-2>/g" \
>> > +             -e "s/${TAG3%${TAG3#???????}}[0-9a-f]*/<TAG-3>/g" \
>> > +             -e "s/ *\$//"
>> > +}
>>
>> On one of the gcc farm boxes, an i386 box (gcc45), this fails because sed
>> gets killed after >500MB of memory use (I was just eyeballing it in
>> htop) on the "create bundle from special rev: main^!" test. This is with
>> GNU sed 4.2.2.
>>
>> I suspect this regex pattern creates some runaway behavior in sed that's
>> since been fixed (or maybe it's the glibc regex engine?). The glibc is
>> 2.19-18+deb8u10:
>>
>>     + git bundle list-heads special-rev.bdl
>>     + make_user_friendly_and_stable_output
>>     + sed -e s/[0-9a-f]*/<COMMIT-A>/g -e s/[0-9a-f]*/<COMMIT-B>/g -e
>> s/[0-9a-f]*/<COMMIT-C>/g -e s/[0-9a-f]*/<COMMIT-D>/g -e
>> s/[0-9a-f]*/<COMMIT-E>/g -e s/[0-9a-f]*/<COMMIT-F>/g -e
>> s/[0-9a-f]*/<COMMIT-G>/g -e s/[0-9a-f]*/<COMMIT-H>/g -e
>> s/[0-9a-f]*/<COMMIT-I>/g -e s/[0-9a-f]*/<COMMIT-J>/g -e
>> s/[0-9a-f]*/<COMMIT-K>/g -e s/[0-9a-f]*/<COMMIT-L>/g -e
>> s/[0-9a-f]*/<COMMIT-M>/g -e s/[0-9a-f]*/<COMMIT-N>/g -e
>> s/[0-9a-f]*/<COMMIT-O>/g -e s/[0-9a-f]*/<COMMIT-P>/g -e
>> s/[0-9a-f]*/<TAG-1>/g -e s/[0-9a-f]*/<TAG-2>/g -e
>> s/[0-9a-f]*/<TAG-3>/g -e s/ *$//
>>     sed: couldn't re-allocate memory
>
> I wrote a program on macOS to check memory footprint for sed and perl.
> See:
>
>     https://github.com/jiangxin/compare-sed-perl

Interesting use of Go as a /usr/bin/time -v replacement :)

After changing your int64 to int32 and digging up how to cross-compile
Go, I get results similar to yours. That's because your test has actual
short SHA-1s in its "-e 's///g'" expressions, whereas in the trace I
posted they are e.g. "s/[0-9a-f]*/<COMMIT-A>/g", with no SHA-1 prefix.

That's the problem, so that Go program won't reproduce it. Anyway,
changing the test to write to "input" first and running this shows it:

    avar@gcc45:/run/user/1632/git/t/trash directory.t6020-bundle-misc$ /usr/bin/time -v sed -e 's/[0-9a-f]*/<COMMIT-A>/g' -e 's/[0-9a-f]*/<COMMIT-B>/g' -e 's/[0-9a-f]*/<COMMIT-C>/g' -e 's/[0-9a-f]*/<COMMIT-D>/g' -e 's/[0-9a-f]*/<COMMIT-E>/g' -e 's/[0-9a-f]*/<COMMIT-F>/g' -e 's/[0-9a-f]*/<COMMIT-G>/g' -e 's/[0-9a-f]*/<COMMIT-H>/g' -e 's/[0-9a-f]*/<COMMIT-I>/g' -e 's/[0-9a-f]*/<COMMIT-J>/g' -e 's/[0-9a-f]*/<COMMIT-K>/g' -e 's/[0-9a-f]*/<COMMIT-L>/g' -e 's/[0-9a-f]*/<COMMIT-M>/g' -e 's/[0-9a-f]*/<COMMIT-N>/g' -e 's/[0-9a-f]*/<COMMIT-O>/g' -e 's/[0-9a-f]*/<COMMIT-P>/g' -e 's/[0-9a-f]*/<TAG-1>/g' -e 's/[0-9a-f]*/<TAG-2>/g' -e 's/[0-9a-f]*/<TAG-3>/g' -e 's/ *$//' <input
    sed: couldn't re-allocate memory
    Command exited with non-zero status 4
            Command being timed: "sed -e s/[0-9a-f]*/<COMMIT-A>/g -e s/[0-9a-f]*/<COMMIT-B>/g -e s/[0-9a-f]*/<COMMIT-C>/g -e s/[0-9a-f]*/<COMMIT-D>/g -e s/[0-9a-f]*/<COMMIT-E>/g -e s/[0-9a-f]*/<COMMIT-F>/g -e s/[0-9a-f]*/<COMMIT-G>/g -e s/[0-9a-f]*/<COMMIT-H>/g -e s/[0-9a-f]*/<COMMIT-I>/g -e s/[0-9a-f]*/<COMMIT-J>/g -e s/[0-9a-f]*/<COMMIT-K>/g -e s/[0-9a-f]*/<COMMIT-L>/g -e s/[0-9a-f]*/<COMMIT-M>/g -e s/[0-9a-f]*/<COMMIT-N>/g -e s/[0-9a-f]*/<COMMIT-O>/g -e s/[0-9a-f]*/<COMMIT-P>/g -e s/[0-9a-f]*/<TAG-1>/g -e s/[0-9a-f]*/<TAG-2>/g -e s/[0-9a-f]*/<TAG-3>/g -e s/ *$//"
            User time (seconds): 130.00
            System time (seconds): 2.42
            Percent of CPU this job got: 100%
            Elapsed (wall clock) time (h:mm:ss or m:ss): 2:12.41
            Average shared text size (kbytes): 0
            Average unshared data size (kbytes): 0
            Average stack size (kbytes): 0
            Average total size (kbytes): 0
            Maximum resident set size (kbytes): 1030968
            Average resident set size (kbytes): 0
            Major (requiring I/O) page faults: 0
            Minor (reclaiming a frame) page faults: 257333
            Voluntary context switches: 1
            Involuntary context switches: 12578
            Swaps: 0
            File system inputs: 0
            File system outputs: 0
            Socket messages sent: 0
            Socket messages received: 0
            Signals delivered: 0
            Page size (bytes): 4096
            Exit status: 4

But no, as it turns out the issue is not Perl vs. sed. There's some bug
in the shell script / tooling version (it happens with both dash 0.5.7-4
and bash 4.3-11+deb8u2 on that box) where expansions like
${A%${A#??????0?}} resolve to nothing.
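For reference, here is a minimal sketch of what the nested expansion in
the helper is meant to do. The function name short7 is made up for
illustration and is not part of the test library. It also shows the
degenerate case: if the variable is empty (or shorter than seven
characters), the expansion yields nothing, and the sed pattern is left
as a bare "[0-9a-f]*".

```shell
# Hypothetical helper: print the first seven characters of its argument
# using only POSIX parameter expansion, as the sed expressions above do.
short7 () {
	oid=$1
	# ${oid#???????} strips the first seven characters, leaving the tail.
	# ${oid%<tail>} then strips that tail, leaving the 7-character prefix.
	# If $oid has fewer than seven characters the inner pattern does not
	# match, the "tail" is the whole string, and the result is empty.
	printf '%s\n' "${oid%${oid#???????}}"
}
```

So "s/$(short7 "$A")[0-9a-f]*/<COMMIT-A>/g" only anchors on a commit
prefix when $A actually holds an object ID.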

So if we make that:

        cat >input &&
        cat input >&2 &&
        sed -e "s/${A%${A#??????0?}}[0-9a-f]*/<COMMIT-A>/g" <input >input.tmp && mv input.tmp input &&
        cat input >&2 &&
        sed -e "s/${B%${B#???????}}[0-9a-f]*/<COMMIT-B>/g" <input >input.tmp && mv input.tmp input &&
        cat input >&2 &&

We get things like:

    + sed -e s/[0-9a-f]*/<COMMIT-A>/g
    + mv input.tmp input
    + cat input
    <COMMIT-A> <COMMIT-A>r<COMMIT-A>s<COMMIT-A>/<COMMIT-A>h<COMMIT-A>s<COMMIT-A>/<COMMIT-A>m<COMMIT-A>i<COMMIT-A>n<COMMIT-A>
    + sed -e s/[0-9a-f]*/<COMMIT-B>/g
    + mv input.tmp input
    + cat input
    <COMMIT-B><<COMMIT-B>C<COMMIT-B>O<COMMIT-B>M<COMMIT-B>M<COMMIT-B>I<COMMIT-B>T<COMMIT-B>-<COMMIT-B>A<COMMIT-B>><COMMIT-B> <COMMIT-B><<COMMIT-B>C<COMMIT-B>O<COMMIT-B>M<COMMIT-B>M<COMMIT-B>I<COMMIT-B>T<COMMIT-B>-<COMMIT-B>A<COMMIT-B>><COMMIT-B>r<COMMIT-B><<COMMIT-B>C<COMMIT-B>O<COMMIT-B>M<COMMIT-B>M<COMMIT-B>I<COMMIT-B>T<COMMIT-B>-<COMMIT-B>A<COMMIT-B>><COMMIT-B>s<COMMIT-B><<COMMIT-B>C<COMMIT-B>O<COMMIT-B>M<COMMIT-B>M<COMMIT-B>I<COMMIT-B>T<COMMIT-B>-<COMMIT-B>A<COMMIT-B>><COMMIT-B>/<COMMIT-B><<COMMIT-B>C<COMMIT-B>O<COMMIT-B>M<COMMIT-B>M<COMMIT-B>I<COMMIT-B>T<COMMIT-B>-<COMMIT-B>A<COMMIT-B>><COMMIT-B>h<COMMIT-B><<COMMIT-B>C<COMMIT-B>O<COMMIT-B>M<COMMIT-B>M<COMMIT-B>I<COMMIT-B>T<COMMIT-B>-<COMMIT-B>A<COMMIT-B>><COMMIT-B>s<COMMIT-B><<COMMIT-B>C<COMMIT-B>O<COMMIT-B>M<COMMIT-B>M<COMMIT-B>I<COMMIT-B>T<COMMIT-B>-<COMMIT-B>A<COMMIT-B>><COMMIT-B>/<COMMIT-B><<COMMIT-B>C<COMMIT-B>O<COMMIT-B>M<COMMIT-B>M<COMMIT-B>I<COMMIT-B>T<COMMIT-B>-<COMMIT-B>A<COMMIT-B>><COMMIT-B>m<COMMIT-B><<COMMIT-B>C<COMMIT-B>O<COMMIT-B>M<COMMIT-B>M<COMMIT-B>I<COMMIT-B>T<COMMIT-B>-<COMMIT-B>A<COMMIT-B>><COMMIT-B>i<COMMIT-B><<COMMIT-B>C<COMMIT-B>O<COMMIT-B>M<COMMIT-B>M<COMMIT-B>I<COMMIT-B>T<COMMIT-B>-<COMMIT-B>A<COMMIT-B>><COMMIT-B>n<COMMIT-B><<COMMIT-B>C<COMMIT-B>O<COMMIT-B>M<COMMIT-B>M<COMMIT-B>I<COMMIT-B>T<COMMIT-B>-<COMMIT-B>A<COMMIT-B>><COMMIT-B>
    [...]

etc. I.e. it's the sed expression itself that's the issue, and you
should be able to reproduce this locally with something like:

    echo 0 | sed -e 's/[0-9]*/<BEGIN>0<END>/g' -e 's/[0-9]*/<BEGIN>0<END>/g' -e 's/[0-9]*/<BEGIN>0<END>/g' -e 's/[0-9]*/<BEGIN>0<END>/g' -e 's/[0-9]*/<BEGIN>0<END>/g' -e 's/[0-9]*/<BEGIN>0<END>/g' -e 's/[0-9]*/<BEGIN>0<END>/g' -e 's/[0-9]*/<BEGIN>0<END>/g'

If not, just copy the -e a few more times.
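To watch the growth without exhausting memory, something like the
following sketch may help. The function name and the "<T>" placeholder
are arbitrary choices for illustration. Since "[0-9a-f]*" also matches
the empty string, each pass inserts the replacement at nearly every
position, so the line length is multiplied on every pass rather than
staying bounded.

```shell
# Hypothetical helper: apply the degenerate substitution $1 times to the
# line given as $2, then print the resulting length in characters.
length_after_passes () {
	passes=$1
	line=$2
	while [ "$passes" -gt 0 ]
	do
		# The pattern can match the empty string, so with /g the
		# replacement is also inserted between non-hex characters.
		line=$(printf '%s\n' "$line" | sed -e 's/[0-9a-f]*/<T>/g')
		passes=$((passes - 1))
	done
	printf '%s\n' "${#line}"
}
```

Calling it with 1, 2 and 3 passes on a short line such as
"refs/heads/main" should show the length climbing steeply, which is
why chaining 19 such expressions runs out of memory.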

Anyway, looking at this whole test file with fresh eyes, this pattern
seems very strange. You duplicated most of test_commit with this
test_commit_setvar. It's a bit more verbose, but why not just use:

    test_commit ...
    A=$(git rev-parse HEAD)

Or teach test_commit a --rev-parse option or something and:

    A=$(test_commit ...)
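As a rough sketch of the first variant outside the test suite, using
only plain git commands: the helper name record_commit is hypothetical,
and it uses "git commit --allow-empty" as a stand-in for test_commit,
which is an assumption made to keep the example self-contained.

```shell
# Hypothetical helper, not part of git's test-lib: create a commit with
# plain git and store its full object ID in the variable named by $1.
record_commit () {
	var=$1
	msg=$2
	# --allow-empty stands in for the file changes test_commit would make.
	git commit -q --allow-empty -m "$msg" &&
	# Assign the new HEAD's object ID to the caller-chosen variable.
	eval "$var=\$(git rev-parse HEAD)"
}
```

Inside a repository, "record_commit A 'Commit A'" would leave the full
object ID in $A, which can then be interpolated directly into the
expected output.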

This make_user_friendly_and_stable_output then actually loses
information, e.g. sometimes the bundle output you're testing emits
trailing spaces, but the normalization function overzealously trims them.

I think this whole thing would be much simpler with the above and then
something like:

    @@ -146,7 +126,8 @@ test_expect_success 'setup' '

            # branch main: merge commit I & J
            git checkout main &&
    -       test_commit_setvar --merge I topic/1 "Merge commit I" &&
    +       git merge --no-edit --no-ff -m"Merge commit I" topic/1 &&
    +       I=$(git rev-parse HEAD) &&
            test_commit_setvar --merge J refs/pull/2/head "Merge commit J" &&

            # branch main: commit K
    @@ -172,18 +153,18 @@ test_expect_success 'create bundle from special rev: main^!' '

            git bundle list-heads special-rev.bdl |
                    make_user_friendly_and_stable_output >actual &&
    -       cat >expect <<-\EOF &&
    -       <COMMIT-P> refs/heads/main
    +       cat >expect <<-EOF &&
    +       $P refs/heads/main
            EOF
            test_cmp expect actual &&

Or just add a --merge option to test_commit itself.