Re: git format-patch escaping issues in the patch format

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Nov 4, 2024, at 20:24, Christoph Anton Mitterer wrote:
> Hey there.
>
> For some special use case, I wanted to write a parser for the patch
> format created by git format-patch, especially where I can separate
> headers, commit message and the actual unified diffs.
>
> There seems unfortunately only little (written) definition of that
> format, git-format-patch(1) merely says it's in UNIX mailbox format
> (which itself is, AFAIK, not really formally defined).

It seems to me (totally naïve) that you would do something like

1. Blank line terminates headers
2. Then there might be some optional commit-headers which can override
   things (`From`)
3. Commit message
4. `---`
5. Look for a regex `^diff` line
   • Now the indentation will tell you when it ends
6. `^Range-diff` and `^Interdiff` can also make an appearance in this
   section

> Anyway, it seems to turn out, that no escaping is done for the commit
> message in the patch format and that this can cause actual breakage
> with valid commit messages.
>
> Consider the following example:
> 1. I create a fresh repo, add a test file and use a commit message,
>    which contains a From line (even with the "magic" timestamp) and
>    some made up commit id (0000...)
>
>    ~/test$ git init foo; cd foo
>    Initialized empty Git repository in /home/calestyo/test/foo/.git/
>    ~/test/foo$ echo a >f; git add f
>    ~/test/foo$ git commit -m "msg1
>
>    From 0000000000000000000000000000000061603705 Mon Sep 17 00:00:00 2001
>    --
>    ---"
>    [master (root-commit) c08debc] msg1
>     1 file changed, 1 insertion(+)
>     create mode 100644 f

At first I thought that it is a magic string for a reason.  But there
could be an organic use-case here.  Like someone who is merely
code-quoting the magic string.  And also using code fences.

    Parse the magic mbox string

    Turns out that Git has something weird: a magic string to delimit emails
    or something.  Look here:

    ```
    From 0000000000000000000000000000000061603705 Mon Sep 17 00:00:00 2001
    ```

    It will look exactly like this except all those numbers in the middle.

    Detecting this is all we need to parse that 80-message file that Pedro
    sent us of our school project backup.

This will not apply because “Patch is empty”.

But if I change the commit message to use indentation for the
code block:

    Turns out that Git has something weird: a magic string to delimit emails
    or something.  Look here:

        From 0000000000000000000000000000000061603705 Mon Sep 17 00:00:00 2001

It does apply.

As for the `---`: it seems natural for people to use this as a section
break.  And I do end up needing that from time to time (maybe once a
year).  For people who stick to ASCII (`-` times 5):

    Detecting this is all we need to parse that 80-message file that Pedro
    sent us of our school project backup.

    -----

    We should look into Pedro’s backup system.  Why does he use email?

Or `***`.

> 3. Adding a 2nd commit, this time using the unified diff from the above
>    patch as commit message body(!).
>
>    ~/test/foo$ echo b >>f; git add f
>    ~/test/foo$ git commit -m "msg2
>
>    diff --git a/f b/f
>    new file mode 100644
>    index 0000000..7898192
>    --- /dev/null
>    +++ b/f
>    @@ -0,0 +1 @@
>    +a
>    --
>    2.45.2"
>    [master 6bbe38c] msg2
>     1 file changed, 1 insertion(+)
>    ~/test/foo$ git format-patch --root
>    0001-msg1.patch
>    0002-msg2.patch
>
>
> 4. To no surprise, git itself of course knows the difference between
>    commit message and actual patch, as show e.g. by the following,
>    where the commit message is indented (by git):

This has happened out in the wild before.[1]  And the advice was the same
as what you’re doing here.

🔗 1: https://lore.kernel.org/git/16297305.cDA1TJNmNo@earendil/#t

>
>    $ git log --patch | cat
>    commit 6bbe38c33680239ac9767e0e5095f9f32ad41ade
>    Author: Christoph Anton Mitterer <mail@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx>
>    Date:   Mon Nov 4 20:00:20 2024 +0100
>
>        msg2
>
>        diff --git a/f b/f
>        new file mode 100644
>        index 0000000..7898192
>        --- /dev/null
>        +++ b/f
>        @@ -0,0 +1 @@
>        +a
>        --
>        2.45.2
>
>    diff --git a/f b/f
>    index 7898192..422c2b7 100644
>    --- a/f
>    +++ b/f
>    @@ -1 +1,2 @@
>     a
>    +b
>
>    commit c08debcc502c78786ec71d50686ff0445a13b654
>    Author: Christoph Anton Mitterer <mail@xxxxxxxxxxxxxxxxxxxxxxxxxxxxx>
>    Date:   Mon Nov 4 19:58:45 2024 +0100
>
>        msg1
>
>        From 0000000000000000000000000000000061603705 Mon Sep 17 00:00:00 2001
>        --
>        ---
>
>    diff --git a/f b/f
>    new file mode 100644
>    index 0000000..7898192
>    --- /dev/null
>    +++ b/f
>    @@ -0,0 +1 @@
>    +a
>
>
> 5. Next I try whether git am can use the patches created above in a
>    fresh repo:
>
>    ~/test/foo$ cd ..; git init bar; cd bar
>    Initialized empty Git repository in /home/calestyo/test/bar/.git/
>    ~/test/bar$ git am ../foo/0001-msg1.patch
>    Patch is empty.
>    hint: When you have resolved this problem, run "git am --continue".
>    hint: If you prefer to skip this patch, run "git am --skip" instead.
>    hint: To record the empty patch as an empty commit, run "git am
> --allow-empty".
>    hint: To restore the original branch and stop patching, run "git am
> --abort".
>    hint: Disable this message with "git config advice.mergeConflict
> false"
>
>    That already fails for the first patch, the reason probably being my
>       From 0000...
>    line in the commit message.
>
>
> 6. So trying again with simply that From 000.. line removed
>
>    ~/test/bar$ sed -i '/^From 00000/d' ../foo/0001-msg1.patch
>    ~/test/bar$ git am ../foo/0001-msg1.patch
>    fatal: previous rebase directory .git/rebase-apply still exists but
> mbox given.
>
>    and again on a freshly created repo:
>
>    ~/test/bar$ cd ..; rm -rf bar; git init bar; cd bar
>    Initialized empty Git repository in /home/calestyo/test/bar/.git/
>    ~/test/bar$ git am ../foo/0001-msg1.patch
>    Applying: msg1
>    applying to an empty history
>
>    Ah, now it works, so it was indeed the (unusual but still valid
> commit message).
>
>
> 7. Now that 0001-msg1.patch is applied, let's try the 2nd patch:
>
>    ~/test/bar$ git am ../foo/0002-msg2.patch
>    Applying: msg2
>    error: f: already exists in index
>    Patch failed at 0001 msg2
>    hint: Use 'git am --show-current-patch=diff' to see the failed patch
>    hint: When you have resolved this problem, run "git am --continue".
>    hint: If you prefer to skip this patch, run "git am --skip" instead.
>    hint: To restore the original branch and stop patching, run "git am --abort".
>    hint: Disable this message with "git config advice.mergeConflict false"
>    ~/test/bar$
>
>    And again, ... the reason most likely git not being able to get that
>    the "first diff" is not actually a diff but part of the commit message.

It did the right thing for me with this (last part) of the commit message:

    We should look into Pedro’s backup system.  Why does he use email?

        diff --git a/a.txt b/a.txt
        index ce01362503..a32119c8aa 100644
        --- a/a.txt
        +++ b/a.txt
        @@ -1 +1,2 @@
         hello
        +goodbye

***

The `---` part could show up in a commit message naturally.  I guess the
status quote remedy (?) is to get burned once with a failed application
and not use that as a section break any more.

It seems like it would be nice if format-patch complained if it found
regex `^---$` in the commit body.  It is not at all a magic
(obfuscated/unlikely) string.  If it crashes on that you can fix it
straight away.  Then you don’t have to wait for two email roundtrips and
an git-am(1) application.

The magic string is unlikely but could happen.  The solution is to use
an indented block.  Same for the diff.  (Hopefully few have to
code-quote diffs)

But escaping things in this format?  That has knock-on effects like
escaping-escaping.  Certainly with one-character escapes.  I mean if you
are trying to be resistant to all kinds of strings and characters, and
code blocks (with fences, not indented) can contain basically anything,
you could end up with likely collisions.  Then the format gets harder to
read.

Maybe with yet more magic strings since (unlike single chars) they are
unlikely to occur in the input:

    Detecting this is all we need to parse that 80-message file that Pedro
    sent us of our school project backup.

    (blasphemous bath tub)---

    We should look into Pedro’s backup system.  Why does he use email?

And if you really meant to write that literally:

    Detecting this is all we need to parse that 80-message file that Pedro
    sent us of our school project backup.

    (blasphemous bath tub)(blasphemous bath tub)---

    We should look into Pedro’s backup system.  Why does he use email?

But then the problem is that people have to look at that.  Just because
they wanted to use `---` for section break.  Then they’re probably just
gonna stop using it for section breaks.  Back to square one (could have
just forbidden it).

> btw and shamelessly off-topic question:
> Any chance that git format-patch / am will ever support keeping track
> of the branch/merge history of generated / applied patches?
> That would be really neat.

What does this mean more concretely?  :)

> PS: Not subscribed, so please keep me CCed in case you want me to read
>     any possible replies :-)

That’s the list policy.





[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux