Hi Phillip, On Wed, 16 Nov 2022, Phillip Wood wrote: > On 15/11/2022 18:20, Johannes Schindelin via GitGitGadget wrote: > > From: Johannes Schindelin <johannes.schindelin@xxxxxx> > > > > Internally, the `git range-diff` command spawns a `git log` process and > > parses its output for the given commit ranges. > > > > This works well when the patches that need to be compared are present in > > the local repository in the form of commits. > > > > In scenarios where that is not the case, the `range-diff` command is > > currently less helpful. > > > > The Git mailing list is such a scenario: Instead of using Git to > > exchange commits, the patches are sent there as plain-text and no commit > > range can be specified to let `range-diff` consume those patches. > > > > Instead, the expectation is to download the mails, apply them locally > > and then use `range-diff`. This can be quite cumbersome e.g. when a > > suitable base revision has to be found first where the patch applies > > cleanly. > > That's a good motivation for this change. Side note (potentially fun to know): A couple of weeks ago, I wanted to work on this because I was asked to review a new patch series iteration on the Git mailing list (I forgot which one it was, precisely), which came without a range-diff, and I wanted to download both iterations using a variation of https://github.com/git-for-windows/build-extra/blob/HEAD/apply-from-lore.sh and then use the mbox mode of `range-diff` on them. Unfortunately, I ran out of time back then (it's a common theme for me these days), and the patch series (rightfully) advanced to `next` without my review, and I let this here patch slide. > > +static inline int strtost(char const *s, size_t *result, const char **end) > > +{ > > + unsigned long u; > > + char *p; > > + > > + errno = 0; > > + /* negative values would be accepted by strtoul */ > > + if (*s == '-') > > + return -1; > > I think it is right to treat the input as untrusted and so look for malformed > hunk headers. However This test is not sufficient for that, we expect a digit > so I think > > if (!isdigit(*s)) > return -1; > > would be safer. The use of strtoul() looks good as we set errno to zero before > the call and check both errno and endp afterwards. Good point. As you might have guessed, this function is a copy/edit of existing code, in this instance `strtoul_ui()`. I have made the change you suggested. > > > + u = strtoul(s, &p, 10); > > + if (errno || p == s) > > + return -1; > > + if (result) > > + *result = u; > > + *end = p; > > + > > + return 0; > > +} > > + > > +static int parse_hunk_header(const char *p, > > + size_t *old_count, size_t *new_count, > > + const char **end) > > +{ > > + size_t o = 1, n = 1; > > + > > + if (!skip_prefix(p, "@@ -", &p) || > > + strtost(p, NULL, &p) || > > + (*p != ' ' && (*p != ',' || strtost(p + 1, &o, &p))) || > > It took me a minute to understand the double negatives but it is correctly > checking if we have ' ' or ',<digits>' I actually hesitated when I wrote this, and only kept the code as-is because I remembered how often Junio uses double negatives and thought it would be a good joke to do the same here. Joke's obviously on me, though. I've hence changed it to: /* The range is -<start>[,<count>], defaulting to count = 1 */ !(*p == ' ' || (*p == ',' && !strtost(p + 1, &o, &p))) || and /* The range is * +<start>[,<count>], * defaulting to count = 1 */ !(*p == ' ' || (*p == ',' && !strtost(p + 1, &n, &p))) || > > > + !skip_prefix(p, " +", &p) || > > + strtost(p, NULL, &p) || > > + (*p != ' ' && (*p != ',' || strtost(p + 1, &n, &p))) || > > + !skip_prefix(p, " @@", &p)) > > + return -1; > > + > > + *old_count = o; > > + *new_count = n; > > + *end = p; > > + > > + return 0; > > +} > > + > > +static inline int find_eol(const char *line, size_t size) > > +{ > > + char *eol; > > + > > + eol = memchr(line, '\n', size); > > + if (!eol) > > + return size; > > + > > + if (eol != line && eol[-1] == '\r') > > + eol[-1] = '\0'; > > + else > > + *eol = '\0'; > > + > > + return eol + 1 - line; > > We return the offset to the start of the next line, not the length of the > line. This will be important later. You're right. I changed the name to `find_next_line()` and added an informative comment on top of the function. > > > +} > > + > > +static int read_mbox(const char *path, struct string_list *list) > > +{ > > + struct strbuf buf = STRBUF_INIT, contents = STRBUF_INIT; > > + struct strbuf long_subject = STRBUF_INIT; > > + struct patch_util *util = NULL; > > + enum { > > + MBOX_BEFORE_HEADER, > > + MBOX_IN_HEADER, > > + MBOX_IN_COMMIT_MESSAGE, > > + MBOX_AFTER_TRIPLE_DASH, > > + MBOX_IN_DIFF > > + } state = MBOX_BEFORE_HEADER; > > + char *line, *current_filename = NULL; > > + int len; > > + size_t size, old_count = 0, new_count = 0; > > + const char *author = NULL, *subject = NULL; > > + > > + if (!strcmp(path, "-")) { > > + if (strbuf_read(&contents, STDIN_FILENO, 0) < 0) > > + return error_errno(_("could not read stdin")); > > + } else if (strbuf_read_file(&contents, path, 0) < 0) > > + return error_errno(_("could not read '%s'"), path); > > + > > + line = contents.buf; > > + size = contents.len; > > + for (; size > 0; size -= len, line += len) { > > size is unsigned so we're effectively testing 'size != 0' which means if we're > off by one somewhere we'll have an undetected buffer overflow. Using a signed > type wouldn't prevent the buffer overflow but it would limit its extent. `contents.len` is of type `size_t`, though, and I'd like to stay consistent. I did remove the misleading `> 0` from the condition, though. > > > + const char *p; > > + > > + len = find_eol(line, size); > > Here len is not the length of line if it originally ended "\r\n". > > > + if (state == MBOX_BEFORE_HEADER) { > > + if (!skip_prefix(line, "From ", &p)) > > + continue; > > + > > + util = xcalloc(1, sizeof(*util)); > > + if (get_oid_hex(p, &util->oid) < 0) > > + oidcpy(&util->oid, null_oid()); > > + util->matching = -1; > > + author = subject = NULL; > > + > > + state = MBOX_IN_HEADER; > > I wondered if there should there be a `continue;` here but I think it probably > needs to "fall-through" to the MBOX_IN_HEADER handling below. A comment to > clarify that would be helpful. You're absolutely correct, there totally should be a `continue` here: we saw a `From `, and in the `MBOX_IN_HEADER` we look for either an empty line, the `From:` header or the `Subject:` header, so we know that the current line cannot match, no need to fall through. > > > + } > > + > > + if (starts_with(line, "diff --git ")) { > > + struct patch patch = { 0 }; > > + struct strbuf root = STRBUF_INIT; > > + int linenr = 0; > > + int orig_len; > > + > > + state = MBOX_IN_DIFF; > > + old_count = new_count = 0; > > + strbuf_addch(&buf, '\n'); > > + if (!util->diff_offset) > > + util->diff_offset = buf.len; > > + line[len - 1] = '\n'; > > Here the line will still be NUL terminated if it originally ended "\r\n" which > presumably messes up the call to parse_git_diff_header() below. I have not > checked if parse_git_diff_header() can handle "\r\n" when it is parsing the > rest of the diff header. I changed it locally to reinstate the `\r\n`, only to figure out that almost the entire `apply.c` machinery totally falls over CR/LF line endings. I changed the code to detect a Carriage Return and error out if one is detected. > > > + orig_len = len; > > + len = parse_git_diff_header(&root, &linenr, 1, line, > > + len, size, &patch); > > + if (len < 0) { > > + error(_("could not parse git header '%.*s'"), > > + orig_len, line); > > + free(util); > > + free(current_filename); > > + string_list_clear(list, 1); > > + strbuf_release(&buf); > > + strbuf_release(&contents); > > + strbuf_release(&long_subject); > > + return -1; > > + } > > + > > + if (patch.old_name) > > + skip_prefix(patch.old_name, "a/", > > + (const char **)&patch.old_name); > > + if (patch.new_name) > > + skip_prefix(patch.new_name, "b/", > > + (const char **)&patch.new_name); > > I think this is fine for now but we might want to support other prefixes in > the future. If it is not a copy or rename then the filename can be deduced by > finding the common tail of patch.old_name and patch.new_name and stripping > anything before the first '/'. If it is a copy or rename then I suspect there > is no prefix (though I've not checked) Since `skip_prefix()` does not do anything if there is no match, I think it is sane to err by not stripping anything unless the expected `a/` and `b/` prefixes are seen. > > > + strbuf_addstr(&buf, " ## "); > > + if (patch.is_new > 0) > > `patch.is_now` and `patch.is_delete` are booleans like `patch.is_rename` so we > don't need the '> 0' Good catch. > > > + strbuf_addf(&buf, "%s (new)", patch.new_name); > > + else if (patch.is_delete > 0) > > + strbuf_addf(&buf, "%s (deleted)", > > patch.old_name); > > + else if (patch.is_rename) > > + strbuf_addf(&buf, "%s => %s", patch.old_name, > > patch.new_name); > > + else > > + strbuf_addstr(&buf, patch.new_name); > > + > > + free(current_filename); > > + if (patch.is_delete > 0) > > + current_filename = xstrdup(patch.old_name); > > + else > > + current_filename = xstrdup(patch.new_name); > > + > > + if (patch.new_mode && patch.old_mode && > > + patch.old_mode != patch.new_mode) > > + strbuf_addf(&buf, " (mode change %06o => > > %06o)", > > + patch.old_mode, patch.new_mode); > > + > > + strbuf_addstr(&buf, " ##\n"); > > + util->diffsize++; > > + } else if (state == MBOX_IN_HEADER) { > > + if (!line[0]) { > > + state = MBOX_IN_COMMIT_MESSAGE; > > + /* Look for an in-body From: */ > > + if (size > 5 && skip_prefix(line + 1, "From: > > ", &p)) { > > The "size > 5" seems a bit unnecessary as we're using skip_prefix() Right! > > > + size -= p - line; > > + line += p - line; > > This is good, we're accounting for reading the next line. > > > + len = find_eol(line, size); > > + > > + while (isspace(*p)) > > + p++; > > + author = p; > > + } > > + strbuf_addstr(&buf, " ## Metadata ##\n"); > > + if (author) > > + strbuf_addf(&buf, "Author: %s\n", > > author); > > + strbuf_addstr(&buf, "\n ## Commit message > > ##\n"); > > + if (subject) > > + strbuf_addf(&buf, " %s\n\n", > > subject); > > + } else if (skip_prefix(line, "From: ", &p)) { > > + while (isspace(*p)) > > + p++; > > + author = p; > > + } else if (skip_prefix(line, "Subject: ", &p)) { > > + const char *q; > > + > > + while (isspace(*p)) > > + p++; > > + subject = p; > > + > > + if (starts_with(p, "[PATCH") && > > + (q = strchr(p, ']'))) { > > + q++; > > + while (isspace(*q)) > > + q++; > > + subject = q; > > + } > > + > > + if (len < size && line[len] == ' ') { > > + /* handle long subject */ > > + strbuf_reset(&long_subject); > > + strbuf_addstr(&long_subject, subject); > > + while (len < size && line[len] == ' ') > > { > > + line += len; > > + size -= len; > > + len = find_eol(line, size); > > + strbuf_addstr(&long_subject, > > line); > > Looks good > > > + } > > + subject = long_subject.buf; > > + } > > + } > > + } else if (state == MBOX_IN_COMMIT_MESSAGE) { > > + if (!*line) > > Not a big issue elsewhere you've used "!line[0]" > Style: there should be braces on this branch. Fixed on both accounts. > [... addressed in follow-up mail...] > > I think this is a useful addition, it could perhaps benefit from more tests > though. Having tests for bad input, "\r\n" line endings and getting the author > from a From: header as well as an in-body From: line would give a bit more > reassurance about how robust the parser is. Good point, I've augmented the test case a bit more. Thank you again. It is a real pleasure to receive these constructive reviews from you that pay so much attention to detail, and always lead to a clear path forward. Thanks! Dscho