> fast-import has subtle differences in how it parses file paths between each > occurrence of <path> in the grammar. Many errors are suppressed or not checked, > which could lead to silent data corruption. A particularly bad case is when a > front-end sent escapes that Git doesn't recognize (e.g., hex escapes are not > supported), it would be treated as literal bytes instead of a quoted string. > > Bring path parsing into line with the documented behavior and improve > documentation to fill in missing details. Thanks for the review, Patrick. I've made several changes, which I think address your feedback. What's the protocol for adding `Reviewed-by`? Since I don't know whether I, you, or Junio add it, I've refrained from attaching your name to these patches. Changes since v1: * In fast-import: * Move `strbuf_release` outside of `parse_path_space` and `parse_path_eol`. * Retain allocations for static `strbuf`s. * Rename `allow_spaces` parameter of `parse_path` to `include_spaces`. * Extract change to neighboring comments as patch 8. * In tests: * Test `` for the root path additionally in all tests using `""`. * Pass all arguments by positional variables. * Use `local`. * Use `test_when_finished` instead of manual cleanup. * In documentation: * Better document conditions under which a path is considered quoted or unquoted. * Reword commit messages. Thalia Thalia Archibald (8): fast-import: tighten path unquoting fast-import: directly use strbufs for paths fast-import: allow unquoted empty path for root fast-import: remove dead strbuf fast-import: improve documentation for unquoted paths fast-import: document C-style escapes for paths fast-import: forbid escaped NUL in paths fast-import: make comments more precise Documentation/git-fast-import.txt | 30 +- builtin/fast-import.c | 156 ++++---- t/t9300-fast-import.sh | 617 +++++++++++++++++++++--------- 3 files changed, 541 insertions(+), 262 deletions(-) Range-diff against v1: 1: 8d9e0b25cb ! 1: e790bdf714 fast-import: tighten parsing of paths @@ Metadata Author: Thalia Archibald <thalia@xxxxxxxxxxxxx> ## Commit message ## - fast-import: tighten parsing of paths + fast-import: tighten path unquoting Path parsing in fast-import is inconsistent and many unquoting errors - are suppressed. + are suppressed or not checked. - `<path>` appears in the grammar in these places: + <path> appears in the grammar in these places: filemodify ::= 'M' SP <mode> (<dataref> | 'inline') SP <path> LF filedelete ::= 'D' SP <path> LF @@ Commit message and fast-import.c parses them in five different ways: 1. For filemodify and filedelete: - If `<path>` is a valid quoted string, unquote it; - otherwise, treat it as literal bytes (including any number of SP). + Try to unquote <path>. If it unquotes without errors, use the + unquoted version; otherwise, treat it as literal bytes to the end of + the line (including any number of SP). 2. For filecopy (source) and filerename (source): - If `<path>` is a valid quoted string, unquote it; - otherwise, treat it as literal bytes until the next SP. + Try to unquote <path>. If it unquotes without errors, use the + unquoted version; otherwise, treat it as literal bytes up to, but not + including, the next SP. 3. For filecopy (dest) and filerename (dest): - Like 1., but an unquoted empty string is an error. + Like 1., but an unquoted empty string is forbidden. 4. For ls: - If `<path>` starts with `"`, unquote it and report parse errors; - otherwise, treat it as literal bytes (including any number of SP). + If <path> starts with `"`, unquote it and report parse errors; + otherwise, treat it as literal bytes to the end of the line + (including any number of SP). 5. For ls-commit: - Unquote `<path>` and report parse errors. + Unquote <path> and report parse errors. (It must start with `"` to disambiguate from ls.) In the first three, any errors from trying to unquote a string are suppressed, so a quoted string that contains invalid escapes would be interpreted as literal bytes. For example, `"\xff"` would fail to unquote (because hex escapes are not supported), and it would instead be - interpreted as the byte sequence `"` `\` `x` `f` `f` `"`, which is + interpreted as the byte sequence '"', '\\', 'x', 'f', 'f', '"', which is certainly not intended. Some front-ends erroneously use their language's - standard quoting routine and could silently introduce escapes that would - be incorrectly parsed due to this. + standard quoting routine instead of matching Git's, which could silently + introduce escapes that would be incorrectly parsed due to this and lead + to data corruption. - The documentation states that “To use a source path that contains SP the - path must be quoted.”, so it is expected that some implementations - depend on spaces being allowed in paths in the final position. Thus we - have two documented ways to parse paths, so simplify the implementation - to that. + The documentation states “To use a source path that contains SP the path + must be quoted.”, so it is expected that some implementations depend on + spaces being allowed in paths in the final position. Thus we have two + documented ways to parse paths, so simplify the implementation to that. Now we have: 1. `parse_path_eol` for filemodify, filedelete, filecopy (dest), filerename (dest), ls, and ls-commit: - If `<path>` starts with `"`, unquote it and report parse errors; - otherwise, treat it as literal bytes (including any number of SP). - Garbage after a quoted string or an unquoted empty string are errors. - (In ls-commit, it must be quoted to disambiguate from ls.) + If <path> starts with `"`, unquote it and report parse errors; + otherwise, treat it as literal bytes to the end of the line + (including any number of SP). 2. `parse_path_space` for filecopy (source) and filerename (source): - If `<path>` starts with `"`, unquote it and report parse errors; - otherwise, treat it as literal bytes until the next SP. - It must be followed by a SP. An unquoted empty string is an error. + If <path> starts with `"`, unquote it and report parse errors; + otherwise, treat it as literal bytes up to, but not including, the + next SP. It must be followed by SP. + + There remain two special cases: The dest <path> in filecopy and rename + cannot be an unquoted empty string (this will be addressed subsequently) + and <path> in ls-commit must be quoted to disambiguate it from ls. Signed-off-by: Thalia Archibald <thalia@xxxxxxxxxxxxx> - ## Documentation/git-fast-import.txt ## -@@ Documentation/git-fast-import.txt: The value of `<path>` must be in canonical form. That is it must not: - * contain the special component `.` or `..` (e.g. `foo/./bar` and - `foo/../bar` are invalid). - --The root of the tree can be represented by an empty string as `<path>`. -+The root of the tree can be represented by a quoted empty string (`""`) -+as `<path>`. - - It is recommended that `<path>` always be encoded using UTF-8. - - ## builtin/fast-import.c ## -@@ builtin/fast-import.c: static int parse_mapped_oid_hex(const char *hex, struct object_id *oid, const ch - * - * idnum ::= ':' bigint; - * -- * Return the first character after the value in *endptr. -+ * Update *endptr to point to the first character after the value. - * - * Complain if the following character is not what is expected, - * either a space or end of the string. -@@ builtin/fast-import.c: static uintmax_t parse_mark_ref_eol(const char *p) - } - - /* -- * Parse the mark reference, demanding a trailing space. Return a -- * pointer to the space. -+ * Parse the mark reference, demanding a trailing space. Update *p to -+ * point to the first character after the space. - */ - static uintmax_t parse_mark_ref_space(const char **p) - { @@ builtin/fast-import.c: static uintmax_t parse_mark_ref_space(const char **p) return mark; } @@ builtin/fast-import.c: static uintmax_t parse_mark_ref_space(const char **p) +/* + * Parse the path string into the strbuf. It may be quoted with escape sequences + * or unquoted without escape sequences. When unquoted, it may only contain a -+ * space if `allow_spaces` is nonzero. ++ * space if `include_spaces` is nonzero. + */ -+static void parse_path(struct strbuf *sb, const char *p, const char **endp, int allow_spaces, const char *field) ++static void parse_path(struct strbuf *sb, const char *p, const char **endp, int include_spaces, const char *field) +{ -+ strbuf_reset(sb); + if (*p == '"') { + if (unquote_c_style(sb, p, endp)) + die("Invalid %s: %s", field, command_buf.buf); + } else { -+ if (allow_spaces) ++ if (include_spaces) + *endp = p + strlen(p); + else + *endp = strchr(p, ' '); -+ if (*endp == p) -+ die("Missing %s: %s", field, command_buf.buf); + strbuf_add(sb, p, *endp - p); + } +} @@ builtin/fast-import.c: static uintmax_t parse_mark_ref_space(const char **p) struct object_id oid; uint16_t mode, inline_data = 0; @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b) - die("Missing space after SHA1: %s", command_buf.buf); } -- strbuf_reset(&uq); + strbuf_reset(&uq); - if (!unquote_c_style(&uq, p, &endp)) { - if (*endp) - die("Garbage after path in: %s", command_buf.buf); @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b static struct strbuf uq = STRBUF_INIT; - const char *endp; -- strbuf_reset(&uq); + strbuf_reset(&uq); - if (!unquote_c_style(&uq, p, &endp)) { - if (*endp) - die("Garbage after path in: %s", command_buf.buf); @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b - const char *endp; struct tree_entry leaf; -- strbuf_reset(&s_uq); + strbuf_reset(&s_uq); - if (!unquote_c_style(&s_uq, s, &endp)) { - if (*endp != ' ') - die("Missing space after source: %s", command_buf.buf); @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b - strbuf_add(&s_uq, s, endp - s); - } + parse_path_space(&s_uq, p, &p, "source"); -+ parse_path_eol(&d_uq, p, "dest"); s = s_uq.buf; -- + - endp++; - if (!*endp) -- die("Missing dest: %s", command_buf.buf); ++ if (!p) + die("Missing dest: %s", command_buf.buf); - - d = endp; -- strbuf_reset(&d_uq); + strbuf_reset(&d_uq); - if (!unquote_c_style(&d_uq, d, &endp)) { - if (*endp) - die("Garbage after dest in: %s", command_buf.buf); - d = d_uq.buf; - } ++ parse_path_eol(&d_uq, p, "dest"); + d = d_uq.buf; memset(&leaf, 0, sizeof(leaf)); if (rename) -@@ builtin/fast-import.c: static void parse_ls(const char *p, struct branch *b) +@@ builtin/fast-import.c: static void print_ls(int mode, const unsigned char *hash, const char *path) + + static void parse_ls(const char *p, struct branch *b) { ++ static struct strbuf uq = STRBUF_INIT; struct tree_entry *root = NULL; struct tree_entry leaf = {NULL}; -+ static struct strbuf uq = STRBUF_INIT; - /* ls SP (<tree-ish> SP)? <path> */ - if (*p == '"') { @@ builtin/fast-import.c: static void parse_ls(const char *p, struct branch *b) root->versions[1].mode = S_IFDIR; load_tree(root); @@ builtin/fast-import.c: static void parse_ls(const char *p, struct branch *b) - die("Garbage after path in: %s", command_buf.buf); - p = uq.buf; - } ++ strbuf_reset(&uq); + parse_path_eol(&uq, p, "path"); + p = uq.buf; tree_content_get(root, p, &leaf, 1); @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must +# Path parsing +# +# There are two sorts of ways a path can be parsed, depending on whether it is -+# the last field on the line. Additionally, ls without a <tree-ish> has a -+# special case. Test every occurrence of <path> in the grammar against every -+# error case. ++# the last field on the line. Additionally, ls without a <dataref> has a special ++# case. Test every occurrence of <path> in the grammar against every error case. +# + +# +# Valid paths at the end of a line: filemodify, filedelete, filecopy (dest), +# filerename (dest), and ls. +# -+# commit :301 from root -- modify hello.c ++# commit :301 from root -- modify hello.c (for setup) +# commit :302 from :301 -- modify $path +# commit :303 from :302 -- delete $path +# commit :304 from :301 -- copy hello.c $path @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must +# ls :305 $path +# +test_path_eol_success () { -+ test="$1" path="$2" unquoted_path="$3" ++ local test="$1" path="$2" unquoted_path="$3" + test_expect_success "S: paths at EOL with $test must work" ' ++ test_when_finished "git branch -D S-path-eol" && ++ + git fast-import --export-marks=marks.out <<-EOF >out 2>err && + blob + mark :401 @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must + hallo welt + BLOB + -+ commit refs/heads/path-eol ++ commit refs/heads/S-path-eol + mark :301 + committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE + data <<COMMIT @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must + COMMIT + M 100644 :401 hello.c + -+ commit refs/heads/path-eol ++ commit refs/heads/S-path-eol + mark :302 + committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE + data <<COMMIT @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must + from :301 + M 100644 :402 '"$path"' + -+ commit refs/heads/path-eol ++ commit refs/heads/S-path-eol + mark :303 + committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE + data <<COMMIT @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must + from :302 + D '"$path"' + -+ commit refs/heads/path-eol ++ commit refs/heads/S-path-eol + mark :304 + committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE + data <<COMMIT @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must + from :301 + C hello.c '"$path"' + -+ commit refs/heads/path-eol ++ commit refs/heads/S-path-eol + mark :305 + committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE + data <<COMMIT @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must + git ls-tree $commit_r >tree_r.out && + test_cmp tree_r.exp tree_r.out && + -+ test_cmp out tree_r.exp && -+ -+ git branch -D path-eol ++ test_cmp out tree_r.exp + ' +} + @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must +# +# Valid paths before a space: filecopy (source) and filerename (source). +# -+# commit :301 from root -- modify $path ++# commit :301 from root -- modify $path (for setup) +# commit :302 from :301 -- copy $path hello2.c +# commit :303 from :301 -- rename $path hello2.c +# +test_path_space_success () { -+ test="$1" path="$2" unquoted_path="$3" ++ local test="$1" path="$2" unquoted_path="$3" + test_expect_success "S: paths before space with $test must work" ' ++ test_when_finished "git branch -D S-path-space" && ++ + git fast-import --export-marks=marks.out <<-EOF 2>err && + blob + mark :401 @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must + hello world + BLOB + -+ commit refs/heads/path-space ++ commit refs/heads/S-path-space + mark :301 + committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE + data <<COMMIT @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must + COMMIT + M 100644 :401 '"$path"' + -+ commit refs/heads/path-space ++ commit refs/heads/S-path-space + mark :302 + committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE + data <<COMMIT @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must + from :301 + C '"$path"' hello2.c + -+ commit refs/heads/path-space ++ commit refs/heads/S-path-space + mark :303 + committer $GIT_COMMITTER_NAME <$GIT_COMMITTER_EMAIL> $GIT_COMMITTER_DATE + data <<COMMIT @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must + + printf "100644 blob $blob\thello2.c\n" >tree_r.exp && + git ls-tree $commit_r >tree_r.out && -+ test_cmp tree_r.exp tree_r.out && -+ -+ git branch -D path-space ++ test_cmp tree_r.exp tree_r.out + ' +} + @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must +# of <path> in the grammar against all error kinds. +# +test_path_fail () { -+ what="$1" path="$2" err_grep="$3" ++ local change="$1" what="$2" prefix="$3" path="$4" suffix="$5" err_grep="$6" + test_expect_success "S: $change with $what must fail" ' + test_must_fail git fast-import <<-EOF 2>err && + blob @@ t/t9300-fast-import.sh: test_expect_success 'S: ls with garbage after sha1 must +} + +test_path_base_fail () { -+ test_path_fail 'unclosed " in '"$field" '"hello.c' "Invalid $field" -+ test_path_fail "invalid escape in quoted $field" '"hello\xff"' "Invalid $field" ++ local change="$1" prefix="$2" field="$3" suffix="$4" ++ test_path_fail "$change" 'unclosed " in '"$field" "$prefix" '"hello.c' "$suffix" "Invalid $field" ++ test_path_fail "$change" "invalid escape in quoted $field" "$prefix" '"hello\xff"' "$suffix" "Invalid $field" +} +test_path_eol_quoted_fail () { -+ test_path_base_fail -+ test_path_fail "garbage after quoted $field" '"hello.c"x' "Garbage after $field" -+ test_path_fail "space after quoted $field" '"hello.c" ' "Garbage after $field" ++ local change="$1" prefix="$2" field="$3" suffix="$4" ++ test_path_base_fail "$change" "$prefix" "$field" "$suffix" ++ test_path_fail "$change" "garbage after quoted $field" "$prefix" '"hello.c"x' "$suffix" "Garbage after $field" ++ test_path_fail "$change" "space after quoted $field" "$prefix" '"hello.c" ' "$suffix" "Garbage after $field" +} +test_path_eol_fail () { -+ test_path_eol_quoted_fail -+ test_path_fail 'empty unquoted path' '' "Missing $field" ++ local change="$1" prefix="$2" field="$3" suffix="$4" ++ test_path_eol_quoted_fail "$change" "$prefix" "$field" "$suffix" +} +test_path_space_fail () { -+ test_path_base_fail -+ test_path_fail 'empty unquoted path' '' "Missing $field" -+ test_path_fail "missing space after quoted $field" '"hello.c"x' "Missing space after $field" ++ local change="$1" prefix="$2" field="$3" suffix="$4" ++ test_path_base_fail "$change" "$prefix" "$field" "$suffix" ++ test_path_fail "$change" "missing space after quoted $field" "$prefix" '"hello.c"x' "$suffix" "Missing space after $field" +} + -+change=filemodify prefix='M 100644 :1 ' field=path suffix='' test_path_eol_fail -+change=filedelete prefix='D ' field=path suffix='' test_path_eol_fail -+change=filecopy prefix='C ' field=source suffix=' world.c' test_path_space_fail -+change=filecopy prefix='C hello.c ' field=dest suffix='' test_path_eol_fail -+change=filerename prefix='R ' field=source suffix=' world.c' test_path_space_fail -+change=filerename prefix='R hello.c ' field=dest suffix='' test_path_eol_fail -+change='ls (in commit)' prefix='ls :2 ' field=path suffix='' test_path_eol_fail ++test_path_eol_fail filemodify 'M 100644 :1 ' path '' ++test_path_eol_fail filedelete 'D ' path '' ++test_path_space_fail filecopy 'C ' source ' world.c' ++test_path_eol_fail filecopy 'C hello.c ' dest '' ++test_path_space_fail filerename 'R ' source ' world.c' ++test_path_eol_fail filerename 'R hello.c ' dest '' ++test_path_eol_fail 'ls (in commit)' 'ls :2 ' path '' + -+# When 'ls' has no <tree-ish>, the <path> must be quoted. -+change='ls (without tree-ish in commit)' prefix='ls ' field=path suffix='' \ -+test_path_eol_quoted_fail && -+test_path_fail 'empty unquoted path' '' "Invalid dataref" ++# When 'ls' has no <dataref>, the <path> must be quoted. ++test_path_eol_quoted_fail 'ls (without dataref in commit)' 'ls ' path '' + ### ### series T (ls) 2: a2aca9f9e6 ! 2: 82a6f53c13 fast-import: directly use strbufs for paths @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b die("Missing space after SHA1: %s", command_buf.buf); } +- strbuf_reset(&uq); - parse_path_eol(&uq, p, "path"); - p = uq.buf; ++ strbuf_reset(&path); + parse_path_eol(&path, p, "path"); /* Git does not track empty, non-toplevel directories. */ @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b - static struct strbuf uq = STRBUF_INIT; + static struct strbuf path = STRBUF_INIT; +- strbuf_reset(&uq); - parse_path_eol(&uq, p, "path"); - p = uq.buf; - tree_content_remove(&b->branch_tree, p, NULL, 1); ++ strbuf_reset(&path); + parse_path_eol(&path, p, "path"); + tree_content_remove(&b->branch_tree, path.buf, NULL, 1); } @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b + static struct strbuf dest = STRBUF_INIT; struct tree_entry leaf; +- strbuf_reset(&s_uq); - parse_path_space(&s_uq, p, &p, "source"); -- parse_path_eol(&d_uq, p, "dest"); - s = s_uq.buf; -- d = d_uq.buf; ++ strbuf_reset(&source); + parse_path_space(&source, p, &p, "source"); + + if (!p) + die("Missing dest: %s", command_buf.buf); +- strbuf_reset(&d_uq); +- parse_path_eol(&d_uq, p, "dest"); +- d = d_uq.buf; ++ strbuf_reset(&dest); + parse_path_eol(&dest, p, "dest"); memset(&leaf, 0, sizeof(leaf)); @@ builtin/fast-import.c: static void file_change_m(const char *p, struct branch *b &leaf.versions[1].oid, leaf.versions[1].mode, leaf.tree); -@@ builtin/fast-import.c: static void parse_ls(const char *p, struct branch *b) +@@ builtin/fast-import.c: static void print_ls(int mode, const unsigned char *hash, const char *path) + + static void parse_ls(const char *p, struct branch *b) { - struct tree_entry *root = NULL; - struct tree_entry leaf = {NULL}; - static struct strbuf uq = STRBUF_INIT; + static struct strbuf path = STRBUF_INIT; + struct tree_entry *root = NULL; + struct tree_entry leaf = {NULL}; - /* ls SP (<tree-ish> SP)? <path> */ - if (*p == '"') { @@ builtin/fast-import.c: static void parse_ls(const char *p, struct branch *b) root->versions[1].mode = S_IFDIR; load_tree(root); } +- strbuf_reset(&uq); - parse_path_eol(&uq, p, "path"); - p = uq.buf; - tree_content_get(root, p, &leaf, 1); ++ strbuf_reset(&path); + parse_path_eol(&path, p, "path"); + tree_content_get(root, path.buf, &leaf, 1); /* 3: ecaf4883d1 < -: ---------- fast-import: release unfreed strbufs -: ---------- > 3: 893bbf5e73 fast-import: allow unquoted empty path for root 4: 058a38416a ! 4: cb05a184e6 fast-import: remove dead strbuf @@ Metadata ## Commit message ## fast-import: remove dead strbuf - The strbuf in `note_change_n` has been unused since the function was + The strbuf in `note_change_n` is to copy the remainder of `p` before + potentially invalidating it when reading the next line. However, `p` is + not used after that point. It has been unused since the function was created in a8dd2e7d2b (fast-import: Add support for importing commit notes, 2009-10-09) and looks to be a fossil from adapting - `note_change_m`. Remove it. + `file_change_m`. Remove it. Signed-off-by: Thalia Archibald <thalia@xxxxxxxxxxxxx> 5: a5e8df0759 < -: ---------- fast-import: document C-style escapes for paths 6: 9792940ba9 < -: ---------- fast-import: forbid escaped NUL in paths -: ---------- > 5: 1f34b632d7 fast-import: improve documentation for unquoted paths -: ---------- > 6: 82a4da68af fast-import: document C-style escapes for paths -: ---------- > 7: c087c6a860 fast-import: forbid escaped NUL in paths -: ---------- > 8: a503c55b83 fast-import: make comments more precise -- 2.44.0