Am 07.03.24 um 10:26 schrieb Jeff King: > As with the previous patch, we need to swap out single-byte matching for > something like starts_with() to match all bytes of a multi-byte comment > character. But for cases where the buffer is not NUL-terminated (and we > instead have an explicit size or end pointer), it's not safe to use > starts_with(), as it might walk off the end of the buffer. > > Let's introduce a new starts_with_mem() that does the same thing but > also accepts the length of the "haystack" str and makes sure not to walk > past it. > > Note that in most cases the existing code did not need a length check at > all, since it was written in a way that knew we had at least one byte > available (and that was all we checked). So I had to read each one to > find the appropriate bounds. The one exception is sequencer.c's > add_commented_lines(), where we can actually get rid of the length > check. Just like starts_with(), our starts_with_mem() handles an empty > haystack variable by not matching (assuming a non-empty prefix). > > A few notes on the implementation of starts_with_mem(): > > - it would be equally correct to take an "end" pointer (and indeed, > many of the callers have this and have to subtract to come up with > the length). I think taking a ptr/size combo is a more usual > interface for our codebase, though, and has the added benefit that > the function signature makes it harder to mix up the three > parameters. > > - we could obviously build starts_with() on top of this by passing > strlen(str) as the length. But it's possible that starts_with() is a > relatively hot code path, and it should not pay that penalty (it can > generally return an answer proportional to the size of the prefix, > not the whole string). > > - it naively feels like xstrncmpz() should be able to do the same > thing, but that's not quite true. If you pass the length of the > haystack buffer, then strncmp() finds that a shorter prefix string > is "less than" than the haystack, even if the haystack starts with > the prefix. If you pass the length of the prefix, then you risk > reading past the end of the haystack if it is shorter than the > prefix. So I think we really do need a new function. Yes. xstrncmpz() compares a NUL-terminated string and a length-limited string. If you want to check whether the former is a prefix of the latter then you need to stop comparing when reaching its NUL, and also after exhausting the latter. So you need to take both lengths into account: int starts_with_mem(const char *str, size_t len, const char *prefix) { size_t prefixlen = strlen(prefix); return prefixlen <= len && !xstrncmpz(prefix, str, prefixlen); } Using memcmp() here is equivalent and simpler: int starts_with_mem(const char *str, size_t len, const char *prefix) { size_t prefixlen = strlen(prefix); return prefixlen <= len && !memcmp(str, prefix, prefixlen); } And your version below avoids function calls and avoids traversing the strings beyond their common prefix, of course. > > Signed-off-by: Jeff King <peff@xxxxxxxx> > --- > Arguably starts_with() and this new function should both be inlined, > like we do for skip_prefix(), but I think that's out of scope for this > series. Inlining would allow the compiler to unroll the loop for string constants. I doubt it would do that for variables, as in the code below. Inlining the strlen()+memcmp() version above might allow the compiler to push the strlen() call out of a loop. Would any of that improve performance noticeably? For the call sites below I doubt it. But it would probably increase the object text size. > And it's possible I was simply too dumb to figure out xstrncmpz() here. > I'm waiting for René to show up and tell me how to do it. ;) Nah, it's not a good fit, as it requires the two strings to have the same length. > > IMHO this is the trickiest commit of the whole series, as it would be > easy to get the length computations subtly wrong. > > commit.c | 3 ++- > sequencer.c | 4 ++-- > strbuf.c | 11 +++++++++++ > strbuf.h | 1 + > trailer.c | 4 ++-- > 5 files changed, 18 insertions(+), 5 deletions(-) > > diff --git a/commit.c b/commit.c > index ef679a0b93..531a666cba 100644 > --- a/commit.c > +++ b/commit.c > @@ -1796,7 +1796,8 @@ size_t ignored_log_message_bytes(const char *buf, size_t len) > else > next_line++; > > - if (buf[bol] == comment_line_char || buf[bol] == '\n') { > + if (starts_with_mem(buf + bol, cutoff - bol, comment_line_str) || > + buf[bol] == '\n') { > /* is this the first of the run of comments? */ > if (!boc) > boc = bol; > diff --git a/sequencer.c b/sequencer.c > index 991a2dbe96..664986e3b2 100644 > --- a/sequencer.c > +++ b/sequencer.c > @@ -1840,7 +1840,7 @@ static int is_fixup_flag(enum todo_command command, unsigned flag) > static void add_commented_lines(struct strbuf *buf, const void *str, size_t len) > { > const char *s = str; > - while (len > 0 && s[0] == comment_line_char) { > + while (starts_with_mem(s, len, comment_line_str)) { > size_t count; > const char *n = memchr(s, '\n', len); > if (!n) > @@ -2562,7 +2562,7 @@ static int parse_insn_line(struct repository *r, struct todo_item *item, > /* left-trim */ > bol += strspn(bol, " \t"); > > - if (bol == eol || *bol == '\r' || *bol == comment_line_char) { > + if (bol == eol || *bol == '\r' || starts_with_mem(bol, eol - bol, comment_line_str)) { If the strspn() call is safe (which it is, as the caller expects the string to be NUL-terminated) then you could use starts_with() here and avoid the length calculation. But that would also match comment_line_str values that contain LF, which the _mem version does not and that's better. Not sure why lines that start with CR are considered comment lines, though. > item->command = TODO_COMMENT; > item->commit = NULL; > item->arg_offset = bol - buf; > diff --git a/strbuf.c b/strbuf.c > index 7c8f582127..291bdc2a65 100644 > --- a/strbuf.c > +++ b/strbuf.c > @@ -24,6 +24,17 @@ int istarts_with(const char *str, const char *prefix) > return 0; > } > > +int starts_with_mem(const char *str, size_t len, const char *prefix) > +{ > + const char *end = str + len; > + for (; ; str++, prefix++) { > + if (!*prefix) > + return 1; > + else if (str == end || *str != *prefix) > + return 0; > + } > +} So this checks whether a length-limited string has a prefix given as a NUL-terminated string. I'd have called it mem_starts_with() and have expected starts_with_mem() to check a NUL-terminated string for a length-limited prefix (think !strncmp(str, prefix, prefixlen)). > + > int skip_to_optional_arg_default(const char *str, const char *prefix, > const char **arg, const char *def) > { > diff --git a/strbuf.h b/strbuf.h > index 58dddf2777..3156d6ea8c 100644 > --- a/strbuf.h > +++ b/strbuf.h > @@ -673,6 +673,7 @@ char *xstrfmt(const char *fmt, ...); > > int starts_with(const char *str, const char *prefix); > int istarts_with(const char *str, const char *prefix); > +int starts_with_mem(const char *str, size_t len, const char *prefix); > > /* > * If the string "str" is the same as the string in "prefix", then the "arg" > diff --git a/trailer.c b/trailer.c > index fe18faf6c5..f59c90b4b5 100644 > --- a/trailer.c > +++ b/trailer.c > @@ -882,7 +882,7 @@ static size_t find_trailer_block_start(const char *buf, size_t len) > > /* The first paragraph is the title and cannot be trailers */ > for (s = buf; s < buf + len; s = next_line(s)) { > - if (s[0] == comment_line_char) > + if (starts_with_mem(s, buf + len - s, comment_line_str)) > continue; > if (is_blank_line(s)) Another case where starts_with() would be safe to use, as is_blank_line() expects (and gets) a NUL-terminated string, but it would allow matching comment_line_str values that contain LF. > break; > @@ -902,7 +902,7 @@ static size_t find_trailer_block_start(const char *buf, size_t len) > const char **p; > ssize_t separator_pos; > > - if (bol[0] == comment_line_char) { > + if (starts_with_mem(bol, buf + end_of_title - bol, comment_line_str)) { We're in the same buffer, so the above comment applies here as well. > non_trailer_lines += possible_continuation_lines; > possible_continuation_lines = 0; > continue;