On Tue, Sep 06, 2016 at 06:02:59PM +0200, Johannes Schindelin wrote: > It will still be quite tricky, because we have to touch a function that is > rather at the bottom of the food chain: diff_populate_filespec() is called > from fill_textconv(), which in turn is called from pickaxe_match(), and > only pickaxe_match() knows whether we want to call regexec() or not (it > depends on its regexp parameter). > > Adding a flag to diff_populate_filespec() sounds really reasonable until > you see how many call sites fill_textconv() has. I was thinking of something quite gross, like a global "switch to using slower-but-safer NUL termination" flag (but I agree with Junio's point elsewhere that we do not even know if it is "slower"). > > I thought that operated on the diff content itself, which would always > > be in a heap buffer (which should be NUL terminated, but if it isn't, > > that would be a separate fix from this). > > That is true. > > Except when preimage or postimage does not exist. In which case we call > > regexec(regexp, two->ptr, 1, ®match, 0); > > or the same with one->ptr. Note the notable absence of two->size. Thanks, I forgot about that case. > > [1] We do make the assumption elsewhere that git objects are > > NUL-terminated, but that is enforced by the object-reading code > > (with the exception of streamed blobs, but those are obviously dealt > > with separately anyway). > > I know. I am the reason you introduced that, because I added code to > fsck.c that assumes that tag/commit messages are NUL-terminated. Sort of. I think it has been part of the design since e871b64 (unpack_sha1_file: zero-pad the unpacked object., 2005-05-25), though I do recall that we missed a code path that did its allocation differently (in index-pack, IIRC). Anyway, that is neither here nor there for the diff code, which as you noticed may operate on things besides git objects. > So now for the better idea. > > While I was researching the code for this reply, I hit upon one thing that > I never knew existed, introduced in f96e567 (grep: use REG_STARTEND for > all matching if available, 2010-05-22). Apparently, NetBSD introduced an > extension to regexec() where you can specify buffer boundaries using > REG_STARTEND. Which is pretty much what we need. Yes, and compat/regex support this, too. My question is whether it is portable. I see: > diff --git a/diff.c b/diff.c > index 534c12e..2c5a360 100644 > --- a/diff.c > +++ b/diff.c > @@ -951,7 +951,13 @@ static int find_word_boundaries(mmfile_t *buffer, > regex_t *word_regex, > { > if (word_regex && *begin < buffer->size) { > regmatch_t match[1]; > - if (!regexec(word_regex, buffer->ptr + *begin, 1, match, > 0)) { > + int f = 0; > +#ifdef REG_STARTEND > + match[0].rm_so = 0; > + match[0].rm_eo = *end - *begin; > + f = REG_STARTEND; > +#endif > + if (!regexec(word_regex, buffer->ptr + *begin, 1, match, > f)) { What happens to those poor souls on systems without REG_STARTEND? Do they get to keep segfaulting? I think the solution is to push them into setting NO_REGEX. So looking at this versus a "regexecn", it seems: - this lets people keep using their native regexec if it supports STARTEND - this is a bit more clunky to use at the callsites (though we could _create_ a portable regexecn wrapper that uses this technique on top of the native regex library) But I much prefer this approach to copying the data just to add a NUL. -Peff