On Fri, Oct 27 2017, Joe Perches jotted: > On Sat, 2017-10-28 at 00:11 +0200, Ævar Arnfjörð Bjarmason wrote: >> On Fri, Oct 27 2017, Joe Perches jotted: > [] >> > git grep performance has already been >> > quite successfully improved. >> >> ...and I have WIP patches to use the PCRE engine for patterns without -P >> which I intend to start sending soon after the next release. > > One addition that would be quite nice would be > an option to have regex matches span input lines. > > grep v2.54 was the last grep version that allowed > this and I keep it around just for that. > > ie: > > $ cat hello.txt > Hello > World > $ grep -P "Hello\s*World" hello.txt > $ grep-2.5.4 -P "Hello\s*World" hello.txt > Hello > World I'm unable to build 2.5.4 and can't find anything relevant in the release notes at a quick glance around that time saying that this would be removed, if you can still build it I'd be interested to see what this bisects down to in grep.git. But aside from that, a feature like this constrains the regex implementation a lot since it's going to need to either match the entire file as we'd need to do with PCRE, or we'd need to really deeply embed the core logic of the regex matcher into our grep implementation. I.e. in this case a more optimal implementation would start by parsing this regex down: ((EXACT "Hello") (STAR (POSIXU "\s")) (EXACT "World")) Then when you open the file you can start searching for the fixed-string "Hello", if you don't find that you're done, if you do you can forward look-ahead for the fixed "World", and only if you find that do you need to match the more complex part in the middle. Whereas our API for the internal regex matchers now is that we find the boundaries of newlines and batch-match a bunch of lines with a match() function that takes a string, and if that matches we drill down to what specific line matches. Which is not to say that this can't be done without a potentially unacceptable memory trade-off (i.e. matching the entire file in all cases), the PCRE2 engine in particular includes some I/O abstractions that we're not using but could (but I haven't looked into it). But right now the entire internal API we have is constrained by catering to the lowest common denominator (a regexec that takes a char*), so supporting more fancy multi-line matching features can be a PITA since we'd need to maintain both codepaths. Or we could make PCRE a hard dependency, which given the performance advantages I'm increasingly willing to make the case for.