Re: [PATCH v2 4/4] diff.c: convert builtin funcname patterns to non-GNU extended regex syntax

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Junio C Hamano <gitster@xxxxxxxxx> writes:

> static const struct funcname_pattern_entry builtin_funcname_pattern[] = {
> 	{ "bibtex", "(@[a-zA-Z]{1,}[ \t]*\{{0,1}[ \t]*[^ \t\"@',\\#}{~%]*).*$",
> 	  REG_EXTENDED },
> 	{ "html", "^[ \t]*(<[Hh][1-6][ \t].*>.*)$", REG_EXTENDED },
> 	{ "java",
> 	  "!^[ \t]*(catch|do|for|if|instanceof|new|return|switch|throw|while)\n"
> 	  "^[ \t]*(([ \t]*[A-Za-z_][A-Za-z_0-9]*){2,}[ \t]*\\([^;]*)$",
> 	  REG_EXTENDED },
> 	{ "pascal",
> 	  "^((procedure|function|constructor|destructor|interface|"
> 		"implementation|initialization|finalization)[ \t]*.*)$"
> 	  "|"
> 	  "^(.*=[ \t]*(class|record).*)$",
> 	  REG_EXTENDED },
> 	{ "php", "^[\t ]*((function|class).*)", REG_EXTENDED },
> 	{ "python", "^[ \t]*((class|def)[ \t].*)$", REG_EXTENDED },
> 	{ "ruby", "^[ \t]*((class|module|def)[ \t].*)$",
> 	  REG_EXTENDED },
> 	{ "tex",
> 	  "^(\\\\((sub)*section|chapter|part)\\*{0,1}\{.*)$",
> 	  REG_EXTENDED },
> };

On the "Objective-C hunk header" topic, I mentioned that I wondered how a
construct like:

	"( A | B ) | ( C | D )"

could possibly work, given that xdiff-interface.c::ff_regexp() uses a
regmatch_t array with only two elements to capture (which means it can
only capture $0 and $1).

It turns out that the function is not really prepared to do this
properly.  In order to cope with a pattern without _any_ capturing
parentheses in the regexp, it has a trick to use $0 if $1 is undefined.

In other words, if you write:

	"junk ( A | B ) | garbage ( C | D )"

then lines "junk A", "junk B", "garbage C" and "garbage D" would all match
and be on the hunk header line.  However, "garbage" is not stripped out
(while "junk" does get left out of the capture).

This happens not to be a problem with any of the above built-in patterns.
The risky one is "pascal", but it's $2 captures the entire line anyway, so
it does not make any difference if the code used $0 instead.  E.g.  a line
that matches the second alternative, "foo = class bar", will be captured
as $0 and this is the same as (un)captured $2.

But it would be a good idea to fix this while our attention to the issue
is still fresh.  As we discussed earlier, we can replace the top-level "|"
with "\n" and fix the semantics of multiple positive regexp to grab $1 (or
$0 if $1 is unavaialble) of the first positive one.  If we did so, when
the pattern

	"junk ( A | B ) | garbage ( C | D )"

matches a line "garbage C", the initial garbage part will not be included
inside the capture.

Here is such a proposed fix.

-- >8 --
diff: fix "multiple regexp" semantics to find hunk header comment

When multiple regular expressions are concatenated with "\n", they were
traditionally AND'ed together, and only a line that matches _all_ of them
were taken as a match.  This however is unwieldy when multiple regexp
feature is used to specify alternatives.

This fixes the semantics to take the first match.  A nagative pattern, if
matches, makes the line to fail.  A positive pattern, if matches, will be
the final match and what it captures in $1 is used as the hunk header
comment.

We could write alternatives using "|" in ERE, but the machinery can only
use captured $1 as the hunk header comment (or $0 if there is no match in
$1), so you cannot write:

    "junk ( A | B ) | garbage ( C | D )"

and expect both "junk" and "garbage" to get stripped with the existing
code.  With this fix, you can write it as:

    "junk ( A | B ) \n garbage ( C | D )"

and the way capture works would match the user expectation more
naturally.

Signed-off-by: Junio C Hamano <gitster@xxxxxxxxx>
---
 diff.c            |    2 +-
 xdiff-interface.c |   17 ++++++++++-------
 2 files changed, 11 insertions(+), 8 deletions(-)

diff --git c/diff.c w/diff.c
index a733010..1bcbbd5 100644
--- c/diff.c
+++ w/diff.c
@@ -1414,7 +1414,7 @@ static const struct funcname_pattern_entry builtin_funcname_pattern[] = {
 	{ "pascal",
 	  "^((procedure|function|constructor|destructor|interface|"
 		"implementation|initialization|finalization)[ \t]*.*)$"
-	  "|"
+	  "\n"
 	  "^(.*=[ \t]*(class|record).*)$",
 	  REG_EXTENDED },
 	{ "php", "^[\t ]*((function|class).*)", REG_EXTENDED },
diff --git c/xdiff-interface.c w/xdiff-interface.c
index 7f1a7d3..6c6bb19 100644
--- c/xdiff-interface.c
+++ w/xdiff-interface.c
@@ -194,26 +194,29 @@ static long ff_regexp(const char *line, long len,
 	char *line_buffer = xstrndup(line, len); /* make NUL terminated */
 	struct ff_regs *regs = priv;
 	regmatch_t pmatch[2];
-	int result = 0, i;
+	int i;
+	int result = -1;
 
 	for (i = 0; i < regs->nr; i++) {
 		struct ff_reg *reg = regs->array + i;
-		if (reg->negate ^ !!regexec(&reg->re,
-					line_buffer, 2, pmatch, 0)) {
-			free(line_buffer);
-			return -1;
+		if (!regexec(&reg->re, line_buffer, 2, pmatch, 0)) {
+			if (reg->negate)
+				goto fail;
+			break;
 		}
 	}
+	if (regs->nr <= i)
+		goto fail;
 	i = pmatch[1].rm_so >= 0 ? 1 : 0;
 	line += pmatch[i].rm_so;
 	result = pmatch[i].rm_eo - pmatch[i].rm_so;
 	if (result > buffer_size)
 		result = buffer_size;
 	else
-		while (result > 0 && (isspace(line[result - 1]) ||
-					line[result - 1] == '\n'))
+		while (result > 0 && (isspace(line[result - 1])))
 			result--;
 	memcpy(buffer, line, result);
+ fail:
 	free(line_buffer);
 	return result;
 }
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux