Re: [RFC 0/1] Leading whitespace as a function identification heuristic?

Phillip Wood <phillip.wood123@xxxxxxxxx> · Fri, 25 Sep 2020 10:11:33 +0100

On 24/09/2020 23:01, Ryan Zoeller wrote:
On 9/24/20 4:17 PM, Jeff King wrote:

If I understand the problem correctly, I don't think a static regex can
accomplish this, because you need context from the original line. E.g.,
consider something like this:

    void foo() {
	void bar() {
		...code 1...
	}
	...code 2...
    }

If I've understood correctly when ...code 2... contains changes that are 
themselves indented then we'll pick the wrong function header for example

	void foo() {
		void bar() {
			...code 1...
		}
		for (...) {
			// if this line is changed we pick bar rather
			// than foo because it is the first function
			// header with a smaller indentation than the
			// first changed line.
		}
	}

Unfortunately I suspect code like that is not uncommon and the diff 
would regress with this simple heuristic. It might be possible to 
recalculate the required indentation as we walk backwards up the file 
though, so when we hit the "for" line we reduce the maximum indentation 
allowed for a match and so skip "bar" as a function header.

Best Wishes

Phillip

If we change one of the lines of code, then we find the function header
by walking backwards up the lines, evaluating a regex for each line. But
for "code 2", how do we know to keep walking past bar() and up to foo()?
Or more specifically, what is different when evaluating a change from
"code 2" that is different than when we would evaluate "code 1"?

If the only input to the question of "is this line a function header" is
the regex from the config, then changes to either line of code must
produce the same answer (either bar() if we allow leading whitespace, or
foo() if we do not).

So I think Ryan's proposal is to give that search an extra piece of
information: the indentation of the original changed line. Which is
enough to differentiate the two cases.

You've explained this better than I could have. Thanks.

If I understand the patch correctly, it is always picking the first line
where indentation is lessened (and which matches the funcname pattern).
That works out of the box with existing funcname patterns, which is
nice. Though I wonder if there are cases where the funcname regex could
use the information more flexibly (i.e., some marker in the regex that
means "match less than this many spaces" or something).

My original intent was to work with existing funcname expressions
without modifications. Some of the funcname regexes are rather
impenetrable at first glance, and not requiring modifications seemed
like an easy win.

Especially for funcname patterns specified by a user, I assumed it
would be easier to set an additional configuration option than
rewrite an existing regex to handle this complexity.

I do agree that this should not be on automatically for all languages.
Some file formats may want to show a header that's at the same
indentation as the change. Adding a diff.foo.funcnameIndentation config
option would be one solution. But requiring the funcname pattern to make
explicit use of the information is another (and would allow a format to
match some things at one indentation level and some at another; but
again, I'm hand-waving a bit on whether this level of flexibility is
even useful)
If the configuration option is implemented correctly (i.e. as an enum
rather than a binary toggle), I think we could leave the door open for
a more flexible approach in the future, without needing to answer how
useful that flexibility is now. I couldn't think of any situations
requiring this flexibility, but that doesn't mean they don't exist.

Ryan