[RFC 0/1] Leading whitespace as a function identification heuristic?

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



I've recently been annoyed by git's choice of lines when invoking
`git diff --function-context`. git isn't doing a _bad_ job per se,
given that it's using regular expressions to parse nonregular languages,
but in my opinion it could do better.

The issue I'm seeing is that constructs like nested functions cause
`git diff --function-context` to stop taking lines prematurely; a line
containing a function declaration has been reached, but it's not the one
I'm expecting git to stop at. This results in some of the function being
omitted from the output of `git diff --function-context`, and also
manifests itself as the incorrect function being listed in the hunk header
in general.

A trivial example: in a Python file containing the following contents,
modifying the indicated line will result in git picking `bar()` for the
function header, rather than `foo()`. This is obviously worse for longer
functions, or code with multiple inner functions or classes.

```
def foo(xs, i):
    def bar(x):
        return x + i

    # Intentionally not using a lambda,
    # because the issue occurs for inner functions.
    # Making this comment extra long so the function name
    # doesn't get automatically included in the context.

    return map(bar, xs) # Modify this line
```

Even without properly parsing the file as code, git could use
indentation as a heuristic to determine which lines containing functions
are relevant. I've used Python as an example here, but I think the
issue and heuristic apply to many languages.

This patch implements a proof of concept for this heuristic: only
lines which are indented less than the hunk are considered eligible
to contain function definitions. It only supports `git diff`,
ignoring things like `git grep` for simplicity (and introducing
`-1`s to the diff in leiu of handling them).

While I welcome feedback on the existing implementation, I'm really
requesting commentary on the following:

1. Is this indentation-aware function identification useful, and
    generally worth pursuing further?
2. What level of granularity would be desirable for enabling this?
    Something akin to xfuncname in .gitattributes, where it can be
    enabled per path?
3. How do we handle files with a mix of tabs and spaces?

Best,
Ryan

Ryan Zoeller (1):
  xdiff: use leading whitespace in function heuristic

 grep.c            |  2 +-
 line-range.c      |  2 +-
 xdiff-interface.c | 14 +++++++++++++-
 xdiff/xdiff.h     |  2 +-
 xdiff/xemit.c     | 23 +++++++++++++++++------
 5 files changed, 33 insertions(+), 10 deletions(-)

-- 
2.28.0.586.g47c91ef7fe






[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux