Re: [GSOC][PATCH] userdiff: add support for Scheme

Phillip Wood <phillip.wood123@xxxxxxxxx> · Tue, 6 Apr 2021 20:10:57 +0100

Hi Atharva

On 06/04/2021 13:29, Atharva Raykar wrote:
On 05-Apr-2021, at 15:34, Phillip Wood <phillip.wood123@xxxxxxxxx> wrote:

Hi Atharva

On 30/03/2021 11:22, Atharva Raykar wrote:
On 30-Mar-2021, at 12:34, Atharva Raykar <raykar.ath@xxxxxxxxx> wrote:



On 29-Mar-2021, at 15:48, Phillip Wood <phillip.wood123@xxxxxxxxx> wrote:

Hi Atharva

On 28/03/2021 13:23, Atharva Raykar wrote:
On 28-Mar-2021, at 05:16, Johannes Sixt <j6t@xxxxxxxx> wrote:
[...]

diff --git a/t/t4018/scheme-local-define b/t/t4018/scheme-local-define
new file mode 100644
index 0000000000..90e75dcce8
--- /dev/null
+++ b/t/t4018/scheme-local-define
@@ -0,0 +1,4 @@
+(define (higher-order)
+  (define local-function RIGHT

... this one, which is also indented and *is* marked as RIGHT.
In this test case, I was explicitly testing for an indented '(define'
whereas in the former, I was testing for the top-level '(define-syntax',
which happened to have an internal define (which will inevitably show up
in a lot of scheme code).

It would be nice to include indented define forms, but including them means that any change to the body of a function is attributed to the last internal definition rather than the actual function. For example:

(define (f arg)
  (define (g x)
    (+ 1 x))

  (some-func ...)
  ;; any change here will have '(define (g x)' in the hunk header, not '(define (f arg)'
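
The effect described above can be reproduced with a rough Python sketch. It assumes the hunk header is simply the nearest earlier line that matches the function pattern; the two regexes are simplified, hypothetical stand-ins and not the pattern from the patch, and the elided call is filled in with a placeholder argument so the sample is complete.

```python
import re

# Hypothetical, simplified patterns (assumptions for illustration only).
INDENTED_OK = re.compile(r"^\s*\(define\b.*$")   # allows leading whitespace
TOP_LEVEL_ONLY = re.compile(r"^\(define\b.*$")   # must start in column 0

SOURCE = """\
(define (f arg)
  (define (g x)
    (+ 1 x))

  (some-func arg))
""".splitlines()


def hunk_header(lines, changed_index, pattern):
    """Return the nearest line above changed_index that matches pattern, or ''."""
    for i in range(changed_index - 1, -1, -1):
        if pattern.match(lines[i]):
            return lines[i].strip()
    return ""


changed = 4  # index of "  (some-func arg))", which belongs to the body of f
print(hunk_header(SOURCE, changed, INDENTED_OK))     # (define (g x)
print(hunk_header(SOURCE, changed, TOP_LEVEL_ONLY))  # (define (f arg)
```

With the indented form allowed, the change to the body of f is attributed to g; with the top-level-only pattern it is attributed to f.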

The reason I went for this over the top level forms, is because
I felt it was useful to see the nearest definition for internal
functions that often have a lot of the actual business logic of
the program (at least a lot of SICP seems to follow this pattern).
The disadvantage is as you said, it might also catch trivial inner
functions and the developer might lose context.
Never mind this message, I had misunderstood the problem you were trying to
demonstrate. I wholeheartedly agree with what you are trying to say, and
the indentation heuristic discussed does look interesting. I shall have a
glance at the RFC you linked in the other reply.
The disadvantage is as you said, it might also catch trivial inner
functions and the developer might lose context.
Feel free to disregard me misquoting you here. You did not say that (:
Another problem is it may match more trivial bindings, like:

(define (some-func things)
  ...
  (define items '(eggs
                  ham
                  peanut-butter))
  ...)

What I have noticed *anecdotally* is that this is not common enough
to be too much of a problem, and local define bindings seem to be more
favoured in Racket than in other Schemes, which use 'let' more often.

I don't think this can be avoided, as we rely on regexes rather than parsing the source, so it is probably best to only match top-level defines.

The other issue with only matching top level defines is that a
lot of scheme programs are library definitions, something like

(library
    (foo bar)
  (export ...)
  (define ...)
  (define ...)
  ;; and a bunch of other definitions...
)

Only matching top level defines will completely miss all the
definitions in these files.
That said, I still stand by the fact that only catching top level defines
will lead to a lot of definitions being ignored. Maybe the occasional
mismatch may be worth the gain in the number of function contexts being
detected?
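
To make the library case concrete, here is a small Python sketch. The patterns are simplified, hypothetical stand-ins for the userdiff regex, and the exported names and function bodies are invented for the example. It lists which lines each pattern would treat as a function header.

```python
import re

# Hypothetical, simplified patterns (assumptions, not the patch's regex).
WITH_INDENT = re.compile(r"^\s*\(define\b")
TOP_LEVEL_ONLY = re.compile(r"^\(define\b")

# Invented library body; only the overall shape follows the example above.
LIBRARY = """\
(library
    (foo bar)
  (export add1 add2)
  (define (add1 x)
    (+ 1 x))
  (define (add2 x)
    (+ 2 x))
)
""".splitlines()


def matching_lines(lines, pattern):
    """Return the stripped lines that the pattern would treat as headers."""
    return [line.strip() for line in lines if pattern.match(line)]


print(matching_lines(LIBRARY, WITH_INDENT))     # both defines
print(matching_lines(LIBRARY, TOP_LEVEL_ONLY))  # []
```

Under these assumptions the top-level-only pattern finds no header candidates at all in such a file, because every define sits inside the 'library' form and is therefore indented.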

I'm not sure that the mismatches will be occasional: every time you have an internal definition in a function, the hunk header will be wrong when you change the main body of the function. This will affect 'git grep --function-context' and 'git diff -W' as well as the normal hunk headers. The problem is that there is no way to avoid that and still provide something useful in the library example you have above. It would be useful to find some code bases and diff the output of 'git log --patch' with and without the leading whitespace match in the function pattern to see how often this is a problem (i.e. when the funcnames do not match, see which one is correct).
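
One possible way to do that comparison is sketched below in Python. It assumes the two 'git log --patch' outputs have already been captured as text, and it pairs hunks by commit, file and line range. The commit id, file name and patch contents are made up for the example, and the parsing is deliberately naive (an added line whose content begins with "++ " would be mistaken for a file header).

```python
import re

# Captures the range part of a hunk header and the funcname text after it.
HUNK_RE = re.compile(r"^(@@ -\d+(?:,\d+)? \+\d+(?:,\d+)? @@) ?(.*)$")


def hunk_funcnames(patch_text):
    """Map (commit, file, hunk range) to the funcname shown in the header."""
    headers = {}
    commit = file = None
    for line in patch_text.splitlines():
        if line.startswith("commit "):
            commit = line.split()[1]
        elif line.startswith("+++ "):
            file = line[4:]
        else:
            m = HUNK_RE.match(line)
            if m:
                headers[(commit, file, m.group(1))] = m.group(2)
    return headers


def differing_headers(patch_a, patch_b):
    """Return hunks present in both outputs whose funcnames differ."""
    a, b = hunk_funcnames(patch_a), hunk_funcnames(patch_b)
    return {k: (a[k], b[k]) for k in a if k in b and a[k] != b[k]}


# Made-up sample outputs: same hunk, different funcname in the header.
PATCH_WITH = """\
commit abc123
--- a/lib.scm
+++ b/lib.scm
@@ -10,6 +10,7 @@ (define (g x)
   (some-func arg)
+  (other-func arg)
"""
PATCH_WITHOUT = PATCH_WITH.replace("(define (g x)", "(define (f arg)")

for key, (with_ws, without_ws) in differing_headers(PATCH_WITH, PATCH_WITHOUT).items():
    print(key, "|", with_ws, "|", without_ws)
```

Each reported hunk still has to be checked by hand to decide which of the two funcnames is the correct one.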

You are right -- on trying out the function on two other Scheme
codebases, I noticed that there are a lot more wrongly matched functions
than I initially thought. About half of them identify the wrong function
in one of the repositories I tried. However, removing the leading
whitespace in the pattern did not lead to better matching; it just led
to a lot of the hunk headers going blank. I am not sure what causes this
behaviour, but my guess is that a function context is shown only if the
change is within a certain distance from the function definition?

Even if it did match only the top level defines correctly, the functions
matched would still often be technically wrong -- it will show the outer
function as the context when the user has edited an internal function
(and in Scheme, there is heavy usage of internal functions).

After running 'git grep --function-context' with the leading whitespace
removed, it seems to match too aggressively, as it captures a huge
region all the way up to the top level, especially for files where all
the definitions are in a 'library'.

Overall, I personally felt that there were more downsides to matching
only at the top level. I'd rather the hunk header have the nearest
function to provide the context, than have no function displayed at all.
Even when the match is wrong, it at least helps me locate where the
change was made more easily.

Thanks for taking the time to check the differences between the two
approaches. As there is no perfect solution, I'm happy to go with the one
that seemed to be best in your investigations.

Best Wishes

Phillip
