Re: [PATCH] help.autocorrect: do not run a command if the command given is junk

Junio C Hamano <gitster@xxxxxxxxx> · Mon, 14 Dec 2009 13:47:28 -0800

Johannes Schindelin <Johannes.Schindelin@xxxxxx> writes:

> Satisfied?

Very much.

    FWIW almost the same procedure led to the weights 0, 2, 1 and 4 that you 
    see in help.c.  The weights are basically factors with which mistakes are 
    punished: if you just confuse two adjacent letters, such as "psuh" instead 
    of "push" (which can be quite common if you use two hands, one on the left 
    side, and one on the right side of the keyboard, with the en-US layout
    that so many of us use, myself included) it costs 0.

    If you write a different character than what you intended, the cost is 2.  
    The idea behind it is that you're more likely to miss a key than to hit 
    the wrong key.  With the laptop I am typing this email on, it is 
    particularly likely that I miss a key, because there are certain 
    key combinations where only the first key triggers an input event, but the 
    second only triggers an input event when it is _released_ after the first 
    one.  So when I type "er" real fast and happen to release the "e" key 
    after the "r" key, no "r" appears on my screen.

    Okay, so the weight for adding a character must be smaller than 
    substituting a character, but why is the cost for deletion so high?  
    Well, I really rarely type unnecessary characters (except when writing to 
    the Git mailing list, arguably) so those costs must be substantially 
    higher than for typing the wrong character.

These are actually very good justifications in the sense that people who
might want to tweak the heuristics can see the reason behind the current
choice and agree or disagree with it.

I somehow suspect that a good mathematician can come up with a rationale
for 6 after the fact that sounds convincing, along the lines of "the
average length of commands being N, and Levenshtein penalties being
<0,2,1,4>, you can insert X mistaken keystrokes and/or omit Y keystrokes
per correct keystroke without exceeding this value 6, and the percentages
X and/or Y represent are not too low to be practical but low enough to
reject false positives".
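
To make the penalties concrete, here is a minimal sketch of a weighted
edit distance that takes the four weights in the order <swap,
substitution, addition, deletion>, as the explanation above describes
them.  It is not the code in git; the name weighted_distance(), the
MAX_LEN limit and the full-matrix layout are assumptions made for
illustration only.  "Addition" here means the typed string is missing a
character of the candidate, and "deletion" means the typed string has an
extra one.

```c
#include <string.h>

#define MAX_LEN 64	/* assumed limit, for this sketch only */

/* the cutoff from the patch in this message */
#define SIMILAR_ENOUGH(x) ((x) < 6)

/*
 * Hypothetical helper: cost of turning "typed" into "candidate".
 *   swap  - two adjacent characters transposed ("psuh" for "push")
 *   subst - a wrong character typed in place of the intended one
 *   add   - a character missing from "typed" (a missed key)
 *   del   - an extra character present in "typed"
 * Returns -1 if either string is longer than MAX_LEN.
 */
static int weighted_distance(const char *typed, const char *candidate,
			     int swap, int subst, int add, int del)
{
	static int cost[MAX_LEN + 1][MAX_LEN + 1];
	size_t n = strlen(typed), m = strlen(candidate);
	size_t i, j;

	if (n > MAX_LEN || m > MAX_LEN)
		return -1;

	/* an empty typed prefix needs j additions to reach the candidate */
	for (j = 0; j <= m; j++)
		cost[0][j] = (int)j * add;

	for (i = 1; i <= n; i++) {
		/* an empty candidate prefix needs i deletions */
		cost[i][0] = (int)i * del;
		for (j = 1; j <= m; j++) {
			/* match or substitution */
			int best = cost[i - 1][j - 1] +
				(typed[i - 1] != candidate[j - 1] ? subst : 0);

			/* adjacent transposition */
			if (i > 1 && j > 1 &&
			    typed[i - 2] == candidate[j - 1] &&
			    typed[i - 1] == candidate[j - 2] &&
			    cost[i - 2][j - 2] + swap < best)
				best = cost[i - 2][j - 2] + swap;

			/* drop an extra typed character */
			if (cost[i - 1][j] + del < best)
				best = cost[i - 1][j] + del;

			/* supply a missed character */
			if (cost[i][j - 1] + add < best)
				best = cost[i][j - 1] + add;

			cost[i][j] = best;
		}
	}
	return cost[n][m];
}
```

With the weights <0,2,1,4>, this sketch gives 0 for "psuh" against
"push", 1 for "pus", 2 for "posh" and 4 for "pushh".  Under the same
assumptions, a cutoff of "less than 6" leaves room for up to five missed
keys, two wrong keys or a single extra key, while swaps are free.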

In any case, I'll further squash in the following.  Thanks for an amusing
explanation ;-).

diff --git a/help.c b/help.c
index fbf80d9..de1e2ea 100644
--- a/help.c
+++ b/help.c
@@ -297,7 +297,7 @@ static void add_cmd_list(struct cmdnames *cmds, struct cmdnames *old)
 	old->names = NULL;
 }
 
-/* how did we decide this is a good cutoff??? */
+/* An empirically derived magic number */
 #define SIMILAR_ENOUGH(x) ((x) < 6)
 
 const char *help_unknown_cmd(const char *cmd)