Hi, On Mon, 14 Dec 2009, Junio C Hamano wrote: > I am curious about and would prefer to see the story behind '6' someday. The '6' as a cut-off to the levenshtein distances we list when autocorrecting was derived in a totally scientific manner: 1) first I implemented Levenshtein-Damerau with a configurable weight of neighbor flips ("switches"), substitutions, additions and deletions, 2) next I patched the code to sort the availablecommands by their distance to the mispelt command, 3) as this lists way too much, I implemented a cut-off that was configurable by an environment variable (without any safety checks, as I did not plan to release that code anyway), 4) now comes the totally, unbelievably cunningly scientific part: I did a self-experiment! I deliberately mispelt commands in a totally random manner! 5) then I changed the code to actually output the distances so I could determine a cut-off that makes sense with my type of tyops, 6) after about 15 tries of deliberate mistakes (mostly doing what I usually do, something like "git pull" and "git log" or something like that, but watching TV, chatting on the phone _and_ cleaning the dishes at the same time), I found that 5 was too low and 7 too large. The number '6' happily coincided with the number of steps I needed to come up with the number. You see? The _perfect_ way to determine a completely arbitrary number. Actually, you probably see that I just made up that number and tested a few times, and it seemed to work reasonably well. FWIW almost the same procedure led to the weights 0, 2, 1 and 4 that you see in help.c. The weights are basically factors with which mistakes are punished: if you just confuse two adjacent letters, such as "psuh" instead of "push" (which can be quite common if you use two hands, one on the left side, and one on the right side of the keyboard, with an en-US layout so many of us use, myself included) it costs 0. If you write a different character than what you intended, the cost is 2. The idea behind it is that you're more likely to miss a key than to hit the wrong key. With the laptop I am typing this email on, it is particularly likely that I miss a key, because there are certain key combinations where only the first key triggers an input event, but the second only triggers an input event when it is _released_ after the first one. So when I type "er" real fast and happen to release the "e" key after the "r" key, no "r" appears on my screen. Okay, so the weight for adding a character must be smaller than substituting a character, but why is the cost for deletion so high? Well, I really rarely type unnecessary characters (except when writing to the Git mailing list, arguably) so those costs must be substantially higher than for typing the wrong character. My original plan was to log all my tyops into a log file and analyze those errors later, but then my initial 0, 2, 1, 4 and 6 constants worked well enough for me that I did not bother. Satisfied? Ciao, Dscho -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html