Sorry for the long email. Feel free to skip over the background info and continue reading at "My solution" below. This odyssey started with a typo: $ git shortlog -snl v2.2.0..v2.3.0 14795 Junio C Hamano 1658 Jeff King 1400 Shawn O. Pearce 1109 Linus Torvalds 760 Jonathan Nieder [...] It took me a minute to realize why so many commits are listed. It turns out that "-l", which I added by accident, requires an integer argument, but it happily swallows "v2.2.0..v2.3.0" without error. This leaves no range argument, so the command traverses the entire history of HEAD. The relevant code is else if ((argcount = short_opt('l', av, &optarg))) { options->rename_limit = strtoul(optarg, NULL, 10); return argcount; } , which is broken in many ways. First of all, strtoul() is way too permissive for simple tasks like this: * It allows leading whitespace. * It allows arbitrary trailing characters. * It allows a leading sign character ('+' or '-'), even though the result is unsigned. * If the string doesn't contain a number at all, it sets its "endptr" argument to point at the start of the string but doesn't set errno. * If the value is larger than fits in an unsigned long, it returns the value clamped to the range 0..ULONG_MAX (setting errno to ERANGE). * If the value is between -ULONG_MAX and 0, it returns the positive integer with the same bit pattern, without setting errno(!) (I can hardly think of a scenario where this would be useful.) * For base=0 (autodetect base), it allows a base specifier prefix "0x" or "0" and parses the number accordingly. For base=16 it also allows a "0x" prefix. strtol() has similar foibles. The caller compounds the permissivity of strtoul() with further sins: * It does absolutely no error detection. * It assigns the return value, which is an unsigned long, to an int value. This could truncate the result, perhaps even resulting in rename_limit being set to a negative value. When I looked around, I found scores of sites that call atoi(), strtoul(), and strtol() carelessly. And it's understandable. Calling any of these functions safely is so much work as to be completely impractical in day-to-day code. git-compat-util.h has two functions, strtoul_ui() and strtol_i(), that try to make parsing integers a little bit easier. But they are only used in a few places, they hardly restrict the pathological flexibility of strtoul()/strtol(), strtoul_ui() doesn't implement clamping consistently when converting from unsigned long to unsigned int, and neither function can be used when the integer value *is* followed by other characters. My solution: the numparse module So I hereby propose a new module, numparse, to make it easier to parse integral values from strings in a convenient, safe, and flexible way. The first commit introduces the new module, and subsequent commits change a sample (a small fraction!) of call sites to use the new functions. Consider it a proof-of-concept. If people are OK with this approach, I will continue sending patches to fix other call sites. (I already have another two dozen such patches in my repo). Please see the docstrings in numparse.h in the first commit for detailed API docs. Briefly, there are eight new functions: parse_{l,ul,i,ui}(const char *s, unsigned int flags, T *result, char **endptr); convert_{l,ul,i,ui}(const char *s, unsigned int flags, T *result); The parse_*() functions are for parsing a number from the start of a string; the convert_*() functions are for parsing a string that consists of a single number. The flags argument selects not only the base of the number, but also which of strtol()/strtoul()'s many features should be allowed. The *_i() and *.ui() functions are for parsing int and unsigned int values; they are careful about how they truncate the corresponding long values. The functions all return 0 on success and a negative number on error. Here are a few examples of how these functions can be used (taken from the header of numparse.h): * Convert hexadecimal string s into an unsigned int. Die if there are any characters in s besides hexadecimal digits, or if the result exceeds the range of an unsigned int: if (convert_ui(s, 16, &result)) die("..."); * Read a base-ten long number from the front of a string, allowing sign characters and setting endptr to point at any trailing characters: if (parse_l(s, 10 | NUM_SIGN | NUM_TRAILING, &result, &endptr)) die("..."); * Convert decimal string s into a signed int, but not allowing the string to contain a '+' or '-' prefix (and thereby indirectly ensuring that the result will be non-negative): if (convert_i(s, 10, &result)) die("..."); * Convert s into a signed int, interpreting prefix "0x" to mean hexadecimal and "0" to mean octal. If the value doesn't fit in an unsigned int, set result to INT_MIN or INT_MAX. if (convert_i(s, NUM_SLOPPY, &result)) die("..."); My main questions: * Do people like the API? My main goal was to make these functions as painless as possible to use correctly, because there are so many call sites. * Is it too gimmicky to encode the base together with other options in `flags`? (I think it is worth it to avoid the need for another parameter, which callers could easily put in the wrong order.) * Am I making callers too strict? In many cases where a positive integer is expected (e.g., "--abbrev=<num>"), I have been replacing code like result = strtoul(s, NULL, 10); with if (convert_i(s, 10, &result)) die("..."); Strictly speaking, this is backwards incompatible, because the former allows things like --abbrev="+10" --abbrev=" 10" --abbrev="10 bananas" --abbrev=-18446744073709551606 (all equivalent to "--abbrev=10"), whereas the new code rejects all of them. It would be easy to change the new code to allow the first three, by adding flags NUM_PLUS, NUM_LEADING_WHITESPACE, or NUM_TRAILING respectively. But my inclination is to be strict (though I could easily be persuaded to permit the first one). * Should I submit tiny patches, each rewriting a single call site, or make changes in larger chunks (say, file by file)? For example, in revision.c there is a function handle_revision_opt() that contains about 10 call sites that should be rewritten, for 10 different options that take integers. My gut feeling is that the rewrite of "--min-age=<value>" should be done in a different patch than that of "--min-parents=<value>", especially since the invocations might use different levels of parsing strictness that might be topics of discussion on the ML. On the other hand, I feel silly bombarding the list with tons of tiny patches. * I couldn't think of any places where today's sloppy parsing could result in security vulnerabilities, but people should think about this, too. I would be especially wary of sites that call strtoul() and assign the result to an int, seemingly not considering that the result could end up negative. These patches apply to "master". They are also available on my GitHub repo [1]. Michael [1] https://github.com/mhagger/git.git, branch "numparse1" Michael Haggerty (14): numparse: new module for parsing integral numbers cacheinfo_callback(): use convert_ui() when handling "--cacheinfo" write_subdirectory(): use convert_ui() for parsing mode handle_revision_opt(): use skip_prefix() in many places handle_revision_opt(): use convert_i() when handling "-<digit>" strtoul_ui(), strtol_i(): remove functions handle_revision_opt(): use convert_ui() when handling "--abbrev=" builtin_diff(): detect errors when parsing --unified argument opt_arg(): val is always non-NULL opt_arg(): use convert_i() in implementation opt_arg(): report errors parsing option values opt_arg(): simplify pointer handling diff_opt_parse(): use convert_i() when handling "-l<num>" diff_opt_parse(): use convert_i() when handling --abbrev=<num> Makefile | 1 + builtin/update-index.c | 3 +- contrib/convert-objects/convert-objects.c | 3 +- diff.c | 55 ++++---- git-compat-util.h | 26 ---- numparse.c | 180 ++++++++++++++++++++++++++ numparse.h | 207 ++++++++++++++++++++++++++++++ revision.c | 64 ++++----- 8 files changed, 447 insertions(+), 92 deletions(-) create mode 100644 numparse.c create mode 100644 numparse.h -- 2.1.4 -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html