Re: [PATCH 01/14] numparse: new module for parsing integral numbers

Eric Sunshine <sunshine@xxxxxxxxxxxxxx> · Fri, 20 Mar 2015 04:54:42 -0400

On Wed, Mar 18, 2015 at 6:47 PM, Michael Haggerty <mhagger@xxxxxxxxxxxx> wrote:
> On 03/18/2015 07:27 PM, Eric Sunshine wrote:
>> On Tuesday, March 17, 2015, Michael Haggerty <mhagger@xxxxxxxxxxxx> wrote:
>>> Implement wrappers for strtol() and strtoul() that are safer and more
>>> convenient to use.
>>> + * The lowest 6 bits of flags hold the numerical base that should be
>>> + * used to parse the number, 2 <= base <= 36. If base is set to 0,
>>> + * then NUM_BASE_SPECIFIER must be set too; in this case, the base is
>>> + * detected automatically from the string's prefix.
>>
>> Does this restriction go against the goal of making these functions
>> convenient, even while remaining strict? Is there a strong reason for
>> not merely inferring NUM_BASE_SPECIFIER when base is 0? Doing so would
>> make it consistent with strto*l() without (I think) introducing any
>> ambiguity.
>
> I thought about doing this. If it were possible to eliminate
> NUM_BASE_SPECIFIER altogether, then there is no doubt that it would be a
> good change. But NUM_BASE_SPECIFIER also has an effect when base==16;
> namely, that an "0x" prefix, if present, is consumed. So
>
>     parse_i("0xff", 16 | NUM_BASE_SPECIFIER, &result, &endptr)
>
> gives result==255 and endptr==s+4, whereas
>
>     parse_i("0xff", 16, &result, &endptr)
>
> gives result==0 and entptr==s+1 (it treats the "x" as the end of the
> string).
>
> We could forgo that feature, effectively allowing a base specifier if
> and only if base==0. But I didn't want to rule out allowing an optional
> base specifier for base==16, in which case NUM_BASE_SPECIFIER can't be
> dispensed with entirely.

Making base==0 a special case doesn't have to mean ruling out or
eliminating NUM_BASE_SPECIFIER for the base==16 case. However, a
special case base==0 would make the API a bit non-orthogonal and
require extra documentation. But for those familiar with how strto*l()
treats base==0 specially, the rule of least surprise may apply.

> If you agree with that, then the remaining question is: which policy is
> less error-prone? My thinking was that forcing the caller to specify
> NUM_BASE_SPECIFIER explicitly when they select base==0 will serve as a
> reminder that the two features are intertwined.

Since base==0 would unambiguously imply NUM_BASE_SPECIFIER, being able
to tersely say convert_i(s, 0, &result) would be a win from a
convenience perspective. However...

> Because another
> imaginable policy (arguably more consistent with the policy for base!=0)
> would be that
>
>     convert_i(s, 0, &result)
>
> , because it *doesn't* specify NUM_BASE_SPECIFIER, doesn't allow a base
> prefix, and therefore indirectly only allows base-10 numbers.
>
> But I don't feel strongly about this.

I don't feel strongly about it either, and can formulate arguments
either way. Assuming you stick with your current design, if the
strictness of having to specify NUM_BASE_SPECIFIER for base==0 proves
too burdensome, it can be loosened later by making base==0 a special
case.

On the other hand, if you go with Junio's suggestion of choosing names
for these functions which more closely mirror strto*l() names, then
the rule of least surprise might suggest that a special case for
base==0 has merit.

>>> + * Return 0 on success or a negative value if there was an error. On
>>> + * failure, *result and *entptr are left unchanged.
>>> + *
>>> + * Please note that if NUM_TRAILING is not set, then it is
>>> + * nevertheless an error if there are any characters between the end
>>> + * of the number and the end of the string.
>>
>> Again, on the subject of convenience, why this restriction? The stated
>> purpose of the parse_*() functions is to parse the number from the
>> front of the string and return a pointer to the first non-numeric
>> character following. As  a reader of this API, I interpret that as
>> meaning that NUM_TRAILING is implied. Is there a strong reason for not
>> inferring NUM_TRAILING for the parse_*() functions at the API level?
>> (I realize that the convert_*() functions are built atop parse_*(),
>> but that's an implementation detail.)
>
> Yes, I'd also thought about that change:
>
> * Make NUM_TRAILING private.
> * Make the current parse_*() functions private.
> * Add new parse_*(s, flags, result, endptr) functions that imply
> NUM_TRAILING (and maybe they should *require* a non-null endptr argument).
> * Change the convert_*() to not allow the NUM_TRAILING flag.
>
> This would add a little bit of code, so I didn't do it originally. But
> since you seem to like the idea too, I guess I will make the change.

The other option (as Junio also suggested) is to collapse this API
into just the four parse_*() functions. All of the call-sites you
converted in this series invoked convert_*(), but it's not clear that
having two sets of functions which do almost the same thing is much of
a win. If the default is for NUM_TRAILING to be off, and if NULL
'endptr' is allowed, then:

    parse_ul(s, 10, &result, NULL);

doesn't seem particularly burdensome compared with:

    convert_ul(s, 10, &result);

A more compact API, with only the parse_*() functions, means less to
remember and less to document, and often is more orthogonal.

The argument for dropping the convert_*() functions is strengthened if
you follow Junio's suggestion to name these functions after the
strto*l() variations.

>>> +int parse_l(const char *s, unsigned int flags,
>>> +           long *result, char **endptr);
>>
>> Do we want to perpetuate the ugly (char **) convention for 'endptr'
>> from strto*l()? Considering that the incoming string is const, it
>> seems undesirable to return a non-const pointer to some place inside
>> that string.
>
> Yes, I guess we could do some internal casting and expose endptr as
> (const char **). It would make it a bit harder if the user actually
> wants to pass a non-const string to the function and then modify the
> string via the returned endptr, but I haven't run into that pattern yet
> so let's make the change you suggested.

I don't feel strongly about this either. It always struck me as a bit
of an API wart, but I can see the convenience value of non-const
'endptr' if the caller intends to modify the string. If you rename
these functions to mirror strto*l(), then the rule of least surprise
would suggest that 'endptr' should remain non-const.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html