Re: [PATCH v9 6/8] convert: check for detectable errors in UTF encodings

Junio C Hamano <gitster@xxxxxxxxx> · Mon, 05 Mar 2018 17:23:32 -0800

Lars Schneider <larsxschneider@xxxxxxxxx> writes:

>> On 05 Mar 2018, at 22:50, Junio C Hamano <gitster@xxxxxxxxx> wrote:
>> 
>> lars.schneider@xxxxxxxxxxxx writes:
>> 
>>> +static int validate_encoding(const char *path, const char *enc,
>>> +		      const char *data, size_t len, int die_on_error)
>>> +{
>>> +	if (!memcmp("UTF-", enc, 4)) {
>> 
>> Does the caller already know that enc is sufficiently long that
>> using memcmp is safe?
>
> No :-(
>
> Would you be willing to squash that in?
>
>     if (strlen(enc) > 4 && !memcmp("UTF-", enc, 4)) {
>
> I deliberately used "> 4" as plain "UTF-" is not even valid.

I'd rather not.  The code does not have to even look at 6th and
later bytes in the enc[] even if it wanted to reject "UTF-" followed
by nothing, but use of strlen() forces it to look at everything.

Stepping back, shouldn't

	if (starts_with(enc, "UTF-"))

be sufficient?  If you really care about the case where "UTF-" alone
comes here, you could write

	if (starts_with(enc, "UTF-") && enc[4])

but I do not think "&& enc[4]" is even needed.  The functions called
from this block would not consider "UTF-" alone as something valid
anyway, so with that "&& enc[4]" we would be piling more code only
for an invalid/rare case.
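
To make the difference concrete, here is a minimal standalone sketch.
It assumes a local stand-in for starts_with(), since the real helper
lives in git's own sources; the names starts_with_sketch() and
has_utf_prefix_strict() are hypothetical and exist only for this
illustration.  The point is that the prefix walk stops at the first
mismatch or at the NUL in enc, so it never reads past a short string
the way memcmp(..., 4) may, and never scans the whole string the way
strlen() does.

```c
#include <stddef.h>

/*
 * Hypothetical stand-in for git's starts_with(): returns 1 if str
 * begins with prefix, 0 otherwise.  It compares one byte at a time
 * and stops at the first mismatch, so a str shorter than prefix is
 * rejected at its terminating NUL without reading beyond it.
 */
static int starts_with_sketch(const char *str, const char *prefix)
{
	for (; ; str++, prefix++) {
		if (!*prefix)
			return 1;	/* whole prefix matched */
		if (*str != *prefix)
			return 0;	/* mismatch, or str ended early */
	}
}

/*
 * Hypothetical stricter variant that also rejects a bare "UTF-".
 * enc[4] is only evaluated after the prefix matched, so the first
 * four bytes are known to be non-NUL and the access is in bounds.
 */
static int has_utf_prefix_strict(const char *enc)
{
	return starts_with_sketch(enc, "UTF-") && enc[4] != '\0';
}
```

With this, an enc of "UT" or "" is rejected without touching memory
past its NUL, and "UTF-" alone passes the plain check but fails the
strict one.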