Re: [PATCH v10 7/9] convert: check for detectable errors in UTF encodings

Lars Schneider <larsxschneider@xxxxxxxxx> · Wed, 7 Mar 2018 23:12:38 +0100

> On 07 Mar 2018, at 20:49, Junio C Hamano <gitster@xxxxxxxxx> wrote:
> 
> lars.schneider@xxxxxxxxxxxx writes:
> 
>> +static int validate_encoding(const char *path, const char *enc,
>> +		      const char *data, size_t len, int die_on_error)
>> +{
>> +	/* We only check for UTF here as UTF?? can be an alias for UTF-?? */
>> +	if (startscase_with(enc, "UTF")) {
>> +		/*
>> +		 * Check for detectable errors in UTF encodings
>> +		 */
>> +		if (has_prohibited_utf_bom(enc, data, len)) {
>> +			const char *error_msg = _(
>> +				"BOM is prohibited in '%s' if encoded as %s");
>> +			/*
>> +			 * This advice is shown for UTF-??BE and UTF-??LE encodings.
>> +			 * We cut off the last two characters of the encoding name
>> +			 # to generate the encoding name suitable for BOMs.
>> +			 */
> 
> Yuck.  The code pretends to abstract away the details in a helper
> has_prohibited_x() yet the caller still knows quite a lot.

True, but has_prohibited_x() cannot create a proper error/advise
message unless we give it more parameters (e.g. path name).
Therefore, I don't see a better way right now.

>> +			const char *advise_msg = _(
>> +				"The file '%s' contains a byte order "
>> +				"mark (BOM). Please use %s as "
>> +				"working-tree-encoding.");
>> +			char *upper_enc = xstrdup_toupper(enc);
>> +			upper_enc[strlen(upper_enc)-2] = '\0';
>> +			advise(advise_msg, path, upper_enc);
>> +			free(upper_enc);
> 
> I think this up-casing is more problematic than without, not from
> the point of view of the internal code, but from the point of view
> of the end user experience.  When the user writes utf16le or
> utf-16le and the data does not trigger the BOM check, we are likely
> to successfully convert it.  I do not see the merit of suggesting
> UTF16 or UTF-16 in such a case, over telling them to just drop the
> byte-order suffix from the encoding names (i.e. utf16 or utf-16).
> 
> If you are trying to force/nudge people in the direction of
> canonical way of spelling things (which may not be a bad idea), then
> "utf16le" as the original input would want to result in "UTF-16"
> with dash in the advise, no?

Correct. In the error messages I kept the encoding name "as-is" and
only in the advise message I used the uppercase variant to steer
people into the canonical direction. My initial reason for this was
that in is_missing_required_utf_bom() we add "BE/LE" to the encoding
in the advise message. Let's say the user used "Utf-16" as encoding.
 Should "BE/LE" be upper case or lower case? To avoid that question 
I made it always upper case.

I also would have liked to advise "UTF-16" instead of "UTF16" as
you suggested. However, that required a few more lines and I wanted
to keep the change to a minimum. I feel this could be added in a
follow up patch.

> On the other hand, if we are not enforcing such a policy decision
> but merely explaining a way to work around this check, then it may
> be better to give a variant with the smaller difference from the
> original (i.e. without up-casing).

See example mentioned above: "Utf-16". How would you handle that?

Thanks,
Lars