Re: [PATCH v10 7/9] convert: check for detectable errors in UTF encodings

Junio C Hamano <gitster@xxxxxxxxx> · Wed, 07 Mar 2018 11:49:27 -0800

lars.schneider@xxxxxxxxxxxx writes:

> +static int validate_encoding(const char *path, const char *enc,
> +		      const char *data, size_t len, int die_on_error)
> +{
> +	/* We only check for UTF here as UTF?? can be an alias for UTF-?? */
> +	if (startscase_with(enc, "UTF")) {
> +		/*
> +		 * Check for detectable errors in UTF encodings
> +		 */
> +		if (has_prohibited_utf_bom(enc, data, len)) {
> +			const char *error_msg = _(
> +				"BOM is prohibited in '%s' if encoded as %s");
> +			/*
> +			 * This advice is shown for UTF-??BE and UTF-??LE encodings.
> +			 * We cut off the last two characters of the encoding name
> +			 # to generate the encoding name suitable for BOMs.
> +			 */

Yuck.  The code pretends to abstract away the details in a helper
has_prohibited_x() yet the caller still knows quite a lot.

> +			const char *advise_msg = _(
> +				"The file '%s' contains a byte order "
> +				"mark (BOM). Please use %s as "
> +				"working-tree-encoding.");
> +			char *upper_enc = xstrdup_toupper(enc);
> +			upper_enc[strlen(upper_enc)-2] = '\0';
> +			advise(advise_msg, path, upper_enc);
> +			free(upper_enc);

I think this up-casing is more problematic than without, not from
the point of view of the internal code, but from the point of view
of the end user experience.  When the user writes utf16le or
utf-16le and the data does not trigger the BOM check, we are likely
to successfully convert it.  I do not see the merit of suggesting
UTF16 or UTF-16 in such a case, over telling them to just drop the
byte-order suffix from the encoding names (i.e. utf16 or utf-16).

If you are trying to force/nudge people in the direction of
canonical way of spelling things (which may not be a bad idea), then
"utf16le" as the original input would want to result in "UTF-16"
with dash in the advise, no?

On the other hand, if we are not enforcing such a policy decision
but merely explaining a way to work around this check, then it may
be better to give a variant with the smaller difference from the
original (i.e. without up-casing).