> On 07 Mar 2018, at 20:49, Junio C Hamano <gitster@xxxxxxxxx> wrote: > > lars.schneider@xxxxxxxxxxxx writes: > >> +static int validate_encoding(const char *path, const char *enc, >> + const char *data, size_t len, int die_on_error) >> +{ >> + /* We only check for UTF here as UTF?? can be an alias for UTF-?? */ >> + if (startscase_with(enc, "UTF")) { >> + /* >> + * Check for detectable errors in UTF encodings >> + */ >> + if (has_prohibited_utf_bom(enc, data, len)) { >> + const char *error_msg = _( >> + "BOM is prohibited in '%s' if encoded as %s"); >> + /* >> + * This advice is shown for UTF-??BE and UTF-??LE encodings. >> + * We cut off the last two characters of the encoding name >> + # to generate the encoding name suitable for BOMs. >> + */ > > Yuck. The code pretends to abstract away the details in a helper > has_prohibited_x() yet the caller still knows quite a lot. True, but has_prohibited_x() cannot create a proper error/advise message unless we give it more parameters (e.g. path name). Therefore, I don't see a better way right now. >> + const char *advise_msg = _( >> + "The file '%s' contains a byte order " >> + "mark (BOM). Please use %s as " >> + "working-tree-encoding."); >> + char *upper_enc = xstrdup_toupper(enc); >> + upper_enc[strlen(upper_enc)-2] = '\0'; >> + advise(advise_msg, path, upper_enc); >> + free(upper_enc); > > I think this up-casing is more problematic than without, not from > the point of view of the internal code, but from the point of view > of the end user experience. When the user writes utf16le or > utf-16le and the data does not trigger the BOM check, we are likely > to successfully convert it. I do not see the merit of suggesting > UTF16 or UTF-16 in such a case, over telling them to just drop the > byte-order suffix from the encoding names (i.e. utf16 or utf-16). > > If you are trying to force/nudge people in the direction of > canonical way of spelling things (which may not be a bad idea), then > "utf16le" as the original input would want to result in "UTF-16" > with dash in the advise, no? Correct. In the error messages I kept the encoding name "as-is" and only in the advise message I used the uppercase variant to steer people into the canonical direction. My initial reason for this was that in is_missing_required_utf_bom() we add "BE/LE" to the encoding in the advise message. Let's say the user used "Utf-16" as encoding. Should "BE/LE" be upper case or lower case? To avoid that question I made it always upper case. I also would have liked to advise "UTF-16" instead of "UTF16" as you suggested. However, that required a few more lines and I wanted to keep the change to a minimum. I feel this could be added in a follow up patch. > On the other hand, if we are not enforcing such a policy decision > but merely explaining a way to work around this check, then it may > be better to give a variant with the smaller difference from the > original (i.e. without up-casing). See example mentioned above: "Utf-16". How would you handle that? Thanks, Lars