Re: [PATCH v8 5/7] convert: add 'working-tree-encoding' attribute

Eric Sunshine <sunshine@xxxxxxxxxxxxx> · Wed, 28 Feb 2018 16:20:49 -0500

On Tue, Feb 27, 2018 at 6:16 AM, Lars Schneider
<larsxschneider@xxxxxxxxx> wrote:
>> On 25 Feb 2018, at 08:15, Eric Sunshine <sunshine@xxxxxxxxxxxxxx> wrote:
>> On Sat, Feb 24, 2018 at 11:27 AM,  <lars.schneider@xxxxxxxxxxxx> wrote:
>> The above paragraph is giving an example of the scenario described in
>> the paragraph above it. It might, therefore, be clearer to start this
>> paragraph with "For example,...". Also, as an aid to non-Windows
>> developers, it might be helpful to introduce ".proj" files as
>> "Microsoft Visual Studio `*.proj` files...".
>
> Agreed. How about this?
>
>   For example, Microsoft Visual Studio resources files (`*.rc`) or
>   PowerShell script files (`*.ps1`) are sometimes encoded in UTF-16.
>   If you declare `*.ps1` as files as UTF-16 and you add `foo.ps1` with
>   a `working-tree-encoding` enabled Git client, then `foo.ps1` will be
>   stored as UTF-8 internally. A client without `working-tree-encoding`
>   support will checkout `foo.ps1` as UTF-8 encoded file. This will
>   typically cause trouble for the users of this file.

Since "*.proj" is mentioned a number of times in paragraphs below this
one, perhaps just stick with it instead of introducing "*.rc" and
"*.ps1", which don't necessarily add value. Alternately, if you use
.rc and .ps1, then change the remaining .proj references to follow
suit.

>>> diff --git a/convert.c b/convert.c
>>> +       } else if (is_missing_required_utf_bom(enc->name, src, src_len)) {
>>> +               const char *error_msg = _(
>>> +                       "BOM is required in '%s' if encoded as %s");
>>> +               const char *advise_msg = _(
>>> +                       "The file '%s' is missing a byte order mark (BOM). "
>>> +                       "Please use %sBE or %sLE (depending on the byte order) "
>>> +                       "as working-tree-encoding.");
>>> +               advise(advise_msg, path, enc->name, enc->name);
>>> +               if (conv_flags & CONV_WRITE_OBJECT)
>>> +                       die(error_msg, path, enc->name);
>>> +               else
>>> +                       error(error_msg, path, enc->name);
>>> +       }
>>
>> For a function which presumably is agnostic about the working-tree
>> encoding (supporting whatever iconv does), it nevertheless has
>> unusually intimate knowledge about UTF BOMs, in general, and the
>> internals of has_prohibited_utf_bom() and
>> is_missing_required_utf_bom(), in particular. It might be cleaner, and
>> would simplify this function, if all this UTF BOM-specific gunk was
>> moved elsewhere and generalized into a "validate conversion" function.
>
> Agreed! How about this?
>
>         /*
>          * If we convert to an UTF encoding, then check for common BOM errors.
>          */
>         if (!memcmp("UTF-", enc, 4))
>                 if (has_utf_bom_error(path, enc, src, src_len, die_on_error))
>                         return 0;

Better. The comment merely repeats the code, which is clear in its
intent, thus doesn't really add value (but that's a minor point).

I'd probably generalize it further to a "validate conversion" function
(which itself would have this UTF-specific knowledge), but that's a
matter of taste and can be done later when/if other types of
validation requirements arise.