Re: [PATCH v10 7/9] convert: check for detectable errors in UTF encodings

Lars Schneider <larsxschneider@xxxxxxxxx> · Fri, 9 Mar 2018 18:02:46 +0100

> On 07 Mar 2018, at 19:04, Eric Sunshine <sunshine@xxxxxxxxxxxxxx> wrote:
> 
> On Wed, Mar 7, 2018 at 12:30 PM,  <lars.schneider@xxxxxxxxxxxx> wrote:
>> Check that new content is valid with respect to the user defined
>> 'working-tree-encoding' attribute.
>> 
>> Signed-off-by: Lars Schneider <larsxschneider@xxxxxxxxx>
>> ---
>> diff --git a/convert.c b/convert.c
>> @@ -266,6 +266,58 @@ static int will_convert_lf_to_crlf(size_t len, struct text_stat *stats,
>> +static int validate_encoding(const char *path, const char *enc,
>> +                     const char *data, size_t len, int die_on_error)
>> +{
>> +       /* We only check for UTF here as UTF?? can be an alias for UTF-?? */
>> +       if (startscase_with(enc, "UTF")) {
>> +               /*
>> +                * Check for detectable errors in UTF encodings
>> +                */
>> +               if (has_prohibited_utf_bom(enc, data, len)) {
>> +                       const char *error_msg = _(
>> +                               "BOM is prohibited in '%s' if encoded as %s");
>> +                       /*
>> +                        * This advice is shown for UTF-??BE and UTF-??LE encodings.
>> +                        * We cut off the last two characters of the encoding name
>> +                        # to generate the encoding name suitable for BOMs.
> 
> s/#/*/

Of course!
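
For readers following along, the BOM check and the "cut off the last two
characters" step described in the comment above can be sketched roughly as
below. This is only an illustration, not the code from the patch: the names
sketch_has_prohibited_bom() and sketch_bom_encoding_name() are hypothetical,
and the sketch assumes hyphenated encoding names only (it does not handle
the UTF?? alias spelling).

```c
#include <stddef.h>
#include <string.h>
#include <strings.h>	/* strcasecmp() (POSIX) */

/* Byte order marks for the UTF-16 and UTF-32 families. */
static const char utf16_be_bom[] = {'\xFE', '\xFF'};
static const char utf16_le_bom[] = {'\xFF', '\xFE'};
static const char utf32_be_bom[] = {'\0', '\0', '\xFE', '\xFF'};
static const char utf32_le_bom[] = {'\xFF', '\xFE', '\0', '\0'};

/* Nonzero if the first bom_len bytes of data equal bom. */
static int has_bom_prefix(const char *data, size_t len,
			  const char *bom, size_t bom_len)
{
	return data && len >= bom_len && !memcmp(data, bom, bom_len);
}

/*
 * Hypothetical helper: nonzero if enc names an explicit byte order
 * (..BE or ..LE) and data nevertheless starts with a BOM.
 * Assumption: only hyphenated names are recognized here.
 */
int sketch_has_prohibited_bom(const char *enc, const char *data, size_t len)
{
	if (!strcasecmp(enc, "UTF-16BE") || !strcasecmp(enc, "UTF-16LE"))
		return has_bom_prefix(data, len, utf16_be_bom, sizeof(utf16_be_bom)) ||
		       has_bom_prefix(data, len, utf16_le_bom, sizeof(utf16_le_bom));
	if (!strcasecmp(enc, "UTF-32BE") || !strcasecmp(enc, "UTF-32LE"))
		return has_bom_prefix(data, len, utf32_be_bom, sizeof(utf32_be_bom)) ||
		       has_bom_prefix(data, len, utf32_le_bom, sizeof(utf32_le_bom));
	return 0;
}

/*
 * Hypothetical helper: copy enc minus its trailing "BE"/"LE" into out,
 * giving the encoding name for which a BOM is acceptable.
 * Returns 0 on success, -1 if enc is too short or out is too small.
 */
int sketch_bom_encoding_name(const char *enc, char *out, size_t out_size)
{
	size_t n = strlen(enc);

	if (n < 2 || n - 2 >= out_size)
		return -1;
	memcpy(out, enc, n - 2);
	out[n - 2] = '\0';
	return 0;
}
```

With these, "UTF-16BE" content that starts with FE FF would be flagged, and
the advice could then name "UTF-16" as the encoding to use instead.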

>> diff --git a/t/t0028-working-tree-encoding.sh b/t/t0028-working-tree-encoding.sh
>> @@ -62,6 +62,46 @@ test_expect_success 'check $GIT_DIR/info/attributes support' '
>> for i in 16 32
>> do
>> +       test_expect_success "check prohibited UTF-${i} BOM" '
>> +               test_when_finished "git reset --hard HEAD" &&
>> +
>> +               echo "*.utf${i}be text working-tree-encoding=utf-${i}be" >>.gitattributes &&
>> +               echo "*.utf${i}le text working-tree-encoding=utf-${i}le" >>.gitattributes &&
> 
> v10 is checking only hyphenated lowercase encoding name; earlier
> versions checked uppercase. For better coverage, it would be nice to
> check several combinations: all uppercase, all lowercase, mixed case,
> hyphenated, not hyphenated.
> 
> I'm not suggesting running all the tests repeatedly but rather just
> varying the format of the encoding name in these tests you're adding.
> For instance, the above could instead be:
> 
>    echo "*.utf${i}be text working-tree-encoding=UTF-${i}be" >>.gitattributes &&
>    echo "*.utf${i}le text working-tree-encoding=utf${i}LE" >>.gitattributes &&
> 
> or something.

The casing is a good idea; I will do that. I don't want to vary "hyphenated,
not hyphenated", though, as that would make the tests fail on macOS (and, I
believe, on Windows).
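
In case it is useful, here is a small probe for checking which spellings a
given platform accepts. It is a sketch under the assumption that the
failures come from the platform's iconv not recognizing the unhyphenated
names; sketch_iconv_knows_encoding() is a hypothetical helper, not something
in the patch. Depending on the platform it may need to be linked with
-liconv.

```c
#include <iconv.h>

/*
 * Hypothetical helper: returns 1 if the platform's iconv accepts `name`
 * as a source encoding for conversion to UTF-8, 0 otherwise.
 */
int sketch_iconv_knows_encoding(const char *name)
{
	iconv_t cd = iconv_open("UTF-8", name);

	if (cd == (iconv_t)-1)
		return 0;
	iconv_close(cd);
	return 1;
}
```

Calling it with, say, "UTF-16LE" and "utf16le" on each platform would show
whether both spellings are safe to use in the tests there.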

Thanks,
Lars