Re: [PATCH/RFC v1 1/1] Support working-tree-encoding "UTF-16LE-BOM"

Philip Oakley <philipoakley@xxxxxxxxxxxx> · Sat, 29 Dec 2018 17:54:49 +0000

(adding Brian as cc who was in the original thread)

On 29/12/2018 15:48, Adrián Gimeno Balaguer wrote:
Hello again.

I appreciate the grown interest in this issue.

Torsten, may I know what is the benefit on your code? My PR solved it
by only tweaking the utf8.c's function 'has_prohibited_utf_bom', which
is likely the shortest way:

https://github.com/git/git/pull/550/files

My main complaint with the PR would be the lack of documentation updates.

As the discussion has highlighted, whatever our solution, we will need 
to tell the users in plain and simple terms which parts of which 
standards are being used, and why we need to be somehow 'different'.

That is because a revision control system must be able to recover the 
original, for use in the original software tool, not just interpret it 
is some alternate form. The standards generally abdicate responsibility 
for the last step ;-)

I did not fully understand the conversion process you proposed, as I 
assumed(?) that on receipt of the source file, the Git conversion to 
utf-8 would convert the 16-bit BOM to the three byte utf-8 BOM byte 
sequence `EF BB BF` which has lost any knowledge of the original BE/LE 
coding.

Or, are we saying that the the 16-bit BOM is being interpreted as, a) 
the BE/LE indicator and b) a genuine "ZERO WIDTH NON-BREAKING SPACE" 
which is stored as the two byte utf-8 character code, again loosing 
(once stored as a blob object) the BE/LE indication.

Or, we see the BOM, note the endianness and then loose the BOM character 
when converting to utf-8. My ignorance of this step is starting to show. 
Regular users are probably even more confused, hence my hope for some 
documentation.

Given the above confusions, and many more when exploring the internet, 
the provision of a new, extra, clear, name for the encoding, as 
suggested by Torsten does offer an advantage in that it explicitly 
(rather than implicitly) makes plain what we are trying to do, without 
squeezing it in 'under the radar'.

That said, assuming an appropriate internal utf-8 Git coding that does 
remember the BE/LE state [if so how?] then the PR is a neat trick.

Torsten's patch also suffers from the lack of user facing documentation.

In order to make sure everything is clear, here is a case list of
current Git behaviour and new one after my PR, regarding this issue.

Current behaviour:

- Placing 'test.txt working-tree-encoding=UTF-16' for a new test.txt
file with either UTF-16 BE or LE BOM, and comitting everything -> The
file gets re-encoded from UTF-8 (as stored internally), to UTF-16 and
the default system/libiconv endianness -> Problem (as long as user
required the opposite endianness for any reason on his project). As a
note, user can see however human-readable diffs on that file.

- Placing  'test.txt working-tree-encoding=UTF-16LE' or 'test.txt
working-tree-encoding=UTF-16BE' for a new test.txt file with either
UTF-16 BE or LE BOM, and comitting everything: we assume user is doing
this because he requires that exact endianness, thus he writes it in
order to attempt preserving it -> Git prohibites commiting it, also no
human-readable diff is shown in the diff viewer/tool being used, but
file is simply shown as binary.

New behaviour:

-  Just got too lazy to repeat it all over, read my PR description:
https://github.com/git/git/pull/550

"In this PR: Git only prohibites the opposite BOM than the one in 
working-tree-encoding (e.g. if declared LE, then it denies BE BOM 
presence within the associated file, of the declared UTF-16/UTF-32). 
This way the user can now make Git operations which were previously 
impossible, with the only requisite being to match the endianness of 
working-tree-encoding attribute with the associated file/s."

- Git translations may need to be tweaked to in order to be consistent
with new behaviour.

Thanks for your attention.

--
Philip