Re: [PATCH v2 00/40] Use ASCII subset instead of UTF-8 alternate symbols

David Woodhouse <dwmw2@xxxxxxxxxxxxx> · Sat, 15 May 2021 10:24:28 +0100

On Sat, 2021-05-15 at 10:22 +0200, Mauro Carvalho Chehab wrote:
> > >      Here, <CTRL><SHIFT>U is not working. No idea why. I haven't
> > >      tested it for *years*, as I didn't see any reason why I would
> > >      need to type UTF-8 characters by numbers until we started
> > >      this thread.
> >
> > Please provide the bug number for this; I'd like to track it.
>
> Just opened a BZ and added you as c/c.

Thanks.

> Let's take one step back, in order to return to the intents of this
> UTF-8, as the discussions here are not centered into the patches, but
> instead, on what to do and why.
>
> -
>
> This discussion started originally at linux-doc ML.
>
> While discussing about an issue when machine's locale was not set
> to UTF-8 on a build VM,

Stop. Stop *right* there before you go any further.

The machine's locale should have *nothing* to do with anything.

When you view this email, it comes with a Content-Type: header which
explicitly tells you the character set that the message is encoded in, 
which I think I've set to UTF-7.

When showing you the mail, your system has to interpret the bytes of
the content using *that* character set encoding. Anything else is just
fundamentally broken. Your system locale has *nothing* to do with it.

If your local system is running EBCDIC that doesn't *matter*.
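
To make that concrete, here is a minimal Python sketch of a reader that
honours the declared charset. The message bytes are invented for the
illustration; only the standard library's email module is used, and the
system locale is never consulted.

```python
from email import message_from_bytes

# Hypothetical message, invented for illustration: a UTF-7 body
# announced by its Content-Type header.
raw = (
    b'Content-Type: text/plain; charset="UTF-7"\r\n'
    b"Content-Transfer-Encoding: 7bit\r\n"
    b"\r\n"
    b"Stop +ACo-right+ACo there.\r\n"
)

msg = message_from_bytes(raw)

# Take the charset from the header, not from the system locale.
charset = msg.get_content_charset()

# get_payload(decode=True) undoes the transfer encoding and returns bytes;
# decoding those bytes with the declared charset yields the text.
body = msg.get_payload(decode=True).decode(charset)

print(charset)
print(body)
```

A reader that skips the decode step and shows the raw bytes would display
the "+ACo-" sequences literally instead of the asterisks.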

Now, the character set encoding of the kernel source and documentation
text files is UTF-8. It isn't EBCDIC, it isn't ISO8859-15 or any of the
legacy crap. It isn't system locale either, unless your system locale
*happens* to be UTF-8.

UTF-8 *happens* to be compatible with ASCII for the limited subset of
characters which ASCII contains, sure — just as *many*, but not all, of
the legacy 8-bit character sets are also a superset of ASCII's 7 bits.

But if the docs contain *any* characters which aren't ASCII, and you
build them with a broken build system which assumes ASCII, you are
going to produce wrong output. There is *no* substitute for fixing the
*actual* bug which started all this, and ensuring your build system (or
whatever) uses the *actual* encoding of the text files it's processing,
instead of making stupid and bogus assumptions based on a system
default.
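
A minimal Python sketch of the difference, assuming a hypothetical
one-line documentation file (the file name and its contents are made up
for the example): naming the file's real encoding works, while assuming
ASCII fails on the first non-ASCII byte.

```python
import os
import tempfile

# Hypothetical documentation snippet containing one non-ASCII character.
text = "Copyright \u00a9 2021\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.rst")
    with open(path, "wb") as f:
        f.write(text.encode("utf-8"))

    # Right: state the file's actual encoding explicitly.
    with open(path, encoding="utf-8") as f:
        good = f.read()

    # Wrong: assume ASCII regardless of what the file really contains.
    try:
        with open(path, encoding="ascii") as f:
            f.read()
        ascii_failed = False
    except UnicodeDecodeError:
        ascii_failed = True

print(repr(good))
print(ascii_failed)
```

In this sketch the ASCII assumption at least fails loudly. A tool that
assumes some 8-bit legacy charset instead would not raise an error at
all; it would silently produce wrong characters.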

You concede keeping U+00a9 © COPYRIGHT SIGN. And that's encoded in
UTF-8 as two bytes 0xC2 0xA9. If some broken build system *assumes*
those bytes are ISO8859-15 it'll take them to mean two separate
characters:

    U+00C2 Â LATIN CAPITAL LETTER A WITH CIRCUMFLEX
    U+00A9 © COPYRIGHT SIGN
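
The same misreading can be reproduced in a few lines of Python, as a
sketch using only the built-in codecs (the variable names are mine):

```python
copyright_sign = "\u00a9"

# UTF-8 encodes U+00A9 as two bytes.
utf8_bytes = copyright_sign.encode("utf-8")

# Misread the same two bytes as ISO8859-15: each byte becomes its own
# character.
misread = utf8_bytes.decode("iso-8859-15")

print(utf8_bytes.hex())
print([hex(ord(c)) for c in misread])
```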

Your broken build system that started all this is never going to be
*anything* other than broken. You can only paper over the cracks and
make it slightly less likely that people will notice in the common
case, perhaps? That's all you do by *reducing* the use of non-ASCII,
unless you're going to drag us all the way back to the 1980s and
strictly limit us to pure ASCII, using the equivalent of trigraphs for
*anything* outside the 0-127 character ranges.

And even if you did that, systems which use EBCDIC as their local
encoding would *still* be broken, if they have the same bug you started
from. Because EBCDIC isn't compatible with ASCII *even* for the first 7
bits.
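
For anyone who wants to see that, a short Python sketch follows. It uses
cp037, which is just one EBCDIC code page that Python happens to ship;
picking it is my assumption for the illustration, and other EBCDIC
variants differ in detail.

```python
text = "ASCII"

ascii_bytes = text.encode("ascii")

# cp037 is one EBCDIC code page, chosen here purely for illustration.
ebcdic_bytes = text.encode("cp037")

print(ascii_bytes.hex())
print(ebcdic_bytes.hex())

# The text round-trips fine when the right codec is used on both sides.
round_trip = ebcdic_bytes.decode("cp037")
```

Plain letters that fit in ASCII's 7 bits land on entirely different byte
values in this code page, so a tool that guesses "the local encoding"
gets even pure-ASCII files wrong.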

> we discovered that some converted docs ended
> with BOM characters. Those specific changes were introduced by some
> of my convert patches, probably converted via pandoc.
>
> So, I went ahead in order to check what other possible weird things
> were introduced by the conversion, where several scripts and tools
> were used on files that had already a different markup.
>
> I actually checked the current UTF-8 issues, and asked people at
> linux-doc to comment what of those are valid usecases, and what
> should be replaced by plain ASCII.

No, these aren't "UTF-8 issues". Those are *conversion* issues, and
would still be there if the output of the conversion had been UTF-7,
UTF-16, etc. Or *even* if the output of the conversion had been
trigraph-like stuff like '--' for emdash. It's *nothing* to do with the
encoding that we happen to be using.

Fixing the conversion issues makes a lot of sense. Try to do it without
making *any* mention of UTF-8 at all.

> In summary, based on the discussions we have so far, I suspect that
> there's not much to be discussed for the above cases.
>
> So, I'll post a v3 of this series, changing only:
>
>         - U+00a0 (' '): NO-BREAK SPACE
>         - U+feff ('﻿'): ZERO WIDTH NO-BREAK SPACE (BOM)

Ack, as long as those make *no* mention of UTF-8. Except perhaps to
note that BOM is redundant because UTF-8 doesn't have a byteorder.
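
A small Python sketch of why the BOM carries no information in UTF-8,
using only built-in codecs (the "hello" payload is made up): the UTF-16
encodings of U+FEFF differ by byte order, while UTF-8 has exactly one
byte sequence for it.

```python
import codecs

bom = "\ufeff"

# In UTF-16 the two byte orders give different bytes, so the BOM tells
# a reader which order was used.
utf16_le = bom.encode("utf-16-le")
utf16_be = bom.encode("utf-16-be")

# In UTF-8 there is only one possible byte sequence, so it signals nothing.
utf8 = bom.encode("utf-8")

# Hypothetical file content that starts with a stray BOM.
data = codecs.BOM_UTF8 + "hello".encode("utf-8")

kept = data.decode("utf-8")          # U+FEFF stays as the first character
stripped = data.decode("utf-8-sig")  # this codec drops a leading BOM

print(utf16_le.hex(), utf16_be.hex(), utf8.hex())
print(repr(kept), repr(stripped))
```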

> ---
>
> Now, this specific patch series addresses also this extra case:
>
> 5. curly commas:
>
>         - U+2018 ('‘'): LEFT SINGLE QUOTATION MARK
>         - U+2019 ('’'): RIGHT SINGLE QUOTATION MARK
>         - U+201c ('“'): LEFT DOUBLE QUOTATION MARK
>         - U+201d ('”'): RIGHT DOUBLE QUOTATION MARK
>
> IMO, those should be replaced by ASCII commas: ' and ".
>
> The rationale is simple:
>
> - most were introduced during the conversion from Docbook,
>   markdown and LaTeX;
> - they don't add any extra value, as using "foo" or “foo” means
>   the same thing;
> - Sphinx already uses "fancy" commas at the output.
>
> I guess I will put this on a separate series, as this is not a bug
> fix, but just a cleanup from the conversion work.
>
> I'll re-post those cleanups on a separate series, for patch per patch
> review.

Makes sense. 

The left/right quotation marks exist to make human-readable text much
easier to read, but the key point here is that they are redundant
because the tooling already emits them in the *output* so they don't
need to be in the source, yes?

As long as the tooling gets it *right* and uses them where it should,
that seems sane enough.

However, it *does* break 'grep', because if I cut/paste a snippet from
the documentation and try to grep for it, it'll no longer match.

Consistency is good, but perhaps we should actually be consistent the
other way round and always use the left/right versions in the source
*instead* of relying on the tooling, to make searches work better?
You claimed to care about that, right?
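
The search breakage is easy to sketch in Python. The smarten() helper
below is a hypothetical stand-in for a renderer's smart-quote pass, not
Sphinx's actual code, and the sample sentence is made up.

```python
def smarten(text: str) -> str:
    """Hypothetical stand-in for a renderer's smart-quote pass."""
    out = []
    opening = True
    for ch in text:
        if ch == '"':
            # Alternate between left and right double quotation marks.
            out.append("\u201c" if opening else "\u201d")
            opening = not opening
        else:
            out.append(ch)
    return "".join(out)


source = 'Set the "foo" option.'   # what is in the source tree
rendered = smarten(source)         # what the reader sees and copies

# A snippet cut from the rendered page, then searched for in the source.
pasted = "\u201cfoo\u201d"

print(pasted in rendered)
print(pasted in source)
```

The pasted snippet is found in the rendered text but not in the source,
which is the mismatch a grep over the source tree would run into.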

> The remaining cases are future work, outside the scope of this v2:
>
> 6. Hyphen/Dashes and ellipsis
>
>         - U+2212 ('−'): MINUS SIGN
>         - U+00ad ('­'): SOFT HYPHEN
>         - U+2010 ('‐'): HYPHEN
>
>             Those three are used on places where a normal ASCII hyphen/minus
>             should be used instead. There are even a couple of C files which
>             use them instead of '-' on comments.
>
>             IMO are fixes/cleanups from conversions and bad cut-and-paste.

That seems to make sense.

>         - U+2013 ('–'): EN DASH
>         - U+2014 ('—'): EM DASH
>         - U+2026 ('…'): HORIZONTAL ELLIPSIS
>
>             Those are auto-replaced by Sphinx from "--", "---" and "...",
>             respectively.
>
>             I guess those are a matter of personal preference about
>             whether using ASCII or UTF-8.
>
>             My personal preference (and Ted seems to have a similar
>             opinion) is to let Sphinx do the conversion.
>
>             For those, I intend to post a separate series, to be
>             reviewed patch per patch, as this is really a matter
>             of personal taste. Hardly we'll reach a consensus here.
>

Again using the trigraph-like '--' and '...' instead of just using the
plain text '—' and '…' breaks searching, because what's in the output
doesn't match the input. Again consistency is good, but perhaps we
should standardise on just putting these in their plain text form
instead of the trigraphs?

> 7. math symbols:
>
>         - U+00d7 ('×'): MULTIPLICATION SIGN
>
>            This one is used mostly to describe video resolutions, but this is
>            on a smaller changeset than the ones that use "x" letter.

I think standardising on × for video resolutions in documentation would
make it look better and be easier to read.

>
>         - U+2217 ('∗'): ASTERISK OPERATOR
>
>            This is used only here:
>                 Documentation/filesystems/ext4/blockgroup.rst:filesystem size to 2^21 ∗ 2^27 = 2^48bytes or 256TiB.
>
>            Probably added by some conversion tool. IMO, this one should
>            also be replaced by an ASCII asterisk.
>
> I guess I'll post a patch for the ASTERISK OPERATOR.

That makes sense.