Hi David,

On Mon, 10 May 2021 11:54:02 +0100
David Woodhouse <dwmw2@xxxxxxxxxxxxx> wrote:

> On Mon, 2021-05-10 at 12:26 +0200, Mauro Carvalho Chehab wrote:
> > There are several UTF-8 characters at the Kernel's documentation.
> >
> > Several of them were due to the process of converting files from
> > DocBook, LaTeX, HTML and Markdown. They were probably introduced
> > by the conversion tools used on that time.
> >
> > Other UTF-8 characters were added along the time, but they're easily
> > replaceable by ASCII chars.
> >
> > As Linux developers are all around the globe, and not everybody has UTF-8
> > as their default charset, better to use UTF-8 only on cases where it is really
> > needed.
>
> No, that is absolutely the wrong approach.
>
> If someone has a local setup which makes bogus assumptions about text
> encodings, that is their own mistake.
>
> We don't do them any favours by trying to *hide* it in the common case
> so that they don't notice it for longer.
>
> There really isn't much excuse for such brokenness, this far into the
> 21st century.
>
> Even *before* UTF-8 came along in the final decade of the last
> millennium, it was important to know which character set a given piece
> of text was encoded in.
>
> In fact it was even *more* important back then, we couldn't just assume
> UTF-8 everywhere like we can in modern times.
>
> Git can already do things like CRLF conversion on checking files out to
> match local conventions; if you want to teach it to do character set
> conversions too then I suppose that might be useful to a few developers
> who've fallen through a time warp and still need it. But nobody's ever
> bothered before because it just isn't necessary these days.
>
> Please *don't* attempt to address this anachronistic and esoteric
> "requirement" by dragging the kernel source back in time by three
> decades.

No. The idea is not to go back three decades. The goal is just to avoid
using UTF-8 where it is not needed.
See, the vast majority of UTF-8 chars are kept:

	- Non-ASCII Latin and Greek chars;
	- Box drawings;
	- arrows;
	- most symbols.

There, it makes perfect sense to keep using UTF-8.

We should keep using UTF-8 in the Kernel. That is something that
shouldn't be changed.

---

This patch series only does the conversion where using ASCII makes
more sense than using UTF-8.

See, a number of converted documents ended up with weird characters
like the ZERO WIDTH NO-BREAK SPACE (U+FEFF) character. This specific
character doesn't do any good.

Others use NO-BREAK SPACE (U+00A0) instead of 0x20. Harmless, until
someone tries to use grep[1].

[1] try to run:

    $ git grep "CPU 0 has been" Documentation/RCU/

    it will return nothing with current upstream.

    But it will work fine after the series is applied:

    $ git grep "CPU 0 has been" Documentation/RCU/
      Documentation/RCU/Design/Data-Structures/Data-Structures.rst:| #. CPU 0 has been in dyntick-idle mode for quite some time. When it |
      Documentation/RCU/Design/Data-Structures/Data-Structures.rst:| notices that CPU 0 has been in dyntick idle mode, which qualifies |

The main point of this series is to replace just the occurrences
where ASCII represents the symbol equally well, i.e. it is limited
to these chars:

	- U+2010 ('‐'): HYPHEN
	- U+00ad (''): SOFT HYPHEN
	- U+2013 ('–'): EN DASH
	- U+2014 ('—'): EM DASH
	- U+2018 ('‘'): LEFT SINGLE QUOTATION MARK
	- U+2019 ('’'): RIGHT SINGLE QUOTATION MARK
	- U+00b4 ('´'): ACUTE ACCENT
	- U+201c ('“'): LEFT DOUBLE QUOTATION MARK
	- U+201d ('”'): RIGHT DOUBLE QUOTATION MARK
	- U+00d7 ('×'): MULTIPLICATION SIGN
	- U+2212 ('−'): MINUS SIGN
	- U+2217 ('∗'): ASTERISK OPERATOR
	  (this one is used as a pointer reference like "*foo" in a C code
	   example inside a document converted from LaTeX)
	- U+00bb ('»'): RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
	  (this one is also used wrongly in an ABI file, meaning '>')
	- U+00a0 (' '): NO-BREAK SPACE
	- U+feff (''): ZERO WIDTH NO-BREAK SPACE

Using the above symbols will just trick tools like grep for no good
reason.

Thanks,
Mauro
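
For illustration only: the replacement of the characters listed above
could be sketched in Python roughly as below. This is not the script
used for the series. The ASCII target picked for each character, and
the names ASCII_REPLACEMENTS and asciify(), are assumptions made for
this sketch; every other character is left untouched.

```python
# Hypothetical mapping built from the list in this mail.
# The ASCII targets are assumptions, not what the series necessarily uses.
ASCII_REPLACEMENTS = {
    "\u2010": "-",   # HYPHEN
    "\u00ad": "",    # SOFT HYPHEN (dropped here; an assumption)
    "\u2013": "-",   # EN DASH
    "\u2014": "-",   # EM DASH
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u00b4": "'",   # ACUTE ACCENT
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
    "\u00d7": "x",   # MULTIPLICATION SIGN
    "\u2212": "-",   # MINUS SIGN
    "\u2217": "*",   # ASTERISK OPERATOR
    "\u00bb": ">",   # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
    "\u00a0": " ",   # NO-BREAK SPACE
    "\ufeff": "",    # ZERO WIDTH NO-BREAK SPACE
}

# str.maketrans() accepts a dict of single characters to replacement strings.
_TABLE = str.maketrans(ASCII_REPLACEMENTS)


def asciify(text: str) -> str:
    """Replace only the characters listed above; keep all other UTF-8 as-is."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    # A line with NO-BREAK SPACE between "CPU" and "0", as in the RCU docs.
    line = "CPU\u00a00 has been in dyntick-idle mode"
    print("CPU 0 has been" in line)            # plain-space search misses it
    print("CPU 0 has been" in asciify(line))   # matches after the replacement
```

Greek letters, arrows and box drawings are not in the table, so they
pass through unchanged, which matches the intent described above.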