On Wed, 12 May 2021 10:25:35 +0100,
David Woodhouse <dwmw2@xxxxxxxxxxxxx> wrote:

> On Wed, 2021-05-12 at 10:44 +0200, Mauro Carvalho Chehab wrote:
> > The main point here is that a large amount of those UTF-8 characters
> > appeared as a result of document conversion from DocBook/LaTeX/Markdown.
> >
> > As the conversion ended, I don't expect the need of re-doing a series
> > like that in the near future.
> >
> > There are even some cases where the UTF-8 were doing wrong things, like
> > using an EN DASH instead of a hyphen in order to pass a command line
> > parameter, and the addition of non-printable BOM characters.
> >
> > So, IMO, this is a necessary cleanup after the conversion.
>
> That part, fixing characters that are *wrong*, such as converting a
> UTF-8 U+2014 EM DASH to a UTF-8 U+002D HYPHEN-MINUS, is reasonable
> enough.
>
> But you're not "avoiding using UTF-8 chars" there, as it says in the
> title of this patch. HYPHEN-MINUS encoded as 0x2D *is* UTF-8.

Yeah, you're right: ASCII is a subset of UTF-8, just as it is a
subset of other charsets as well[1].

[1] ASCII is a subset of all charsets mentioned at:
    https://man7.org/linux/man-pages/man7/charsets.7.html

A more precise title would be something like:

	Use ASCII instead of non-ASCII UTF-8 alternate symbols

or

	Use ASCII subset instead of UTF-8 alternate symbols

See, the goal of this series is to address the cases where there are
multiple UTF-8 alternate symbols with the same meaning as a character
from the original ASCII set.

Most of them were introduced by tools like DocBook/LaTeX/pandoc during
document conversions[2], not by design, but just because the UTF-8
non-ASCII symbols produce a nicer output in html or pdf. In other
words, it was a toolset decision to change them, diverging from what
the author originally typed.

[2] I suspect that a few of them could have been introduced as a
    result of someone using a text editor like LibreOffice (or
    equivalent), which has a similar behavior.
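As an illustration of the kind of replacement described above, here is a
minimal Python sketch. The mapping table and both function names are
hypothetical, made up for this example; it is not the script used to
produce the series, and a real cleanup would need a reviewed list of
characters per file.

```python
# Hypothetical table of non-ASCII symbols that have an ASCII alternate.
# Illustration only: this is NOT the list used by the patch series.
ASCII_ALTERNATES = {
    "\u2013": "-",   # EN DASH
    "\u2014": "-",   # EM DASH
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u201c": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201d": '"',   # RIGHT DOUBLE QUOTATION MARK
    "\u00a0": " ",   # NO-BREAK SPACE
    "\ufeff": "",    # BOM (ZERO WIDTH NO-BREAK SPACE), dropped entirely
}

_TABLE = str.maketrans(ASCII_ALTERNATES)


def to_ascii_alternates(text: str) -> str:
    """Replace the symbols listed above; leave everything else untouched."""
    return text.translate(_TABLE)


def non_ascii_chars(text: str) -> list[str]:
    """Return the distinct non-ASCII characters still present, sorted."""
    return sorted({ch for ch in text if ord(ch) > 127})
```

Characters that are not in the table (an accented letter in an author's
name, for instance) are kept as they are, so non_ascii_chars() can be
used afterwards to list what still needs a manual look.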
With ReST, there's no need to use any of those, as the build tools
already do such conversions when generating html/pdf output. So, it is
better to stick with the ASCII subset in such cases, as it allows
better use of tools like grep and makes it easier to edit such files
in editors like vi, nano, emacs, etc.

Thanks,
Mauro