On Sat, 8 May 2021 08:55:11 -0700, Randy Dunlap <rdunlap@xxxxxxxxxxxxx> wrote:

> > In the mean time, I'm already preparing a patch series addressing
> > the issues inside documentation, using some scripting to avoid
> > manual mistakes:
> >
> > https://git.linuxtv.org/mchehab/experimental.git/log/?h=fix_utf8
> >
> > (patch series is not 100% yet... some adjustments are still
> > needed on some places).
>
>
> Thanks for digging into this and providing fixes.

Just pushed a new version there, rebasing the branch:

	https://git.linuxtv.org/mchehab/experimental.git/log/?h=fix_utf8

The first three patches were written manually, in order to address a
couple of special cases. I'll be submitting the patches via e-mail
later today.

The remaining ones were generated by a script that searches for
non-ASCII UTF-8 characters only inside Documentation .rst and ABI
files, doing this conversion:

	my %char_map = (
		0x2010 => '-',		# HYPHEN
		0xad   => '-',		# SOFT HYPHEN
		0x2013 => '-',		# EN DASH
		0x2014 => '-',		# EM DASH

		0x2018 => "'",		# LEFT SINGLE QUOTATION MARK
		0x2019 => "'",		# RIGHT SINGLE QUOTATION MARK
		0xb4   => "'",		# ACUTE ACCENT

		0x201c => '"',		# LEFT DOUBLE QUOTATION MARK
		0x201d => '"',		# RIGHT DOUBLE QUOTATION MARK

		0x2212 => '-',		# MINUS SIGN
		0x2217 => '*',		# ASTERISK OPERATOR
		0xd7   => 'x',		# MULTIPLICATION SIGN

		0xbb   => '>',		# RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK

		0xa0   => ' ',		# NO-BREAK SPACE
		0xfeff => '',		# ZERO WIDTH NO-BREAK SPACE
	);

After the conversion, these UTF-8 characters will remain under
Documentation/:

	- U+00a9 ('©'): COPYRIGHT SIGN
	- U+00ac ('¬'): NOT SIGN		# only at Documentation/powerpc/transactional_memory.rst
	- U+00ae ('®'): REGISTERED SIGN
	- U+00b0 ('°'): DEGREE SIGN
	- U+00b1 ('±'): PLUS-MINUS SIGN
	- U+00b2 ('²'): SUPERSCRIPT TWO
	- U+00b5 ('µ'): MICRO SIGN
	- U+00b7 ('·'): MIDDLE DOT		# See below
	- U+00bd ('½'): VULGAR FRACTION ONE HALF
	- U+00c7 ('Ç'): LATIN CAPITAL LETTER C WITH CEDILLA
	- U+00df ('ß'): LATIN SMALL LETTER SHARP S
	- U+00e1 ('á'): LATIN SMALL LETTER A WITH ACUTE
	- U+00e4 ('ä'): LATIN SMALL LETTER A WITH DIAERESIS
	- U+00e6 ('æ'): LATIN SMALL LETTER AE
	- U+00e7 ('ç'): LATIN SMALL LETTER C WITH CEDILLA
	- U+00e9 ('é'): LATIN SMALL LETTER E WITH ACUTE
	- U+00ea ('ê'): LATIN SMALL LETTER E WITH CIRCUMFLEX
	- U+00eb ('ë'): LATIN SMALL LETTER E WITH DIAERESIS
	- U+00f3 ('ó'): LATIN SMALL LETTER O WITH ACUTE
	- U+00f4 ('ô'): LATIN SMALL LETTER O WITH CIRCUMFLEX
	- U+00f6 ('ö'): LATIN SMALL LETTER O WITH DIAERESIS
	- U+00f8 ('ø'): LATIN SMALL LETTER O WITH STROKE
	- U+00fa ('ú'): LATIN SMALL LETTER U WITH ACUTE
	- U+00fc ('ü'): LATIN SMALL LETTER U WITH DIAERESIS
	- U+00fd ('ý'): LATIN SMALL LETTER Y WITH ACUTE
	- U+011f ('ğ'): LATIN SMALL LETTER G WITH BREVE
	- U+0142 ('ł'): LATIN SMALL LETTER L WITH STROKE
	- U+03bc ('μ'): GREEK SMALL LETTER MU
	- U+2026 ('…'): HORIZONTAL ELLIPSIS
	- U+2122 ('™'): TRADE MARK SIGN
	- U+2191 ('↑'): UPWARDS ARROW
	- U+2192 ('→'): RIGHTWARDS ARROW
	- U+2193 ('↓'): DOWNWARDS ARROW
	- U+2264 ('≤'): LESS-THAN OR EQUAL TO
	- U+2265 ('≥'): GREATER-THAN OR EQUAL TO
	- U+2500 ('─'): BOX DRAWINGS LIGHT HORIZONTAL
	- U+2502 ('│'): BOX DRAWINGS LIGHT VERTICAL
	- U+2514 ('└'): BOX DRAWINGS LIGHT UP AND RIGHT
	- U+251c ('├'): BOX DRAWINGS LIGHT VERTICAL AND RIGHT
	- U+2b0d ('⬍'): UP DOWN BLACK ARROW

For U+00b7 ('·'): MIDDLE DOT, I opted to keep it in a few places:

- Documentation/devicetree/bindings/clock/qcom,rpmcc.txt

  As this file will some day be converted to yaml, at which point the
  MIDDLE DOT will be removed, I guess it is not worth touching it.

- Documentation/scheduler/sched-deadline.rst

  There, it is used in math expressions, so it is better to keep it.

- Documentation/devicetree/bindings/media/video-interface-devices.yaml

  There, it is part of an ASCII artwork.

- translations/zh_CN

  I prefer not touching it, as it might have some special meaning in
  Simplified Chinese.

Thanks,
Mauro
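
For illustration only, here is a minimal Python sketch of the conversion described in the message above. It is not the script that generated the patches (the map shown in the message is Perl). The names CHAR_MAP, to_ascii_punctuation and remaining_non_ascii are made up for this example, and the walk over Documentation .rst and ABI files is left out. Only the code points and replacements come from the message.

```python
# Hypothetical Python rendering of the %char_map from the message.
# Keys are Unicode code points, values are the ASCII replacements.
CHAR_MAP = {
    0x2010: "-",   # HYPHEN
    0x00AD: "-",   # SOFT HYPHEN
    0x2013: "-",   # EN DASH
    0x2014: "-",   # EM DASH
    0x2018: "'",   # LEFT SINGLE QUOTATION MARK
    0x2019: "'",   # RIGHT SINGLE QUOTATION MARK
    0x00B4: "'",   # ACUTE ACCENT
    0x201C: '"',   # LEFT DOUBLE QUOTATION MARK
    0x201D: '"',   # RIGHT DOUBLE QUOTATION MARK
    0x2212: "-",   # MINUS SIGN
    0x2217: "*",   # ASTERISK OPERATOR
    0x00D7: "x",   # MULTIPLICATION SIGN
    0x00BB: ">",   # RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
    0x00A0: " ",   # NO-BREAK SPACE
    0xFEFF: "",    # ZERO WIDTH NO-BREAK SPACE (dropped entirely)
}


def to_ascii_punctuation(text: str) -> str:
    """Replace every mapped code point; leave all other characters as-is."""
    # str.translate accepts a dict of code point -> replacement string.
    return text.translate(CHAR_MAP)


def remaining_non_ascii(text: str) -> list:
    """Return the distinct non-ASCII characters in text, sorted by code point."""
    return sorted({ch for ch in text if ord(ch) > 0x7F})


if __name__ == "__main__":
    sample = "\u201cquoted\u201d text \u2013 5\u00b0C \u00b1 1"
    converted = to_ascii_punctuation(sample)
    print(converted)
    # Characters outside the map (here DEGREE SIGN and PLUS-MINUS SIGN)
    # survive, which matches the "will remain" list in the message.
    print(remaining_non_ascii(converted))
```

Running remaining_non_ascii over converted text is one way to build a list like the "will remain" one above, under the assumption that anything not in the map is intentionally kept.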