Re: [PATCH v3] tty/vt: UTF-8 parsing update according to RFC 3629, modern Unicode

On Tue, Dec 12, 2023 at 09:26:53PM +0100, Roman Žilka wrote:
> vc_translate_unicode() and vc_sanitize_unicode() parse input to the
> UTF-8-enabled console, marking invalid byte sequences and producing Unicode
> codepoints. The current algorithm follows ancient Unicode and may accept
> invalid byte sequences, pass on non-existent codepoints and reject valid
> sequences.
> 
> The patch restores the functions' compliance with modern Unicode (v15.1 [1]
> + many previous versions) as well as RFC 3629 [2].
> 1. Codepoint space is limited to 0x10FFFF.

Wait, why?  And shouldn't this be an individual patch on its own?  What
is wrong with the checking we currently have?
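
For context on item 1, here is a minimal sketch of the range check that
RFC 3629 implies: a decoded value is acceptable only if it is at most
0x10FFFF and is not a UTF-16 surrogate (0xD800-0xDFFF).  The helper name
is hypothetical; this is not the code from the patch or from vt.c.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper for illustration only; not taken from the patch.
 * Returns true if cp is a Unicode scalar value, i.e. something that
 * RFC 3629 allows a UTF-8 sequence to encode.
 */
static bool is_unicode_scalar_value(uint32_t cp)
{
	if (cp > 0x10FFFF)			/* past the last codepoint */
		return false;
	if (cp >= 0xD800 && cp <= 0xDFFF)	/* UTF-16 surrogate range */
		return false;
	return true;
}
```

Under this check U+10FFFF passes, while 0x110000 and U+D800 are rejected.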

> 2. "Noncharacters", such as U+FFFE, U+FFFF, are no longer invalid in
>    Unicode and will be accepted.

Accepted when?

> Another option was to complete the set of
>    noncharacters (used to be just those two, now there's more) and preserve
>    the rejection step. This is indeed what Unicode suggests (v15.1, chap.
>    23.7) (not requires), but most codepoints are !iswprint(), so selecting
>    just the noncharacters seemed arbitrary and futile (and unnecessary).

What is this change going to break on existing systems that assumed
these were invalid characters?
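
For reference, the "completed" noncharacter set that item 2 mentions is
small: U+FDD0..U+FDEF plus the last two codepoints of each of the 17
planes (U+FFFE, U+FFFF, U+1FFFE, ... U+10FFFF), 66 in total.  A rough
sketch of the rejection step the patch chose not to keep, with a
hypothetical helper name, could look like this:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper for illustration only; not taken from the patch.
 * Returns true if cp is one of the 66 Unicode noncharacters.
 */
static bool is_noncharacter(uint32_t cp)
{
	if (cp > 0x10FFFF)			/* outside the codepoint space */
		return false;
	if (cp >= 0xFDD0 && cp <= 0xFDEF)	/* contiguous block of 32 */
		return true;
	/* Low 16 bits are 0xFFFE or 0xFFFF: last two of every plane. */
	return (cp & 0xFFFE) == 0xFFFE;
}
```

Whether the console should reject these at all is exactly the policy
question raised above; the sketch only shows what the set is.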

> On the side:
> 3. Corrected/improved the doc of the two functions (esp. @rescan).

Again, a separate commit.  When you have to list the changes out, that
is a huge hint it needs to be broken up into smaller pieces.

thanks,

greg k-h



