Re: [PATCH v3] tty/vt: UTF-8 parsing update according to RFC 3629, modern Unicode

Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx> · Thu, 4 Jan 2024 16:28:47 +0100

On Tue, Dec 12, 2023 at 09:26:53PM +0100, Roman Žilka wrote:
> vc_translate_unicode() and vc_sanitize_unicode() parse input to the
> UTF-8-enabled console, marking invalid byte sequences and producing Unicode
> codepoints. The current algorithm follows ancient Unicode and may accept
> invalid byte sequences, pass on non-existent codepoints and reject valid
> sequences.
> 
> The patch restores the functions' compliance with modern Unicode (v15.1 [1]
> + many previous versions) as well as RFC 3629 [2].
> 1. Codepoint space is limited to 0x10FFFF.

Wait, why?  And shouldn't this be an individual patch on its own?  What
is wrong with the checking we currently have?

> 2. "Noncharacters", such as U+FFFE, U+FFFF, are no longer invalid in
>    Unicode and will be accepted.

Accepted when?

> Another option was to complete the set of
>    noncharacters (used to be just those two, now there's more) and preserve
>    the rejection step. This is indeed what Unicode suggests (v15.1, chap.
>    23.7) (not requires), but most codepoints are !iswprint(), so selecting
>    just the noncharacters seemed arbitrary and futile (and unnecessary).

What is this change going to break with existing systems that were
thinking these were invalid characters?

> On the side:
> 3. Corrected/improved the doc of the two functions (esp. @rescan).

Again, a separate commit.  When you have to list the changes out, that
is a huge hint it needs to be broken up into smaller pieces.

thanks,

greg k-h
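
For reference, the checks described in points 1 and 2 of the quoted
description can be sketched in userspace C. This is only a minimal sketch
under stated assumptions, not the code of vc_translate_unicode() or
vc_sanitize_unicode(): the helper name utf8_decode_one() and its return
convention are hypothetical. It follows RFC 3629 in rejecting overlong
forms, surrogates and anything above 0x10FFFF, and it lets noncharacters
such as U+FFFE and U+FFFF through, as the patch description proposes.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not kernel code: decode one UTF-8 sequence from
 * s[0..len). Returns the number of bytes consumed and stores the codepoint
 * in *cp, or returns 0 if the sequence is invalid per RFC 3629.
 */
static size_t utf8_decode_one(const unsigned char *s, size_t len, uint32_t *cp)
{
	/* Smallest codepoint that legitimately needs n bytes (index = n). */
	static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
	size_t n, i;
	uint32_t c;

	if (len == 0)
		return 0;

	if (s[0] < 0x80) {
		*cp = s[0];
		return 1;
	} else if ((s[0] & 0xE0) == 0xC0) {
		n = 2;
		c = s[0] & 0x1F;
	} else if ((s[0] & 0xF0) == 0xE0) {
		n = 3;
		c = s[0] & 0x0F;
	} else if ((s[0] & 0xF8) == 0xF0) {
		n = 4;
		c = s[0] & 0x07;
	} else {
		return 0;	/* stray continuation byte or 0xF8..0xFF lead */
	}

	if (len < n)
		return 0;	/* truncated sequence */

	for (i = 1; i < n; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return 0;	/* not a continuation byte */
		c = (c << 6) | (s[i] & 0x3F);
	}

	if (c < min_cp[n])
		return 0;	/* overlong encoding */
	if (c >= 0xD800 && c <= 0xDFFF)
		return 0;	/* UTF-16 surrogates are not valid in UTF-8 */
	if (c > 0x10FFFF)
		return 0;	/* beyond the Unicode codepoint space */

	/* Noncharacters (e.g. U+FFFE, U+FFFF) are deliberately accepted. */
	*cp = c;
	return n;
}
```

With this sketch, the bytes EF BF BF decode to U+FFFF and F4 8F BF BF to
U+10FFFF, while F4 90 80 80 (which would be U+110000), the overlong form
C0 80 and the surrogate encoding ED A0 80 are all rejected.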