On Tue, Dec 12, 2023 at 09:26:53PM +0100, Roman Žilka wrote:
> vc_translate_unicode() and vc_sanitize_unicode() parse input to the
> UTF-8-enabled console, marking invalid byte sequences and producing Unicode
> codepoints. The current algorithm follows ancient Unicode and may accept
> invalid byte sequences, pass on non-existent codepoints and reject valid
> sequences.
>
> The patch restores the functions' compliance with modern Unicode (v15.1 [1]
> + many previous versions) as well as RFC 3629 [2].
> 1. Codepoint space is limited to 0x10FFFF.

Wait, why?  And shouldn't this be an individual patch on its own?  What
is wrong with the checking we currently have?

> 2. "Noncharacters", such as U+FFFE, U+FFFF, are no longer invalid in
> Unicode and will be accepted.

Accepted when?

> Another option was to complete the set of
> noncharacters (used to be just those two, now there's more) and preserve
> the rejection step. This is indeed what Unicode suggests (v15.1, chap.
> 23.7) (not requires), but most codepoints are !iswprint(), so selecting
> just the noncharacters seemed arbitrary and futile (and unnecessary).

What is this change going to break with existing systems that were
thinking these were invalid characters?

> On the side:
> 3. Corrected/improved the doc of the two functions (esp. @rescan).

Again, a separate commit.

When you have to list the changes out, that is a huge hint it needs to
be broken up into smaller pieces.

thanks,

greg k-h
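
For reference, the two checks discussed in points 1 and 2 of the quoted description can be sketched in standalone C. This is only an illustration under stated assumptions, not the kernel's code and not the patch under review: the helper names is_unicode_scalar() and is_noncharacter() are hypothetical, and the surrogate exclusion comes from RFC 3629 rather than from the text quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Highest code point permitted by RFC 3629 and current Unicode. */
#define UNICODE_MAX 0x10FFFFu

/*
 * Hypothetical helper: true if cp may be encoded as UTF-8 at all,
 * i.e. it is within the code space and is not a UTF-16 surrogate.
 */
static bool is_unicode_scalar(uint32_t cp)
{
	if (cp > UNICODE_MAX)
		return false;
	if (cp >= 0xD800 && cp <= 0xDFFF)
		return false;
	return true;
}

/*
 * Hypothetical helper: true if cp is one of the 66 noncharacters,
 * U+FDD0..U+FDEF plus the last two code points of each of the
 * 17 planes (U+FFFE, U+FFFF, U+1FFFE, ..., U+10FFFF).
 */
static bool is_noncharacter(uint32_t cp)
{
	if (cp >= 0xFDD0 && cp <= 0xFDEF)
		return true;
	return cp <= UNICODE_MAX && (cp & 0xFFFE) == 0xFFFE;
}
```

A decoder that only enforces the first check accepts U+FFFE and U+FFFF; one that also rejects everything matched by the second check implements the "complete the set of noncharacters" option mentioned in the quoted text.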