Re: (new) non-ASCII filenames break unit tests on Linux

On 08.12.2023 13:30, Michael Stahl wrote:
On 04/12/2023 13:05, Stephan Bergmann wrote:
On 12/4/23 12:10, Michael Stahl wrote:
On 03/12/2023 12:59, Stephan Bergmann wrote:
For better or worse, the payload of LO "internal" file URLs is always considered to be a UTF-8 encoding of the actual system pathname.  It is *not* a byte-for-byte representation of the bytes that make up the Unix system pathname.
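A minimal sketch of that point, using Python's standard library rather than LO's own code. The Latin-1 pathname is a made-up example, not one from this thread: the system pathname contains the single byte 0xE9, but the URL payload carries the UTF-8 encoding of the character that byte stands for.

```python
from urllib.parse import quote, unquote

# Hypothetical pathname on a system whose locale encoding is ISO-8859-1:
# the "e acute" character is the single byte 0xE9 on disk.
raw_pathname = b"caf\xe9.pdf"

# Step 1: interpret the system bytes in the locale encoding to get Unicode text.
unicode_name = raw_pathname.decode("latin-1")

# Step 2: the URL payload is the percent-encoded UTF-8 form of that text,
# not the percent-encoded raw bytes (which would have been "caf%E9.pdf").
payload = quote(unicode_name, encoding="utf-8")

# Going back requires both steps in reverse: UTF-8 decode, then locale encode.
round_trip = unquote(payload, encoding="utf-8").encode("latin-1")
```

Here `payload` is `caf%C3%A9.pdf`, so the URL and the on-disk name differ at the byte level even though they name the same file.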

What thus happens here is that the file UCP's TaskManager::getv -> osl::DirectoryItem::get -> osl_getDirectoryItem -> osl::detail::convertUrlToPathname -> getSystemPathFromFileUrl -> decodeFromUtf8 -> convert -> UnicodeToTextConverter_Impl::convert -> rtl_convertUnicodeToText tries to translate the Unicode chars of "hybrid_writer_абв_αβγ.pdf" to osl_getThreadTextEncoding() == RTL_TEXTENCODING_ASCII_US, which doesn't work because ASCII has no representation of the Cyrillic and Greek letters.
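The last step of that chain can be imitated with Python's codecs. This is only an analogy, assuming Python's "ascii" codec as a stand-in for RTL_TEXTENCODING_ASCII_US; it does not call rtl_convertUnicodeToText, and LO's exact substitution behaviour may differ.

```python
name = "hybrid_writer_абв_αβγ.pdf"  # filename from the failing test

# A strict conversion to ASCII fails outright on the first Cyrillic letter.
try:
    name.encode("ascii")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# A lossy conversion substitutes '?' for every unrepresentable character,
# producing a pathname that no longer matches the file on disk.
lossy = name.encode("ascii", errors="replace")

# Converting to UTF-8 instead keeps every character and is reversible.
utf8 = name.encode("utf-8")
restored = utf8.decode("utf-8")
```

With the lossy conversion the six non-ASCII letters all collapse to '?', so the resulting pathname cannot be used to find the original file.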

in the "C" locale, every 8-bit value is valid, but only ASCII (<128) values are meaningful; the intent is that the application does not interpret file-names, but uses them as-is, and replacing characters with '?' (as apparently happens here) looks wrong to me.

probably there isn't yet a RTL_TEXTENCODING_C that behaves like this.
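For comparison, a sketch of what such "pass the bytes through untouched" behaviour can look like. It uses Python's "surrogateescape" error handler (PEP 383) purely as an illustration; nothing here exists in LO, and the byte string is an invented example.

```python
# Hypothetical pathname bytes: a UTF-8 prefix followed by a stray 0xFF byte,
# so the whole sequence is valid in neither ASCII nor UTF-8.
raw = "абв".encode("utf-8") + b"\xff.pdf"

# Decode as ASCII, but map each byte >= 0x80 to a lone surrogate code point
# (U+DC80..U+DCFF) instead of failing or substituting '?'.
text = raw.decode("ascii", errors="surrogateescape")

# Encoding with the same handler turns the surrogates back into the
# original bytes, so the name survives a round trip without being interpreted.
back = text.encode("ascii", errors="surrogateescape")
```

The ASCII part of the name stays readable in `text`, while the other bytes are carried along opaquely and restored exactly in `back`.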

That's not the issue here (the issue is that "ASCII has no representation of the Cyrillic and Greek letters"), and the existing RTL_TEXTENCODING_UTF8 would do what you seek on that conversion step from a Unicode file URL payload to a byte sequence pathname.

it cannot be converted to or interpreted as RTL_TEXTENCODING_UTF8 or anything else because the meaning of non-ASCII characters in "C" locale is unspecified.

... considering that LO uses UTF-16 strings for everything including file paths, perhaps the best thing would be to add a check for the "C" locale on startup, print an error and abort.

Note that the original issue discussed here does not involve the "C" locale, where the problem would be expected, nor any other non-Unicode locale. As Rene said, the locale was UTF-8, and the system handled the files OK.

--
Best regards,
Mike Kaganski



