Re: (new) non-ASCII filenames break unit tests on Linux

Stephan Bergmann <stephan.bergmann@xxxxxxxxxxxxx> · Mon, 4 Dec 2023 13:05:39 +0100

On 12/4/23 12:10, Michael Stahl wrote:
On 03/12/2023 12:59, Stephan Bergmann wrote:
For better or worse, the payload of LO "internal" file URLs is always 
considered to be a UTF-8 encoding of the actual system pathname.  It 
is *not* a byte-for-byte representation of the bytes that make up the 
Unix system pathname.

What thus happens here is that the file UCP's TaskManager::getv -> 
osl::DirectoryItem::get -> osl_getDirectoryItem -> 
osl::detail::convertUrlToPathname -> getSystemPathFromFileUrl -> 
decodeFromUtf8 -> convert -> UnicodeToTextConverter_Impl::convert -> 
rtl_convertUnicodeToText tries to translate the Unicode chars of 
"hybrid_writer_абв_αβγ.pdf" to osl_getThreadTextEncoding() == 
RTL_TEXTENCODING_ASCII_US, which doesn't work because ASCII has no 
representation of the Cyrillic and Greek letters.
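
To illustrate that last step outside of LibreOffice, here is a minimal Python sketch (not the actual rtl_convertUnicodeToText code; the helper name to_pathname_bytes is made up for this example). It only shows that a strict conversion of this filename to ASCII must fail, and that a lenient conversion can do no better than substitute a replacement character:

```python
# Illustration only: Python's codecs stand in for the rtl text converters.
name = "hybrid_writer_абв_αβγ.pdf"


def to_pathname_bytes(text, encoding):
    """Hypothetical stand-in for the Unicode -> pathname-bytes conversion."""
    return text.encode(encoding)


# Strict conversion: ASCII has no code points for the Cyrillic/Greek letters.
try:
    to_pathname_bytes(name, "ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Lenient conversion: each unrepresentable character becomes '?'.
lossy = name.encode("ascii", errors="replace")

print(ascii_ok)  # False
print(lossy)     # b'hybrid_writer_???_???.pdf'
```

Either way the resulting byte sequence no longer names the file that is actually on disk.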

In the "C" locale, every 8-bit value is valid, but only ASCII (<128) 
values are meaningful; the intent is that the application does not 
interpret filenames, but uses them as-is, and replacing characters with 
'?' (as apparently happens here) looks wrong to me.

Probably there isn't yet an RTL_TEXTENCODING_C that behaves like this.

That's not the issue here (the issue is that "ASCII has no 
representation of the Cyrillic and Greek letters"), and the existing 
RTL_TEXTENCODING_UTF8 would do what you seek on that conversion step 
from a Unicode file URL payload to a byte sequence pathname.
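
As a rough sketch of that conversion step with UTF-8 as the target encoding, again in Python rather than with the real osl/rtl functions (the /tmp directory and the variable names are assumptions made for this example only):

```python
from urllib.parse import quote, unquote

# Assumed example location; only the filename comes from the failing test.
pathname = "/tmp/hybrid_writer_абв_αβγ.pdf"

# The file URL payload is the percent-encoded UTF-8 form of the pathname.
url = "file://" + quote(pathname)

# Decoding the payload yields the Unicode string again ...
payload = unquote(url[len("file://"):])

# ... and converting that string to UTF-8 gives a usable byte sequence,
# where a conversion to ASCII would have nothing to map the letters to.
path_bytes = payload.encode("utf-8")

print(url)
print(path_bytes.decode("utf-8") == pathname)  # True
```

With UTF-8 every character in the URL payload has a representation, so the round trip from pathname to URL and back is lossless.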