On 12/4/23 12:10, Michael Stahl wrote:
> On 03/12/2023 12:59, Stephan Bergmann wrote:
>> For better or worse, the payload of LO "internal" file URLs is always
>> considered to be a UTF-8 encoding of the actual system pathname. It
>> is *not* a byte-for-byte representation of the bytes that make up the
>> Unix system pathname.
>>
>> What thus happens here is that the file UCP's TaskManager::getv ->
>> osl::DirectoryItem::get -> osl_getDirectoryItem ->
>> osl::detail::convertUrlToPathname -> getSystemPathFromFileUrl ->
>> decodeFromUtf8 -> convert -> UnicodeToTextConverter_Impl::convert ->
>> rtl_convertUnicodeToText tries to translate the Unicode chars of
>> "hybrid_writer_абв_αβγ.pdf" to osl_getThreadTextEncoding() ==
>> RTL_TEXTENCODING_ASCII_US, which doesn't work because ASCII has no
>> representation of the Cyrillic and Greek letters.
>
> in the "C" locale, every 8-bit value is valid, but only ASCII (<128)
> values are meaningful; the intent is that the application does not
> interpret file-names, but uses them as-is, and replacing characters with
> '?' (as apparently happens here) looks wrong to me.
>
> probably there isn't yet a RTL_TEXTENCODING_C that behaves like this.
That's not the issue here (the issue is that "ASCII has no
representation of the Cyrillic and Greek letters"), and the existing
RTL_TEXTENCODING_UTF8 would do what you seek on that conversion step
from a Unicode file URL payload to a byte sequence pathname.
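
To illustrate the difference between the two target encodings on that conversion step, here is a minimal sketch in Python rather than in terms of the actual rtl_convertUnicodeToText code. The helper name to_pathname_bytes is made up for this example, and the '?' substitution is only an assumption that mirrors the behavior reported above; it is not taken from the LibreOffice sources.

```python
# Analogy only: this is not LibreOffice code. It shows what happens when
# the same Unicode file name is converted to bytes with an ASCII target
# encoding versus a UTF-8 target encoding.

name = "hybrid_writer_абв_αβγ.pdf"


def to_pathname_bytes(text: str, encoding: str) -> bytes:
    """Hypothetical helper: encode a Unicode file name into pathname bytes.

    Characters that the target encoding cannot represent are replaced
    with '?', which is assumed to resemble the behavior seen in the thread.
    """
    return text.encode(encoding, errors="replace")


# ASCII has no representation of the Cyrillic and Greek letters,
# so each of them is lost and replaced.
ascii_bytes = to_pathname_bytes(name, "ascii")

# UTF-8 can represent every character, so the conversion is lossless
# and the original name can be recovered from the bytes.
utf8_bytes = to_pathname_bytes(name, "utf-8")

print(ascii_bytes)
print(utf8_bytes.decode("utf-8") == name)
```

With an ASCII target the six non-ASCII letters cannot survive the conversion, so the resulting pathname no longer names the file on disk. With a UTF-8 target the bytes round-trip back to the original name.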