On 04/12/2023 13:05, Stephan Bergmann wrote:
> On 12/4/23 12:10, Michael Stahl wrote:
>> On 03/12/2023 12:59, Stephan Bergmann wrote:
>>> For better or worse, the payload of LO "internal" file URLs is always
>>> considered to be a UTF-8 encoding of the actual system pathname. It
>>> is *not* a byte-for-byte representation of the bytes that make up the
>>> Unix system pathname.
>>> What thus happens here is that the file UCP's TaskManager::getv ->
>>> osl::DirectoryItem::get -> osl_getDirectoryItem ->
>>> osl::detail::convertUrlToPathname -> getSystemPathFromFileUrl ->
>>> decodeFromUtf8 -> convert -> UnicodeToTextConverter_Impl::convert ->
>>> rtl_convertUnicodeToText tries to translate the Unicode chars of
>>> "hybrid_writer_абв_αβγ.pdf" to osl_getThreadTextEncoding() ==
>>> RTL_TEXTENCODING_ASCII_US, but which doesn't work because ASCII has
>>> no representation of the Cyrillic and Greek letters.
>> in the "C" locale, every 8-bit value is valid, but only ASCII (<128)
>> values are meaningful; the intent is that the application does not
>> interpret file-names, but uses them as-is, and replacing characters
>> with '?' (as apparently happens here) looks wrong to me.
>> probably there isn't yet a RTL_TEXTENCODING_C that behaves like this.
> That's not the issue here (the issue is that "ASCII has no
> representation of the Cyrillic and Greek letters"), and the existing
> RTL_TEXTENCODING_UTF8 would do what you seek on that conversion step
> from a Unicode file URL payload to a byte sequence pathname.
the byte sequence pathname cannot be converted to or interpreted as
RTL_TEXTENCODING_UTF8 or anything else, because the meaning of non-ASCII
characters in the "C" locale is unspecified.
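
To make the two behaviours under discussion concrete, here is a minimal sketch in Python. It is only an analogy: it uses Python's own codecs, not the rtl text converters, and the helper names (to_ascii_lossy, to_ascii_strict, roundtrip_opaque) are made up for illustration. It shows that the file name from this thread has no strict ASCII encoding, that a lossy conversion turns each non-ASCII character into '?', and that treating the bytes as opaque (here via Python's "surrogateescape" error handler) round-trips them unchanged without assigning them any meaning.

```python
from typing import Optional

# File name taken from the thread above.
NAME = "hybrid_writer_абв_αβγ.pdf"


def to_ascii_strict(text: str) -> Optional[bytes]:
    """Return the ASCII encoding, or None if some character has none."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError:
        return None


def to_ascii_lossy(text: str) -> bytes:
    """Encode to ASCII, replacing every unrepresentable character with '?'."""
    return text.encode("ascii", errors="replace")


def roundtrip_opaque(raw: bytes) -> bytes:
    """Carry arbitrary bytes through a str and back without interpreting them.

    Bytes >= 0x80 are mapped to lone surrogates on decode and restored on
    encode, so the pathname bytes survive as-is.
    """
    text = raw.decode("ascii", errors="surrogateescape")
    return text.encode("ascii", errors="surrogateescape")


if __name__ == "__main__":
    print(to_ascii_strict(NAME))
    print(to_ascii_lossy(NAME))
    raw = NAME.encode("utf-8")
    print(roundtrip_opaque(raw) == raw)
```

The opaque round trip works for any byte sequence, including ones that are not valid UTF-8, which is the "uses them as-is" behaviour; whether something equivalent could be fitted into the existing conversion chain is a separate question.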
... considering that LO uses UTF-16 strings for everything including
file paths, perhaps the best thing would be to add a check for the "C"
locale on startup, print an error and abort.
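
A rough sketch of what such a startup check could look like, again in Python rather than in LO's own code, and with hypothetical function names (is_c_locale, check_locale_or_abort). It assumes that only the exact locale names "C" and "POSIX" should be rejected, so that "C.UTF-8" is still accepted; that choice is an assumption of this sketch, not something settled in this thread.

```python
import locale
import sys


def is_c_locale(name: str) -> bool:
    """True only for the plain "C"/"POSIX" locale, not e.g. "C.UTF-8"."""
    return name in ("C", "POSIX")


def check_locale_or_abort() -> None:
    """Adopt the environment's LC_CTYPE and abort if it is the "C" locale."""
    try:
        # "" means: take the setting from the environment (LC_ALL, LC_CTYPE, LANG).
        current = locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        # The environment names a locale the system does not provide;
        # the process then stays in the default "C" locale.
        current = "C"
    if is_c_locale(current):
        print(
            'error: running in the "C" locale; non-ASCII file names '
            "cannot be represented. Please set a UTF-8 locale.",
            file=sys.stderr,
        )
        sys.exit(1)
```

An application would call check_locale_or_abort() once, early in startup, before touching any file names.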