Re: (new) non-ASCII filenames break unit tests on Linux

On 03/12/2023 12:59, Stephan Bergmann wrote:
On 12/2/23 16:38, Mike Kaganski wrote:
On 02.12.2023 17:46, Rene Engelhard wrote:
In any case this is bad. My filesystem (I think from 2020 or so) apparently shows it (ls -l does), but I wouldn't be sure about other, older ones (like Debian's build machines). The locale this fails under definitely is UTF-8, though.

Pre <https://git.libreoffice.org/core/+/fbf025b4903bfcb93c3d4bbf1ebbf860cf11618d%5E%21> "Make testHybridPDFFile Windows-only, and filenames in repo ASCII-only", I can reproduce the failure on Linux when not using a UTF-8 locale but explicitly specifying e.g. an ASCII locale (and thus an osl_getThreadTextEncoding value of RTL_TEXTENCODING_ASCII_US) with `LC_CTYPE=C make -O CppunitTest_filter_textfilterdetect CPPUNIT_TEST_NAME=testHybridPDFFile::TestBody`.

But if someone has an idea why LibreOffice fails to handle files that exist on the system, with names representable in the system encoding, it would be nice to hear it.

For better or worse, the payload of LO "internal" file URLs is always considered to be a UTF-8 encoding of the actual system pathname.  It is *not* a byte-for-byte representation of the bytes that make up the Unix system pathname.
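To illustrate the difference between the UTF-8 URL payload and the on-disk bytes, here is a small Python sketch. It is not LibreOffice code; the ISO-8859-5 system encoding is purely an assumed example, and the short name "абв" is used only because all of its letters exist in that encoding.

```python
from urllib.parse import quote, unquote

name = "абв"  # Cyrillic part of the test file name

# A file URL payload carries the percent-encoded UTF-8 bytes of the name.
# quote() encodes as UTF-8 by default.
url_payload = quote(name, safe="")

# Decoding the payload as UTF-8 recovers the Unicode name.
decoded_name = unquote(url_payload, encoding="utf-8")

# The bytes of the actual pathname depend on the system encoding.
# ISO-8859-5 is an assumed example of a non-UTF-8 system encoding.
utf8_bytes = name.encode("utf-8")
iso_bytes = name.encode("iso8859_5")

print(url_payload)  # %D0%B0%D0%B1%D0%B2
print(utf8_bytes)   # b'\xd0\xb0\xd0\xb1\xd0\xb2'
print(iso_bytes)    # b'\xd0\xd1\xd2'
```

On such a system the URL still contains the six UTF-8 bytes, while the pathname on disk is three bytes long, so the URL cannot be treated as a byte-for-byte copy of the pathname.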

What thus happens here is that the file UCP's TaskManager::getv -> osl::DirectoryItem::get -> osl_getDirectoryItem -> osl::detail::convertUrlToPathname -> getSystemPathFromFileUrl -> decodeFromUtf8 -> convert -> UnicodeToTextConverter_Impl::convert -> rtl_convertUnicodeToText tries to translate the Unicode chars of "hybrid_writer_абв_αβγ.pdf" to osl_getThreadTextEncoding() == RTL_TEXTENCODING_ASCII_US, which doesn't work because ASCII has no representation of the Cyrillic and Greek letters.
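The failing last step of that chain can be mimicked in Python. This is only an analogy: str.encode stands in for rtl_convertUnicodeToText, and to_system_pathname is a hypothetical helper name, not a LibreOffice function.

```python
name = "hybrid_writer_абв_αβγ.pdf"

def to_system_pathname(unicode_name: str, encoding: str) -> bytes:
    """Hypothetical stand-in for converting a decoded URL payload
    into a pathname in the thread text encoding (strict mode)."""
    return unicode_name.encode(encoding)  # raises on unmappable characters

# With a UTF-8 thread encoding the conversion succeeds.
utf8_path = to_system_pathname(name, "utf-8")

# With an ASCII thread encoding (as under LC_CTYPE=C) it cannot succeed,
# because ASCII has no Cyrillic or Greek letters.
try:
    to_system_pathname(name, "ascii")
    ascii_failed = False
except UnicodeEncodeError as exc:
    ascii_failed = True
    first_bad_index = exc.start  # index of the first unmappable character

print(ascii_failed)           # True
print(name[first_bad_index])  # а
```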

In the "C" locale, every 8-bit value is valid, but only ASCII (<128) values are meaningful; the intent is that the application does not interpret file names but uses them as-is, and replacing characters with '?' (as apparently happens here) looks wrong to me.

Probably there isn't yet an RTL_TEXTENCODING_C that behaves like this.
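As a sketch of the difference between '?' replacement and as-is handling, Python's "surrogateescape" error handler (PEP 383) shows one way a byte-preserving conversion can behave. This is an analogy only and not something LibreOffice's rtl text encodings are known to offer.

```python
# Bytes of the file name as they might be stored on disk.
raw = "hybrid_writer_абв_αβγ.pdf".encode("utf-8")

# Lossy handling: bytes >= 128 cannot be interpreted as ASCII and end up as '?'.
lossy_text = raw.decode("ascii", errors="replace")
lossy = lossy_text.encode("ascii", errors="replace")

# Byte-preserving handling: uninterpretable bytes are smuggled through as
# lone surrogates and restored unchanged on the way back.
opaque_text = raw.decode("ascii", errors="surrogateescape")
roundtrip = opaque_text.encode("ascii", errors="surrogateescape")

print(lossy == raw)      # False
print(roundtrip == raw)  # True
```

With the lossy variant the resulting name no longer refers to the file on disk; with the byte-preserving variant the application never needs to understand the non-ASCII bytes in order to open the file.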



