Re: vfat: Broken case-insensitive support for UTF-8

Pali Rohár <pali.rohar@xxxxxxxxx> · Mon, 27 Jan 2020 00:08:38 +0100

On Tuesday 21 January 2020 21:34:05 Pali Rohár wrote:
> On Tuesday 21 January 2020 00:07:01 Al Viro wrote:
> > On Tue, Jan 21, 2020 at 12:57:45AM +0100, Pali Rohár wrote:
> > > On Monday 20 January 2020 22:46:25 Al Viro wrote:
> > > > On Mon, Jan 20, 2020 at 10:40:46PM +0100, Pali Rohár wrote:
> > > > 
> > > > > Ok, I did some research. It took me longer than I thought, as a lot of
> > > > > stuff is undocumented and it is hard to find all the relevant information.
> > > > > 
> > > > > So... fastfat.sys is using ntos function RtlUpcaseUnicodeString() which
> > > > > takes UTF-16 string and returns upper case UTF-16 string. There is no
> > > > > mapping table in fastfat.sys driver itself.
> > > > 
> > > > Er...  Surely it's OK to just tabulate that function on 65536 values
> > > > and see how could that be packed into something more compact?
> > > 
> > > It is OK, but too complicated. That function is in nt kernel. So you
> > > need to build a new kernel module and also decide where to put output of
> > > that function. It is a long time since I did some nt kernel hacking and
> > > nowadays you need to download 10GB+ of Visual Studio code, then addons
> > > for building kernel modules, figure out how to write and compile simple
> > > kernel module via Visual Studio, write ini install file, try to load it
> > > and then you even fail as recent Windows kernels refuse to load kernel
> > > modules which are not signed...
> > 
> > Wait a sec...  From NT userland, on a mounted VFAT:
> > 	for all s in single-codepoint strings
> > 		open s for append
> > 		if failed
> > 			print s on stderr, along with error value
> > 		write s to the opened file, adding to its tail
> > 		close the file
> > then for each equivalence class you'll get a single file, with all
> > members of that class written to it.  In addition you'll get the
> > list of prohibited codepoints.
> > 
> > Why bother with any kind of kernel modules?  IDGI...
> 
> This is a great idea to get FAT equivalence classes. Thank you!
> 
> Now I quickly tried it... and it failed. FAT has a restriction on the number
> of files in a directory, so I would have to do it in a more clever way,
> e.g. prepare N directories and then try to create/open a file for each
> single-codepoint string in every directory until it succeeds, or fails in
> every one.

Now I have run the test with more directories and it finally passed. I ran
it on WinXP with different configurations and the results are interesting...
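
For readers who want to reproduce this, the multi-directory probe could
look roughly like the sketch below. It is not the test program used for
the results in this mail; the function name probe_classes, the n_dirs
parameter and the d0..dN directory names are made up for illustration.
Each codepoint is first appended to a file that the filesystem already
resolves under that name, and only if no such file exists is a new file
created in the first directory that accepts it.

```python
import os
import tempfile

def probe_classes(root, codepoints, n_dirs=4):
    """Group codepoints by the file their one-character name resolves to."""
    dirs = [os.path.join(root, f"d{i}") for i in range(n_dirs)]
    for d in dirs:
        os.makedirs(d, exist_ok=True)

    failed = {}  # codepoint -> last error seen for that name
    for cp in codepoints:
        fd, last_error = None, None
        # Pass 1: append to a file the filesystem already treats as this name.
        # Pass 2: otherwise create it in the first directory that has room.
        for flags in (os.O_WRONLY | os.O_APPEND,
                      os.O_WRONLY | os.O_APPEND | os.O_CREAT | os.O_EXCL):
            for d in dirs:
                try:
                    fd = os.open(os.path.join(d, chr(cp)), flags)
                    break
                except (OSError, ValueError) as exc:
                    last_error = exc
            if fd is not None:
                break
        if fd is None:
            failed[cp] = last_error  # name refused in every directory
            continue
        with os.fdopen(fd, "wb") as f:
            f.write(f"{cp:#06x}\n".encode("ascii"))

    # Every file now lists the members of one equivalence class.
    classes = []
    for d in dirs:
        for entry in os.scandir(d):
            with open(entry.path, "rb") as f:
                tokens = f.read().decode("ascii").split()
            classes.append(sorted(int(tok, 16) for tok in tokens))
    return sorted(classes), failed

# Demo on a scratch directory; for a real run, root would be a directory
# on the mounted FAT volume and codepoints would cover the whole range.
with tempfile.TemporaryDirectory() as root:
    letters = list(range(0x41, 0x5B)) + list(range(0x61, 0x7B))
    classes, failed = probe_classes(root, letters)
print(len(classes), "classes;", len(failed), "names refused")
```

On a case-sensitive filesystem every letter ends up in its own file, so
the demo only checks the mechanics; the interesting grouping appears only
when root is on the volume under test.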

First important thing: the DOS OEM codepage is implicitly configured by
the option "Language for non-Unicode programs" found in "Regional and
Language Options" on the "Advanced" tab (run: intl.cpl). It is *not*
affected by the "Standards and formats" language and also *not* by the
"Location" language. The description for "Language for non-Unicode
programs" says: "It does not affect Unicode programs", which is clearly
not true, as it affects all Unicode programs which store data on a FAT fs.

Second thing: equivalence classes depend on the OEM codepage, and they
differ between codepages. Note that some languages share one codepage.

CP850 (languages: English UK, Afrikaans, ...) has 614 non-trivial (*)
equivalence classes, CP852 (Slavic languages) has 619 and CP437 (English
USA) has only 586.

The biggest equivalence class is the one for 'U'; it has the following elements:

CP437:
0x0055 0x0075 0x00d9 0x00da 0x00db 0x00f9 0x00fa 0x00fb 0x0168 0x0169
0x016a 0x016b 0x016c 0x016d 0x016e 0x016f 0x0170 0x0171 0x0172 0x0173
0x01af 0x01b0 0x01d3 0x01d4 0x01d5 0x01d6 0x01d7 0x01d8 0x01d9 0x01da
0x01db 0x01dc 0xff35 0xff55

CP852:
0x0055 0x0075 0x00b5 0x00d9 0x00db 0x00f9 0x00fb 0x0168 0x0169 0x016a
0x016b 0x016c 0x016d 0x0172 0x0173 0x01af 0x01b0 0x01d3 0x01d4 0x01d5
0x01d6 0x01d7 0x01d8 0x01d9 0x01da 0x01db 0x01dc 0x03bc 0xff35 0xff55

CP850:
0x0055 0x0075 0x0168 0x0169 0x016a 0x016b 0x016c 0x016d 0x016e 0x016f
0x0170 0x0171 0x0172 0x0173 0x01af 0x01b0 0x01d3 0x01d4 0x01d5 0x01d6
0x01d7 0x01d8 0x01d9 0x01da 0x01db 0x01dc 0xff35 0xff55

Note that the elements are Unicode code points.

It is interesting that for English USA (CP437) "U" and "Ù" are in the
same equivalence class, but for English UK (CP850) "U" and "Ù" are in
different classes. CP850 has "Ù" in a two-member class: 0x00d9 0x00f9

Are there any cultural, regional or linguistic reasons why English USA
and English UK languages/regions should treat "Ù" differently?
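
The differences between the three 'U' classes are easier to see as set
operations. The snippet below only re-uses the code points listed in
this mail; the variable names are made up for illustration.

```python
import unicodedata

# The 'U' equivalence class per OEM codepage, copied from the lists above.
U_CLASS_HEX = {
    "CP437": """
        0x0055 0x0075 0x00d9 0x00da 0x00db 0x00f9 0x00fa 0x00fb 0x0168 0x0169
        0x016a 0x016b 0x016c 0x016d 0x016e 0x016f 0x0170 0x0171 0x0172 0x0173
        0x01af 0x01b0 0x01d3 0x01d4 0x01d5 0x01d6 0x01d7 0x01d8 0x01d9 0x01da
        0x01db 0x01dc 0xff35 0xff55
    """,
    "CP852": """
        0x0055 0x0075 0x00b5 0x00d9 0x00db 0x00f9 0x00fb 0x0168 0x0169 0x016a
        0x016b 0x016c 0x016d 0x0172 0x0173 0x01af 0x01b0 0x01d3 0x01d4 0x01d5
        0x01d6 0x01d7 0x01d8 0x01d9 0x01da 0x01db 0x01dc 0x03bc 0xff35 0xff55
    """,
    "CP850": """
        0x0055 0x0075 0x0168 0x0169 0x016a 0x016b 0x016c 0x016d 0x016e 0x016f
        0x0170 0x0171 0x0172 0x0173 0x01af 0x01b0 0x01d3 0x01d4 0x01d5 0x01d6
        0x01d7 0x01d8 0x01d9 0x01da 0x01db 0x01dc 0xff35 0xff55
    """,
}
U_CLASS = {name: {int(tok, 16) for tok in text.split()}
           for name, text in U_CLASS_HEX.items()}

def describe(cps):
    """Format code points as 'hex NAME' lines for printing."""
    return [f"{cp:#06x} {unicodedata.name(chr(cp), '?')}" for cp in sorted(cps)]

# Code points that fold to 'U' under every one of the three codepages.
in_all = set.intersection(*U_CLASS.values())
# Code points that join 'U' on CP437 but not on CP850.
only_437_vs_850 = U_CLASS["CP437"] - U_CLASS["CP850"]
# Code points that join 'U' only on CP852.
only_852 = U_CLASS["CP852"] - U_CLASS["CP437"] - U_CLASS["CP850"]

for title, cps in (("in all three", in_all),
                   ("CP437 but not CP850", only_437_vs_850),
                   ("CP852 only", only_852)):
    print(title)
    for line in describe(cps):
        print("   ", line)
```

The "CP437 but not CP850" group is U with grave, acute and circumflex in
both cases, and the "CP852 only" group is the micro sign plus Greek small
mu.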

So the third thing: how should we handle this complicated situation in
our VFAT implementation in the Linux kernel when using UTF-8 encoding for
userspace?

For fixing case-insensitivity for UTF-8 I see the following options:

Option 1) Create the intersection of equivalence classes from all
codepages and use it for the Linux VFAT uppercase function. This would
ensure that whatever codepage/language Windows uses, Linux VFAT does not
create files that are inaccessible to Windows (see PPS).

Option 2) As equivalence classes depend on the codepage, and VFAT already
needs to know the codepage when mounting/accessing shortnames, we can
calculate a "common" uppercase table (which would be the same for all
codepages, ideally the one from option 1)) and then the differences
between the "common" uppercase table and each codepage's equivalence
classes. The kernel already has uppercase tables for NLS codepages, so we
can store these "differences" in them. In this case VFAT would know the
uppercase function for the specified codepage (which is already passed
as a mount param).

Option 3) Ignore this MS shit nonsense (see PPS for how it is broken) and
define the uppercase table from the Unicode standard. This would be the
most expected behavior for userspace, but incompatible with the MS FAT32
implementation.

Option 4) Use the uppercase table from the Unicode standard (as in
option 3), but also add the definitions from option 1). This would ensure
that all files created by VFAT would be accessible on any Windows system
(see PPS), plus there would be uppercase definitions from the Unicode
standard (but only those which do not break the definitions from 1) with
respect to the PPS).

Option 5) Create a kernel <---> userspace API which would allow
userspace to define the mapping table (or equivalence classes), and so
move this problem from the kernel to userspace. But as we already
discussed, this is hard; plus, without proper configuration from
userspace, the kernel's VFAT driver could modify the FS in a way that MS
would not be able to use it.
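
To make options 1) and 2) a bit more concrete, here is a rough sketch on
a tiny subset of the data: only U, u, Ù and ù, grouped the way CP437 and
CP850 group them according to the results above. It takes
"intersection" literally (two code points stay together only if every
codepage keeps them together); whether that literal reading is exactly
what option 1) needs is not settled by this sketch. The names
intersect_partitions and extra_classes are hypothetical, and nothing here
reflects how the kernel NLS tables are actually laid out.

```python
from collections import defaultdict

# How each codepage groups U, u, Ù, ù (subset of the measured classes).
CLASSES = {
    "cp437": [{0x0055, 0x0075, 0x00D9, 0x00F9}],
    "cp850": [{0x0055, 0x0075}, {0x00D9, 0x00F9}],
}

def class_index(classes):
    """Map each code point to the index of the class that contains it."""
    return {cp: i for i, cls in enumerate(classes) for cp in cls}

def intersect_partitions(partitions):
    """Keep two code points together only if every partition does."""
    indexes = [class_index(p) for p in partitions]
    universe = set.intersection(*(set(ix) for ix in indexes))
    groups = defaultdict(set)
    for cp in universe:
        # Code points with the same class in every partition share a key.
        groups[tuple(ix[cp] for ix in indexes)].add(cp)
    return sorted(sorted(g) for g in groups.values())

def extra_classes(classes, common):
    """Option 2 idea: classes of one codepage that the common table lacks."""
    return sorted(sorted(c) for c in classes if sorted(c) not in common)

common = intersect_partitions(list(CLASSES.values()))
differences = {name: extra_classes(cls, common) for name, cls in CLASSES.items()}

print("common:", [[hex(cp) for cp in g] for g in common])
for name, diff in differences.items():
    print(name, "extra:", [[hex(cp) for cp in g] for g in diff])
```

With this input the common table keeps {U, u} and {Ù, ù} apart, as CP850
does, and the only per-codepage difference is the CP437 class that merges
all four.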

Or do you have a better idea how to handle this problem?

(*) - with more than one element

PS: If somebody is interested, I can share my whole results and the
source code of the testing application.

PPS: If you create two files "U" and "Ù" on English UK (you can do that
as these code points are in different equivalence classes) and then
connect this FAT32 fs on English USA, you will not be able to access the
"Ù" file. Windows English USA lists both files "U" and "Ù", but whichever
you open, Windows always gives you the content of file "U". "Ù" is
therefore inaccessible until you change the language to English UK.

-- 
Pali Rohár
pali.rohar@xxxxxxxxx