Re: vfat: Broken case-insensitive support for UTF-8

On Monday 20 January 2020 12:32:15 Theodore Y. Ts'o wrote:
> On Mon, Jan 20, 2020 at 01:04:42PM +0900, OGAWA Hirofumi wrote:
> > 
> > To be perfect, the table would have to emulate what Windows uses. It can
> > be the Unicode standard, or something else. And other filesystems can use
> > something different from what Windows uses.
> 
> The big question is *which* version of Windows.  vfat has been in use
> for over two decades, and vfat predates Windows starting to use Unicode
> in 2001.  Before that, vfat would have been using whatever code page
> its local Windows installation was set to use; and I'm not sure if
> there was space in the FAT headers to indicate the codepage in use.

VFAT is an extension to FAT which stores file names in UTF-16. In the
original FAT without the VFAT extension (in all variants: FAT12, FAT16
and FAT32) the file name is stored "according to the current 8bit OEM
code page". A VFAT-aware FAT implementation knows whether a particular
filename is really VFAT (UTF-16) or non-VFAT (8bit OEM code page). There
are flags in FAT which indicate whether an entry is VFAT (UTF-16).
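
As a rough illustration of that flag check, here is a minimal Python
sketch. It assumes the commonly documented on-disk layout: 32-byte
directory entries, the attribute byte at offset 11, and the value 0x0F
marking a long-name (VFAT) slot whose 13 UTF-16 code units sit at
offsets 1-10, 14-25 and 28-31. The helper names (is_vfat_entry,
vfat_name_units, make_slot) are made up for this example and are not
taken from any driver.

```python
import struct

ATTR_LONG_NAME = 0x0F       # READ_ONLY | HIDDEN | SYSTEM | VOLUME_ID
ATTR_LONG_NAME_MASK = 0x3F  # attribute bits that take part in the check


def is_vfat_entry(entry: bytes) -> bool:
    """Return True if this 32-byte directory entry is a VFAT long-name slot."""
    if len(entry) != 32:
        raise ValueError("a FAT directory entry is exactly 32 bytes")
    return (entry[11] & ATTR_LONG_NAME_MASK) == ATTR_LONG_NAME


def vfat_name_units(entry: bytes) -> list:
    """Return the raw uint16 name units of one long-name slot (no validation)."""
    raw = entry[1:11] + entry[14:26] + entry[28:32]  # 5 + 6 + 2 code units
    units = list(struct.unpack("<13H", raw))
    if 0x0000 in units:                               # 0x0000 ends the name,
        units = units[:units.index(0x0000)]           # 0xFFFF padding follows
    return units


def make_slot(seq: int, name: str, checksum: int = 0) -> bytes:
    """Build a long-name slot for testing (hypothetical helper, BMP chars only)."""
    units = [ord(c) for c in name] + [0x0000]
    units += [0xFFFF] * (13 - len(units))
    raw = struct.pack("<13H", *units[:13])
    return (bytes([seq]) + raw[0:10] + bytes([ATTR_LONG_NAME, 0x00, checksum])
            + raw[10:22] + b"\x00\x00" + raw[22:26])


# A classic 8.3 entry: 11 name bytes in the OEM code page, attribute 0x20.
short_entry = b"README  TXT" + bytes([0x20]) + bytes(20)
long_slot = make_slot(0x41, "readme.txt")

print(is_vfat_entry(short_entry), is_vfat_entry(long_slot))
```

The point of the sketch is only that the decision "UTF-16 or OEM code
page" is made per directory entry, from the entry itself, and never
from a volume-wide header field.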

And no, there are no bits in the FAT header which specify the OEM code
page. So if you use "mode con" or "chcp" (or whatever those MS-DOS
commands for changing the OEM codepage were), all non-VFAT filenames
would change on the next read of the FAT directory.

But because every OEM code page is full 8bit, you always get valid data.
You would just see that your file name is different :D
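
To make that concrete, a small Python sketch that decodes the same raw
name bytes under several OEM code pages. The byte values and the list
of code pages are arbitrary choices for illustration; the cp437, cp850,
cp852 and cp866 codecs are the ones shipped with Python, which may
differ in small details from what a particular DOS or Windows version
used.

```python
# OEM code pages chosen for illustration only.
OEM_CODE_PAGES = ["cp437", "cp850", "cp852", "cp866"]


def decode_short_name(raw: bytes, codepage: str) -> str:
    """Decode the bytes of a non-VFAT name using the given OEM code page."""
    return raw.decode(codepage)


# Hypothetical name bytes with the upper bit set, as MS-DOS might have
# written them under some code page we no longer remember.
raw_name = bytes([0x9B, 0x9D, 0x81])

for cp in OEM_CODE_PAGES:
    # ascii() keeps the output printable on any terminal encoding.
    print(cp, ascii(decode_short_name(raw_name, cp)))

# Every one of the 256 byte values decodes under each of these code
# pages, so there is never a decoding error, only a different-looking name.
all_bytes = bytes(range(256))
for cp in OEM_CODE_PAGES:
    assert len(all_bytes.decode(cp)) == 256
```

Nothing on disk changes between the runs of the loop; only the
interpretation of the same bytes does.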

> It would be entertaining for someone with ancient versions of Windows
> 9x to create some floppy images using codepage 437 and 450, and then
> see what a modern Windows system does with those VFAT images --- would

Hehe :-) I did it as part of my investigation into how the FAT volume
label is stored and how different tools read it. The FAT label is *not*
stored as UTF-16 but only in that OEM code page, like old filenames on
MS-DOS:
https://www.spinics.net/lists/kernel/msg2640891.html

And what do recent Windows versions do? They decode such filenames (and
therefore also the volume label) via the OEM codepage which belongs to
the current system Language settings. You cannot change the OEM codepage
directly on recent Windows. You can only change the Regional Language
(which then changes the OEM codepage which belongs to it).

The mapping table between Windows Regional Language and OEM codepage is
in the (still unreleased) fatlabel(8) manpage, section DOS CODEPAGES, here:
https://github.com/dosfstools/dosfstools/blob/master/manpages/fatlabel.8.in

> it break horribly when it tries to interpret them as UTF-16?  Or would

As Windows knows that the filename is stored as 8bit and not UTF-16,
nothing is broken. Just for characters with the upper bit set you
probably do not see the filenames as you saw them in MS-DOS.

But if you remember which OEM code page you used on MS-DOS, you can
change Windows Language to one which uses your OEM code page and then
you can read that old FAT fs without any broken file names.

> it figure it out?  And if so, how?  Inquiring minds want to know....
> 
> Bonus points if the lack of forwards compatibility causes older
> versions of Windows to Blue Screen.  :-)

I have not got any Blue Screens while reading these older FAT
filesystems created and used by MS-DOS.

On Linux it is easier: just specify the -o codepage= mount option and
vfat.ko translates it correctly.

> 
>       	     	   	  		   	- Ted
> 
> P.S.  And of course, then there's the question of how do older
> versions of Windows handle versions of Unicode which postdate the
> release date of that particular version of Windows?  After all,

This is not a problem. Windows allows you to store an arbitrary sequence
of uint16[] into a filename (except disallowed MS-DOS chars like
:?<>...). And when doing a read directory operation you need to expect
that it will return an arbitrary sequence of uint16[].

Windows does not care about valid/invalid/assigned/unassigned code
points. It does not even care about halves of surrogate pairs. So it can
also store one half of an (unpaired) surrogate pair (one uint16).
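
For anyone who wants to see what this means for a reader of such
names, a minimal Python sketch follows. The three-unit name is a
made-up example ('a', a lone high surrogate, 'b'), and the helper
names are hypothetical. A strict UTF-16 decoder rejects the sequence,
while Python's "surrogatepass" error handler keeps the lone unit so the
original uint16[] can be reproduced bit for bit.

```python
import struct


def units_to_bytes(units) -> bytes:
    """Pack a sequence of uint16 values as little-endian bytes."""
    return struct.pack("<%dH" % len(units), *units)


def decode_strict(units) -> str:
    """Decode as well-formed UTF-16; raises on an unpaired surrogate."""
    return units_to_bytes(units).decode("utf-16-le")


def decode_lossless(units) -> str:
    """Decode while keeping unpaired surrogates as-is."""
    return units_to_bytes(units).decode("utf-16-le", errors="surrogatepass")


# Hypothetical filename as stored on disk: 'a', lone high surrogate, 'b'.
name = [0x0061, 0xD83D, 0x0062]

try:
    decode_strict(name)
except UnicodeDecodeError as exc:
    print("strict decoder rejects it:", exc.reason)

kept = decode_lossless(name)
print(ascii(kept))  # ascii() avoids printing the lone surrogate directly

# Round trip: the exact uint16 sequence comes back out.
assert kept.encode("utf-16-le", errors="surrogatepass") == units_to_bytes(name)
```

So code that lists a directory should treat names as opaque 16-bit
sequences and not assume they are valid UTF-16.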

> Unicode adds new code points with potential revisions to the case
> folding table every 6-12 months.  (The most recent version of Unicode
> was released in April 2019 to accommodate the new Japanese kanji
> character "Rei" for the current era name with the elevation of the new
> current reigning emperor of Japan.)

-- 
Pali Rohár
pali.rohar@xxxxxxxxx
