Re: [RFC v2] Unicode/UTF-8 support for XFS

Olaf Weber <olaf@xxxxxxx> · Fri, 26 Sep 2014 16:06:22 +0200

On 24-09-14 13:07, Olaf Weber wrote:
> On 23-09-14 22:15, Andi Kleen wrote:
>
>>> A big part of the table does decompositions for Korean: eliminating
>>> the Hangul decompositions removes 156320 bytes, leaving 89936 bytes.
>>
>> Are there regular ranges or other redundancies in the Korean encoding
>> that could be used to compress paths?
>
> Yes, though at the expense of more complicated code and interfaces. In
> particular, lookups that want a normalized string would need to provide a
> 10-byte buffer to store it in.

I spent some time working on this, and the effect on the lookup code isn't 
as bad as I'd thought. The updated code should be posted early next week.
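
For reference, the regularity in question: the 11172 precomposed Hangul
syllables (U+AC00..U+D7A3) decompose arithmetically into two or three jamo,
so they need no table entries. Each jamo is three bytes in UTF-8, which
presumably accounts for the 10-byte buffer mentioned above (9 bytes plus a
terminating NUL). A minimal sketch in C, using the constants from the
Unicode Standard; the names hangul_decompose, put_utf8_3 and HANGUL_BUF_LEN
are made up for illustration and are not taken from the patch set:

```c
#include <stddef.h>
#include <stdint.h>

/* Constants from the Unicode Standard's Hangul syllable decomposition. */
enum {
	S_BASE  = 0xAC00,		/* first precomposed syllable */
	L_BASE  = 0x1100,		/* first leading consonant */
	V_BASE  = 0x1161,		/* first vowel */
	T_BASE  = 0x11A7,		/* one before the first trailing consonant */
	V_COUNT = 21,
	T_COUNT = 28,
	N_COUNT = V_COUNT * T_COUNT,	/* 588 */
	S_COUNT = 19 * N_COUNT		/* 11172 */
};

/* Illustrative name: 3 jamo x 3 UTF-8 bytes + NUL. */
#define HANGUL_BUF_LEN 10

/* Encode a code point in U+0800..U+FFFF as three UTF-8 bytes. */
static unsigned char *put_utf8_3(unsigned char *p, uint32_t cp)
{
	*p++ = (unsigned char)(0xE0 | (cp >> 12));
	*p++ = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
	*p++ = (unsigned char)(0x80 | (cp & 0x3F));
	return p;
}

/*
 * Write the NUL-terminated UTF-8 decomposition of a Hangul syllable
 * into buf. Returns the number of bytes written, not counting the
 * NUL, or 0 if cp is not a precomposed Hangul syllable.
 */
size_t hangul_decompose(uint32_t cp, unsigned char buf[HANGUL_BUF_LEN])
{
	unsigned char *p = buf;
	uint32_t si;

	if (cp < S_BASE || cp >= S_BASE + S_COUNT)
		return 0;

	si = cp - S_BASE;
	p = put_utf8_3(p, L_BASE + si / N_COUNT);
	p = put_utf8_3(p, V_BASE + (si % N_COUNT) / T_COUNT);
	if (si % T_COUNT != 0)	/* index 0 means "no trailing consonant" */
		p = put_utf8_3(p, T_BASE + si % T_COUNT);
	*p = '\0';

	return (size_t)(p - buf);
}
```

A caller would hand in its own buffer, e.g. "unsigned char buf[HANGUL_BUF_LEN];
hangul_decompose(0xD55C, buf);", which is the interface cost referred to
above: the result can no longer be a pointer into a static table.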

With this change, the table size for the full trie becomes 89952 bytes. Of
this, 66400 bytes are spent on NFKD + Ignorables, and an additional 20992
bytes on NFKD + Ignorables + Case Fold. The remainder, 2560 bytes, is
additional info for older Unicode versions.

Note that the NFKD + Ignorables + Case Fold trie forwards to the NFKD +
Ignorables trie where they overlap. A stand-alone version would be 71750 bytes.

As noted before, these tables also contain the Canonical Combining Class and
Unicode version information for the code points. The latter allows for
supporting multiple Unicode versions using a single combined table.
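
To illustrate the idea of a per-code-point version, here is a rough sketch
in C of how a lookup could be gated on the requested version: a code point
first assigned in a later version than the one asked for is treated as
unassigned. The leaf layout, the UNICODE_AGE packing and the function name
are assumptions for the example, not the actual table format, and the
sketch does not cover code points whose properties changed between versions.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack a version triple so that versions compare as plain integers. */
#define UNICODE_AGE(maj, min, rev) \
	(((uint32_t)(maj) << 16) | ((uint32_t)(min) << 8) | (uint32_t)(rev))

/* Hypothetical leaf: what the trie stores for one code point. */
struct leaf {
	uint32_t age;		/* version that first assigned the code point */
	uint8_t ccc;		/* Canonical Combining Class */
	const char *decomp;	/* normalized form, NULL if it maps to itself */
};

/*
 * Return the leaf as seen by the given Unicode version. Code points
 * that are missing, or assigned only in a later version, show up as
 * unassigned: CCC 0 and no decomposition.
 */
const struct leaf *leaf_for_version(const struct leaf *leaf, uint32_t version)
{
	static const struct leaf unassigned = { 0, 0, NULL };

	if (leaf == NULL || leaf->age > version)
		return &unassigned;
	return leaf;
}
```

With a single table built from the newest data, a filesystem that recorded
an older version at mkfs time would pass that version on every lookup and
get stable results.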

Olaf

--
Olaf Weber                 SGI               Phone:  +31(0)30-6696796
                           Veldzigt 2b       Fax:    +31(0)30-6696799
Technical Lead             3454 PW de Meern  Vnet:   955-6796
Storage Software           The Netherlands   Email:  olaf@xxxxxxx
