Hi Roger,

On Tue, Sep 23, 2014 at 07:34:19PM +0100, Roger Willcocks wrote:
> On Fri, 2014-09-19 at 11:06 -0500, Ben Myers wrote:
> > +#define AGE_NAME "DerivedAge.txt"
> > +#define CCC_NAME "DerivedCombiningClass.txt"
> > +#define PROP_NAME "DerivedCoreProperties.txt"
> > +#define DATA_NAME "UnicodeData.txt"
> > +#define FOLD_NAME "CaseFolding.txt"
> > +#define NORM_NAME "NormalizationCorrections.txt"
> > +#define TEST_NAME "NormalizationTest.txt"
>
> Is there a reason why you're using multiple text-based data files (and
> hand-parsing them) when there's an xml formatted flat file available ?
>
> http://www.unicode.org/Public/UCD/latest/ucdxml/

The UCD files being parsed are the authoritative source. Check out
ucdxml.readme.txt.

> And a 2nd question - why does the trie need to encode "the unicode
> version in which the codepoint was assigned an interpretation" ?

You need to know whether a given code point is assigned in the version of
Unicode you're normalizing for. Unicode 8 is supposed to be released in
June/July 2015 (see http://www.unicode.org/versions/), but filesystems you
created this year will still need the version 7 normalization.

There is still some plumbing to do to pass the version along with the
string for normalization. I think you bring up a good point, but we'll
need to support multiple versions in the long run.

Regards,
Ben
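To make the version check above concrete, here is a minimal sketch in C. It
is not the code from the patch: the names UNICODE_AGE, age_table,
codepoint_age and assigned_in are hypothetical, and a small linear table
stands in for the trie. It assumes the age is stored as a packed
major/minor number and that a code point counts as assigned for a target
version when its age is at or below that version.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack a version as (major << 8) | minor so versions compare as integers. */
#define UNICODE_AGE(maj, min) (((unsigned int)(maj) << 8) | (unsigned int)(min))

/* Hypothetical stand-in for the trie: code point ranges and their age. */
struct age_range {
	uint32_t first;
	uint32_t last;
	unsigned int age;	/* version in which the range was assigned */
};

/* Illustrative entries only; a real table would come from DerivedAge.txt. */
static const struct age_range age_table[] = {
	{ 0x0041, 0x005A, UNICODE_AGE(1, 1) },	/* LATIN CAPITAL LETTER A..Z */
	{ 0x20BA, 0x20BA, UNICODE_AGE(6, 2) },	/* TURKISH LIRA SIGN */
	{ 0x20BD, 0x20BD, UNICODE_AGE(7, 0) },	/* RUBLE SIGN */
};

/* Return the age of cp, or 0 if the table has no entry for it. */
unsigned int codepoint_age(uint32_t cp)
{
	size_t i;

	for (i = 0; i < sizeof(age_table) / sizeof(age_table[0]); i++) {
		if (cp >= age_table[i].first && cp <= age_table[i].last)
			return age_table[i].age;
	}
	return 0;
}

/* Nonzero if cp was already assigned in the target Unicode version. */
int assigned_in(uint32_t cp, unsigned int target)
{
	unsigned int age;

	assert(cp <= 0x10FFFF);	/* code points end at U+10FFFF */
	age = codepoint_age(cp);
	return age != 0 && age <= target;
}
```

With this table, assigned_in(0x20BD, UNICODE_AGE(7, 0)) is nonzero, while
assigned_in(0x20BD, UNICODE_AGE(6, 3)) is zero, so a filesystem pinned to an
older version would treat that code point as unassigned.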