Hello all,

for providing proper English stemming support in searches (tsearch2 on PostgreSQL 8.3), tsearch needs the British/American myspell dictionaries. However, this system currently seems to be very inconvenient to packagers like me (I'm responsible for the Debian and Ubuntu packages of PostgreSQL), who would like to provide a good out-of-the-box experience.

Ideally, installing the myspell-en-gb package would automatically make tsearch2 aware of it and use the dictionary and affix rules. However, we found several problems which don't make this possible:

 - tsearch2 looks for these files in ${configure_datadir}/tsearch_data, i.e. /usr/share/postgresql/8.3/tsearch_data/ in Debian, whereas myspell dictionaries are shipped in /usr/share/myspell/dicts/. This by itself is probably fixable easily, by adding another search path to tsearch2. This would probably end up as a configure option.

 - Reportedly PostgreSQL expects those myspell files to be encoded in the server encoding. However, the server encoding can be changed at runtime, whereas the myspell files are shipped statically. Reencoding the .dic from latin1 to UTF-8 during build is possible, but first it's inconvenient, and more importantly, it is either a package maintenance nightmare (when shipping static files), or involves some dirty tricks (rebuilding the postgresql myspell files whenever one of the original myspell files changes). Also, even a reencoding doesn't change the fact that as soon as you change the server locale, you end up with broken tsearch again.

Is there any better approach to this? In my dream world, tsearch2 would look in /usr/share/myspell/dicts/, use the dictionary/affix rules there, reencode them on the fly to the server encoding, and otherwise use them as they are.

Thanks in advance for any insight!

Martin

-- 
Martin Pitt | http://www.piware.de
Ubuntu Developer (www.ubuntu.com) | Debian Developer (www.debian.org)
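
For illustration, the build-time reencoding step mentioned above (latin1 to UTF-8) could look roughly like the following. This is only a minimal sketch in Python, not what any package actually does: the function names reencode_text and reencode_file are hypothetical, and it assumes the source files really are latin1 and that the caller chooses the destination path.

```python
from pathlib import Path


def reencode_text(data: bytes, src_encoding: str = "latin-1",
                  dst_encoding: str = "utf-8") -> bytes:
    """Decode raw bytes from src_encoding and re-encode them in dst_encoding."""
    return data.decode(src_encoding).encode(dst_encoding)


def reencode_file(src: Path, dst: Path, src_encoding: str = "latin-1",
                  dst_encoding: str = "utf-8") -> None:
    """Write a re-encoded copy of src to dst (hypothetical helper).

    The source file is left untouched, so the copy has to be regenerated
    whenever the original dictionary or affix file changes.
    """
    dst.write_bytes(reencode_text(src.read_bytes(), src_encoding, dst_encoding))
```

A copy produced this way is only valid for the one target encoding chosen at build time, which is exactly the limitation described above: it does not help once the server encoding differs from that choice.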