Re: tsearch2 and hyphenated terms

Tom Lane <tgl@xxxxxxxxxxxxx> · Fri, 11 Apr 2008 12:45:32 -0400

Reece Hart <reece@xxxxxxxxx> writes:
> For the purposes of indexing these names, I suspect I'd get the majority
> of cases by removing a hyphen when it's followed by 1 or 2 chars from
> [a-zA-Z0-9]. Does that require a custom parser?

Yeah, looks like it. The default parser splits "MCL-1" into a word and a
signed integer:

regression=# select * from ts_debug('MCL1 MCL-1');
   alias   |       description        | token |  dictionaries  |  dictionary  | lexemes 
-----------+--------------------------+-------+----------------+--------------+---------
 numword   | Word, letters and digits | MCL1  | {simple}       | simple       | {mcl1}
 blank     | Space symbols            |       | {}             |              | 
 asciiword | Word, all ASCII          | MCL   | {english_stem} | english_stem | {mcl}
 int       | Signed integer           | -1    | {simple}       | simple       | {-1}
(4 rows)

I had thought you might get a "numhword" output, but that only seems to
happen if there's at least one letter after the dash:

regression=# select * from ts_debug('MCL1 MCL-X1');
      alias      |               description                | token  |  dictionaries  |  dictionary  | lexemes  
-----------------+------------------------------------------+--------+----------------+--------------+----------
 numword         | Word, letters and digits                 | MCL1   | {simple}       | simple       | {mcl1}
 blank           | Space symbols                            |        | {}             |              | 
 numhword        | Hyphenated word, letters and digits      | MCL-X1 | {simple}       | simple       | {mcl-x1}
 hword_asciipart | Hyphenated word part, all ASCII          | MCL    | {english_stem} | english_stem | {mcl}
 blank           | Space symbols                            | -      | {}             |              | 
 hword_numpart   | Hyphenated word part, letters and digits | X1     | {simple}       | simple       | {x1}
(6 rows)
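
If a custom parser is more than the problem warrants, one alternative is to
normalize the names before they reach the parser at all. Below is a minimal
Python sketch of the rule Reece describes (drop a hyphen when it is followed
by 1 or 2 characters from [a-zA-Z0-9]). It rests on two assumptions that are
not in the original question: the short suffix must end the term (the next
character is not alphanumeric), and the hyphen must directly follow an
alphanumeric, so a free-standing "-1" is left alone. The function name
normalize_hyphenated is hypothetical; nothing here is part of tsearch itself.

```python
import re

# Assumption: remove a hyphen only when it sits between an alphanumeric
# character and a 1-2 character alphanumeric suffix that ends the term.
_SHORT_SUFFIX_HYPHEN = re.compile(
    r"(?<=[A-Za-z0-9])"                   # hyphen must follow a letter or digit
    r"-"
    r"(?=[A-Za-z0-9]{1,2}(?![A-Za-z0-9]))"  # 1-2 alphanumerics, then a boundary
)


def normalize_hyphenated(text: str) -> str:
    """Return text with hyphens before short alphanumeric suffixes removed."""
    return _SHORT_SUFFIX_HYPHEN.sub("", text)


if __name__ == "__main__":
    for sample in ("MCL1 MCL-1", "MCL-X1", "well-known", "offset -1"):
        print(repr(sample), "->", repr(normalize_hyphenated(sample)))
```

With this rule "MCL-1" and "MCL-X1" become "MCL1" and "MCLX1", while
"well-known" is untouched because its suffix is longer than two characters.
The same normalization would have to be applied to query strings as well as
to the indexed text, otherwise a search for "MCL-1" would still be parsed as
"MCL" plus the integer "-1".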

			regards, tom lane