On Thu, 13 Sep 2007, Laimonas Simutis wrote:
Hey guys, maybe anyone using tsearch2 could advise on this. With the default installation, url, host and some other tokens are processed with the simple dictionary. Thus a term like mywebsite.com gets stored as 'mywebsite.com'. The parser correctly assigns the token type host to the term, but the dictionary the term then gets routed through is simple, so what gets stored is mywebsite.com. The questions are: 1) is there a dictionary available that I could use that will remove .com, .net, .org, etc.? I could write one myself, but after seeing some sample dictionary implementations and the C code, which I try to avoid, I got a bit scared.
Yes, we have dict_regex, which was developed by Sergey Karpov; see http://lynx.sao.ru/~karpov/software/postgres_dict_regex.html for details. It uses the PCRE library, and you need to know Perl regexps.
2) has anyone else dealt with this maybe in a different way?
Sure, preprocess the text using your preferred language before passing it to to_tsvector.
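
As a rough illustration of that preprocessing approach, here is a minimal Python sketch. The suffix list, the regular expression, and the function name strip_tlds are all assumptions made for this example, not part of tsearch2 or dict_regex; adjust them to whatever suffixes you actually need to drop.

```python
import re

# Hypothetical list of suffixes to strip; extend as needed.
TLD_SUFFIXES = ("com", "net", "org")

# Matches a host-like token that ends in one of the suffixes above,
# e.g. "mywebsite.com", and captures everything before the suffix.
_HOST_RE = re.compile(
    r"\b((?:[a-z0-9-]+\.)*[a-z0-9-]+)\.(?:%s)\b" % "|".join(TLD_SUFFIXES),
    re.IGNORECASE,
)


def strip_tlds(text: str) -> str:
    """Return text with a trailing .com/.net/.org removed from host-like tokens."""
    return _HOST_RE.sub(r"\1", text)


if __name__ == "__main__":
    cleaned = strip_tlds("Visit mywebsite.com for details")
    # The cleaned string is what you would then pass, as a query parameter,
    # to to_tsvector on the database side.
    print(cleaned)
```

The cleaned string can then be handed to to_tsvector as an ordinary bound parameter from your client library, so no C code or custom dictionary is involved.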
Thanks for any suggestions and help, Laimis
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@xxxxxxxxxx, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83