On Fri, 31 Oct 2008, Jodok Batlogg wrote:
hi oleg,
thanks for your quick response,
2008/10/31 Oleg Bartunov <oleg@xxxxxxxxxx>:
Jodok,
you got what you defined. Please read the documentation.
In short, a word is not indexed if it is not recognized by any
dictionary in the stack of dictionaries. Put a stemming dictionary, which
recognizes everything, at the end.
can you point me to "the" documentation where i could find that? i
think i tried hard :)
well, it's not really hard
http://www.postgresql.org/docs/8.3/static/textsearch-dictionaries.html
"A text search configuration binds a parser together with a set of
dictionaries to process the parser's output tokens. For each token type that
the parser can return, a separate list of dictionaries is specified by the
configuration. When a token of that type is found by the parser, each
dictionary in the list is consulted in turn, until some dictionary recognizes
it as a known word. If it is identified as a stop word, or if no dictionary
recognizes the token, it will be discarded and not indexed or searched for.
The general rule for configuring a list of dictionaries is to place first
the most narrow, most specific dictionary, then the more general dictionaries,
finishing with a very general dictionary, like a Snowball stemmer or simple,
which recognizes everything."
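
Applied to the configuration from the original post (quoted further down), that rule could look roughly like the sketch below. It assumes the built-in Snowball dictionary german_stem is an acceptable catch-all; simple would work the same way. Neither choice is confirmed in this thread.

```sql
-- Sketch only: keep the ispell dictionary first, then fall back to a
-- dictionary that accepts every token. german_stem is an assumption here.
ALTER TEXT SEARCH CONFIGURATION public.default
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH public.german, german_stem;

-- Re-check which dictionary now handles each token.
SELECT * FROM ts_debug('default', 'lovely und bauarbeiter/in');
```

With the catch-all in place, words the ispell dictionary does not know should reach the second dictionary instead of being discarded.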
however - problem a) is fixed. thanks :)
nevertheless i still have the problem that words with '/' are being
interpreted as file paths instead of words. any idea how i could tweak
this?
several ways:
1. use your own parser
2. use encode/decode functions that trick the default parser. For example,
encodeslash('aa/bb') -> aaxxxxxxbb. But then you should understand that a
dictionary like ispell will not be able to recognize it.
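
A minimal sketch of the second option, assuming plain SQL functions built on replace(). The name encodeslash comes from the example above; decodeslash and both function bodies are hypothetical, and the placeholder 'xxxxxx' is only safe if it never occurs in real text.

```sql
-- Hypothetical helpers: the bodies are an assumption, not a tested recipe.
CREATE FUNCTION encodeslash(text) RETURNS text AS $$
    SELECT replace($1, '/', 'xxxxxx');
$$ LANGUAGE sql IMMUTABLE;

CREATE FUNCTION decodeslash(text) RETURNS text AS $$
    SELECT replace($1, 'xxxxxx', '/');
$$ LANGUAGE sql IMMUTABLE;

-- Encode both the document and the query so they yield the same lexeme.
SELECT to_tsvector('default', encodeslash('lovely und bauarbeiter/in'))
       @@ plainto_tsquery('default', encodeslash('bauarbeiter/in'));
```

Because ispell will not recognize the encoded word, this only helps if a dictionary that accepts everything sits at the end of the list.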
thanks
jodok
Oleg
On Fri, 31 Oct 2008, Jodok Batlogg wrote:
we're using tsearch2 with the german dictionary
http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/dicts/ispell/ispell-german-compound.tar.gz
for fulltext search.
the indexing is configured as follows:
CREATE TEXT SEARCH DICTIONARY public.german (
TEMPLATE = ispell,
DictFile = german,
AffFile = german,
StopWords = german
);
CREATE TEXT SEARCH CONFIGURATION public.default ( COPY = pg_catalog.german
);
ALTER TEXT SEARCH CONFIGURATION public.default
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part
WITH public.german;
-------------------------
select * from ts_debug('default', 'hundshütte');
works as expected: creates the two lexemes: "{hund,hütte}"
BUT
SELECT to_tsvector('default','lovely und bauarbeiter/in');
loses a lot of stuff:
"'bauarbeiter/in':2"
some more debugging shows:
SELECT * from ts_debug('default','lovely und bauarbeiter/in');
"asciiword";"Word, all ASCII";"lovely";"{german}";"german";""
"blank";"Space symbols";" ";"{}";"";""
"asciiword";"Word, all ASCII";"und";"{german}";"german";"{}"
"blank";"Space symbols";" ";"{}";"";""
"file";"File or path name";"bauarbeiter/in";"{simple}";"simple";"{bauarbeiter/in}"
a) unknown words are just being dropped
b) words with slashes are interpreted as file paths and the first path
is being dropped.
any idea how we can fix this?
jodok
Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@xxxxxxxxxx, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83
--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)