Hi Jeff,
You're right about that point. Let me rephrase. I would like to drop all tokens that are neither the stemmed nor the unstemmed version of a known word. Would it be possible to put a word list as a filter ahead of the stemming? Or do you know of a good English lexeme list that could be used to filter after stemming?
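To make the idea concrete, here is a rough sketch of what I have in mind, not something I have tested. It assumes an English Ispell-format word list is installed under $SHAREDIR/tsearch_data; the file names english.dict and english.affix and the dictionary name english_ispell are only placeholders, since as far as I know PostgreSQL does not ship such a list. An ispell dictionary returns NULL for words it does not know, and a token that no dictionary in its mapping recognizes is discarded, so mapping plain words to the word list alone should drop the junk:

```sql
-- Assumption: english.dict / english.affix are placeholder names for an
-- Ispell-format word list placed in $SHAREDIR/tsearch_data by hand.
CREATE TEXT SEARCH DICTIONARY english_ispell (
    TEMPLATE  = ispell,
    DictFile  = english,
    AffFile   = english,
    StopWords = english
);

-- Map plain-word tokens to the word list only. There is no Snowball
-- fallback after it, so words the list does not know are discarded.
ALTER TEXT SEARCH CONFIGURATION english_led
    ALTER MAPPING FOR asciiword, word
    WITH english_ispell;
```

The catch is that known words would then be reduced to the word list's base forms rather than to Snowball stems, and hyphenated-word token types would still go through english_stem unless they are remapped as well.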
Thanks, Christoph
Hi everybody,
I am trying to get all the lexemes for a text using to_tsvector(). But I want only words that english_stem -- the integrated snowball dictionary -- is able to handle to show up in the final tsvector. Since snowball dictionaries only remove stop words, but keep the words that they cannot stem, I don't see an easy option to do this. Do you have any ideas?
I went ahead with creating a new configuration:
-- add new configuration english_led
CREATE TEXT SEARCH CONFIGURATION public.english_led (COPY = pg_catalog.english);

-- dropping any words that contain numbers already in the parser
ALTER TEXT SEARCH CONFIGURATION english_led DROP MAPPING FOR numword;
EXAMPLE:
SELECT * from to_tsvector('english_led','A test sentence with ui44 \tt somejnk words');
                   to_tsvector
--------------------------------------------------
 'sentenc':3 'somejnk':6 'test':2 'tt':5 'word':7
In this tsvector, I would like 'somejnk' and 'tt' not to be included.
I don't think the question is well defined. It will happily stem 'somejnking' to 'somejnk'. Doesn't that mean that it **can** handle it? The fact that 'somejnk' itself wasn't altered during stemming doesn't mean it wasn't handled, just as 'test' wasn't altered during stemming either.
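If you want to see this for yourself, ts_lexize() shows what a single dictionary does with a single token. A minimal sketch using the built-in english_stem dictionary; the input words are just the ones from this thread:

```sql
-- A lexeme array means the dictionary recognized the token,
-- an empty array means it is a stop word,
-- NULL means the dictionary does not know the token.
SELECT ts_lexize('english_stem', 'somejnking');  -- suffix is stripped
SELECT ts_lexize('english_stem', 'somejnk');     -- comes back unchanged
SELECT ts_lexize('english_stem', 'test');        -- comes back unchanged
SELECT ts_lexize('english_stem', 'with');        -- stop word
```

A Snowball dictionary never returns NULL for a word, so from the dictionary's point of view there is no such thing as a word it "cannot handle".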
Cheers,
Jeff