Re: using Tsearch2 for chemical text

Naz Gassiep <naz@xxxxxxxx> · Thu, 26 Jul 2007 15:53:05 +1000

I think you might need to write a custom lexer to divide the strings
into meaningful units.  If there are subsections of these names that
make sense to search for, then tsearch2 can certainly handle the
mechanics of that, but I doubt that the standard rules will divide
these names into lexemes usefully.

A custom lexer for tsearch2 that recognized chemistry-related lexical
components (di-, tetra-, acetyl-, ethan-, -oic, -ane, -ene, etc.) would
*hugely* increase the out-of-the-box applicability of PostgreSQL to
scientific applications. Such an effort could perhaps be coordinated
with a physics-based lexer and a biology-related lexer, to provide a
unified lexer offering full scientific capabilities in the way that
PostGIS provides unified geospatial capabilities.
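
As a rough illustration of the kind of splitting I mean, here is a
minimal sketch in Python. It is not tsearch2 code and does not use any
tsearch2 API. The morpheme list and the split_name function are
hypothetical, made up for this example, and the greedy longest-match
approach is just one simple way to do it. A real lexer would need a
much larger vocabulary and proper nomenclature rules.

```python
import re

# Hypothetical, deliberately tiny morpheme list (an assumption for this
# sketch); real chemical nomenclature needs a far larger vocabulary.
MORPHEMES = ["tetra", "acetyl", "methyl", "ethyl", "ethan", "chloro",
             "tri", "di", "oic", "ane", "ene", "ol"]

# Try longer morphemes first so "tetra" wins over shorter candidates.
BY_LENGTH = sorted(MORPHEMES, key=len, reverse=True)


def split_name(name):
    """Greedy longest-match split of a chemical name into morphemes.

    Stretches that match no known morpheme are kept as their own tokens
    rather than discarded.
    """
    tokens = []
    for word in re.findall(r"[a-z]+", name.lower()):
        i = 0
        unknown = ""
        while i < len(word):
            match = next((m for m in BY_LENGTH if word.startswith(m, i)), None)
            if match is not None:
                if unknown:
                    tokens.append(unknown)
                    unknown = ""
                tokens.append(match)
                i += len(match)
            else:
                # No morpheme starts here; accumulate into an unknown token.
                unknown += word[i]
                i += 1
        if unknown:
            tokens.append(unknown)
    return tokens


print(split_name("tetrachloroethene"))
print(split_name("ethanoic acid"))
```

Greedy matching is easy to follow but will mis-split some names (it
has no notion of which split is chemically sensible), which is one
reason a proper lexer would be a real project rather than a quick hack.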

I don't know how best to bring such an effort about, but I do know that 
if such a thing were created it would be a boon for PostgreSQL, giving 
it a very significant leg up in terms of functionality, not to mention 
the great positive impact that the wide, free availability of such a 
tool would have on the scientific research community.
