Re: Initial ugly reverse-translator

Oleg Bartunov <oleg@xxxxxxxxxx> · Sat, 19 Apr 2008 21:10:38 +0400 (MSD)

On Sat, 19 Apr 2008, Tom Lane wrote:

Craig Ringer <craig@xxxxxxxxxxxxxxxxxxxxx> writes:
Tom Lane wrote:
I don't really see the problem.  I assume from your reference to pg_trgm
that you're using trigram similarity as the prefilter for potential
matches

It turns out that's no good anyway, as it appears to ignore characters
outside the ASCII range. Rather less than useful for searching a
database of translated strings ;-)

A quick look at the pg_trgm code suggests that it is only prepared to
deal with single-byte encodings; if you're working in UTF8, which I
suppose you'd have to be, it's dead in the water :-(.  Perhaps fixing
that should be on the TODO list.

The same goes for ltree. Both are on our TODO list:
http://www.sai.msu.su/~megera/wiki/TODO
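
To make the multibyte complaint above concrete, here is a small Python sketch of trigram similarity. It is not pg_trgm's code: the function names are made up for illustration, the whole string is treated as a single word, and the behaviour reported in this thread is only simulated by throwing away every non-ASCII character before the trigrams are built.

```python
def char_trigrams(text):
    """Character-level trigrams, padded roughly the way trigram matchers
    usually pad a word (two leading blanks, one trailing blank)."""
    padded = "  " + text.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def ascii_only_trigrams(text):
    """Hypothetical stand-in for the reported behaviour: every character
    outside the ASCII range is dropped before trigrams are extracted."""
    stripped = "".join(ch for ch in text if ord(ch) < 128)
    return char_trigrams(stripped)


def similarity(a, b, trigrams=char_trigrams):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


# Two unrelated Russian words.
print(similarity("привет", "пока"))                       # low score
print(similarity("привет", "пока", ascii_only_trigrams))  # looks like a perfect match
```

With the non-ASCII characters discarded, both words collapse to the same padding-only trigram, so unrelated translated strings become indistinguishable. Working on characters rather than bytes keeps them apart.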

But in any case maybe the full-text-search stuff would be more useful
as a prefilter?  Although honestly, for the speed we need here, I'm
not sure a prefilter is needed at all.  Full text might be useful
if a LIKE-based match fails, though.

(And besides, speed doesn't seem like the be-all and end-all here.)

True. It's not so much the speed as the fragility when faced with small
changes to formatting. In addition to whitespace, some clients mangle
punctuation with features like automatic "curly"-quoting.

Yeah.  I was wondering whether encoding differences wouldn't be a huge
problem in practice, as well.
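
One way to reduce the fragility discussed above is to normalize both the pasted message and the stored strings before comparing them. The Python sketch below is only an illustration under stated assumptions: the function name and the choice of which quote characters to map are mine, and it does nothing about a message that arrives in a different encoding.

```python
import re

# Typographic quotes that clients commonly substitute for plain ones.
_QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def normalize_message(message):
    """Straighten curly quotes and collapse runs of whitespace
    (including line breaks) into single spaces."""
    straightened = message.translate(_QUOTE_MAP)
    return re.sub(r"\s+", " ", straightened).strip()


pasted = "relation \u201cfoo\u201d  does\n not exist "
print(normalize_message(pasted))
```

Applying the same function to both sides of the comparison means a LIKE-style match no longer depends on how the client wrapped lines or rewrote the quotes.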

			regards, tom lane

	Regards,
		Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@xxxxxxxxxx, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83