FTS performance with the Polish config

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




Hello

I just finished implementing a "search engine" for my site and found ts_headline extremely slow when used with a Polish tsearch configuration, while fast with English. All of it boils down to a simple testcase, but first some background.

I tested on 8.3.1 on G5/OSX 10.5.8 and Xeon/Gentoo AMD64-2008.0 (2.6.21), then switched both installations to 8.3.8 (both packages compiled, but provided by the distro - port/emerge). The Polish dictionaries and config were created according to this article (it's in Polish, but the code is self-explanatory):

http://www.depesz.com/index.php/2008/04/22/polish-tsearch-in-83-polski-tsearch-w-postgresie-83/

 Now for the testcase:

text = 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.'

# explain analyze select ts_headline('polish', text, plainto_tsquery('polish', 'foobar')); QUERY PLAN
------------------------------------------------------------------------------------
Result (cost=0.00..0.01 rows=1 width=0) (actual time=6.407..6.470 rows=1 loops=1)
 Total runtime: 6.524 ms
(2 rows)

# explain analyze select ts_headline('english', text, plainto_tsquery('english', 'foobar')); QUERY PLAN
------------------------------------------------------------------------------------
Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.861..0.895 rows=1 loops=1)
 Total runtime: 0.935 ms
(2 rows)

# explain analyze select ts_headline('simple', text, plainto_tsquery('simple', 'foobar')); QUERY PLAN
------------------------------------------------------------------------------------
Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.627..0.660 rows=1 loops=1)
 Total runtime: 0.697 ms
(2 rows)

#

As you can see, the results differ by an order of magnitude between Polish and English. While in this simple testcase it's a non-issue, in the real world this translates into enormous overhead.

One of the queries I ran testing my site's search function took 1870ms. When I took that query and changed all ts_headline(foo) calls to just foo, the time dropped below 100ms. That's the difference between something completely unacceptable and something quite useful.

I can post various details about the hardware, software and specific queries, but the testcases speak for themselves. I'm sure you can easily reproduce my results.

Hints would be very much appreciated, since I've already spent way more time on this, than I could afford.


cheers,
Wojciech Knapik


PS. A few other details can be found here http://pastie.textmate.org/private/hqnqfnsfsknjyjlffzmog along with snippets of my conversations in #postgresql that lead to this testcase. Big thanks to RhodiumToad for helping me with fts for the last couple days ;]




--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

[Postgresql General]     [Postgresql PHP]     [PHP Users]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Yosemite]

  Powered by Linux