On Mon, Oct 26, 2009 at 4:02 PM, Jesper Krogh <jesper@xxxxxxxx> wrote: > Hi. > > I'm currently trying to figure out why the tsearch performance seems to > vary a lot between different queryplans. I have created a sample dataset > that sort of resembles the data I have to work on. > > The script that builds the dataset is at: > http://krogh.cc/~jesper/build-test.pl > and http://krogh.cc/~jesper/words.txt is needed for it to run. > > Test system.. average desktop, 1 SATA drive and 1.5GB memory with pg 8.4.1. > > The dataset consists of words randomized, but .. all records contains > "commonterm", around 80% contains commonterm80 and so on.. > > my $rand = rand(); > push @doc,"commonterm" if $commonpos == $j; > push @doc,"commonterm80" if $commonpos == $j && $rand < 0.8; > > Results are run multiple times after each other so they should be > reproducible: > > ftstest=# explain analyze select id from ftstest where body_fts @@ > to_tsquery('commonterm80'); > QUERY PLAN > > ---------------------------------------------------------------------------------------------------------------- > Seq Scan on ftstest (cost=0.00..10750.00 rows=40188 width=4) (actual > time=0.102..1792.215 rows=40082 loops=1) > Filter: (body_fts @@ to_tsquery('commonterm80'::text)) > Total runtime: 1809.437 ms > (3 rows) > > ftstest=# set enable_seqscan=off; > SET > ftstest=# explain analyze select id from ftstest where body_fts @@ > to_tsquery('commonterm80'); > QUERY PLAN > > --------------------------------------------------------------------------------------------------------------------------------------- > Bitmap Heap Scan on ftstest (cost=115389.14..125991.96 rows=40188 > width=4) (actual time=17.445..197.356 rows=40082 loops=1) > Recheck Cond: (body_fts @@ to_tsquery('commonterm80'::text)) > -> Bitmap Index Scan on ftstest_gin_idx (cost=0.00..115379.09 > rows=40188 width=0) (actual time=13.370..13.370 rows=40082 loops=1) > Index Cond: (body_fts @@ to_tsquery('commonterm80'::text)) > Total runtime: 204.201 ms > (5 rows) > > Given that the seq-scan have to visit 50K row to create the result and > the bitmap heap scan only have to visit 40K (but search the index) we > would expect the seq-scan to be at most 25% more expensive than the > bitmap-heap scan.. e.g. less than 300ms. I've seen behavior similar to this in the past with a plain old B-tree index. As in your case, a bitmap index scan was significantly faster than a sequential scan even though essentially all the heap pages had to be scanned, but the planner expected the opposite to be true. The planner's expectation is that the dominent cost will be fetching the pages, and it furthermore thinks that fetching things in sequential order is much better than skipping around randomly. However, if all the pages are memory-resident - possibly even in L2 or L3 CPU cache - fetching the pages is nearly free, so the dominant cost becomes the CPU time to process the tuples. My best guess is that in cases like this index cond is cheaper to evaluate than the recheck cond/filter, so the index scan wins not by reading fewer pages but by avoiding the need to examine some of the tuples on each page. I might be all wet, though. If your whole database fits in RAM, you could try changing your seq_page_cost and random_page_cost variables from the default values of 1 and 4 to something around 0.05, or maybe even 0.01, and see whether that helps. But if it's just this query that is in cache and you have lots of other things that are going to disk, that's harder to tune. You can probably still lower the default values somewhat, but if you go crazy with it you'll start to have problems in the other direction. ...Robert -- Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance