I'm trying to speed up a query on a text column of a 14M-row table.
Uncached query times vary between 1-20 seconds (maybe more), depending
on the search
term. In between time trials I've been trying to flush the disk buffer
cache by selecting count(*) from a separate 4GB table, and times are pretty
consistent for a given search term. If a search term hasn't been seen
in a while,
or my full table scan has purged memory, a query may take 20 seconds,
whereas
if the relevant pages are cached, it may take 8.5 milliseconds. I'd
really like to
avoid that 20 second turn-off for users.
I'm guessing I need to put lots of RAM in this machine (currently only 2GB;
max 8GB) and somehow pre-load the entire index? The production version
of this
database will be read-only, btw. Because this index will be key to the
whole
application, and because the number of possible search terms is so large, it
would be nice if the entire index could somehow be preloaded into memory
and
encouraged to stay there. Alternatively, we could prime the disk buffers by
doing searches on what we guess will be the most common terms. I wasn't
paying
attention to the recent thread about ram disks, but maybe this is a
situation that
might call for one?
The rest of this message contains details about the situation.
Thanks for the advice, as usual!
-Kevin Murphy
Sample query:
explain analyze select * from tagged_genes where mention = 'bcl2';
QUERY
PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using tagged_genes_mention_idx on tagged_genes
(cost=0.00..327.64 rows=81 width=67) (actual time=28.694..5544.779
rows=848 loops=1)
Index Cond: (mention = 'bcl2'::text)
Total runtime: 5545.434 ms
The index of the query column (mention) is 226,229 pages (= 1.7 GB?).
There are 1.3M unique values in that column. I've run 'vacuum analyze'
on the table. I also tried setting the statistics target to 1000, and
it did
speed up searches for some search terms.
Just out of curiosity, and because I am interested in possibly using
tsearch2 in the future, I created a tsvector column indexing the
mention column (and added a GIST index and vacuum-analyzed the table
again). tsearch2 is a lot slower, presumably because it's doing a lot
more (although in this case, it doesn't return all that much more).
Here is a typical result of a tsearch2 search on my data:
explain analyze select * from tagged_genes where vector @@
to_tsquery('bcl2');
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using vector_idx on tagged_genes (cost=0.00..56092.98
rows=13983 width=67) (actual time=202.078..43122.688 rows=980 loops=1)
Index Cond: (vector @@ '\'bcl2\''::tsquery)
Total runtime: 43124.215 ms
Blech. I'd love to use tsearch2, but it's hard to see it being useful
with my puny hardware and not so puny data.
I'm using PG 8.0.3 on Mac OS X 10.4.2 on a dual 2.5GHz G5 currently
with 2GB RAM. The data is physically stored on an Xserve RAID array
(seven 7200rpm ultra-ATA drives, RAID 5) connected via fibre channel
directly to the G5. I did some tests of this array's performance for a
single
user. Using a large block size (256K), this array can do sequential reads
at 134 MB/sec, but the performance drops to 12MB/sec for 4K
sequential reads. Random 4K reads are 5MB/sec, and random 256K
reads are 121MB/sec (all these according to "XBench" - not sure if it's a
good benchmark).
I'm using these shared memory settings:
kern.sysv.shmmax=134217728
kern.sysv.shmall=32768
kern.sysv.shmmin=1
kern.sysv.shmmni=32
kern.sysv.shmseg=8
and from postgresql.conf:
shared_buffers = 15000
work_mem = 32768
Everything else in postgresql.conf is default.
-------
Here is the table description:
tagged_genes table:
13,982,464 rows
422,028 pages (although about half of that is the experimental tsvector
column, though!)
create table tagged_genes (
id bigint NOT NULL PRIMARY KEY, -- artificial primary key
mention text, -- a gene name or description
pmid bigint, -- identifies the document that
the mention occurs in
create_date timestamp NOT NULL,
vector tsvector -- experimental tsearch2 index of
mention column
);
create index tg_mention_idx on tagged_genes(mention);
create index tg_pmid_idx on tagged_genes(pmid);
create index tg_vector_idx on tagged_genes(vector);
========================================
Some trials with different search terms:
mycn:
Index Scan using tagged_genes_mention_idx on tagged_genes
(cost=0.00..327.64 rows=81 width=67) (actual time=41.703..2751.600
rows=479 loops=1)
Index Cond: (mention = 'mycn'::text)
Total runtime: 2751.936 ms
mycn trials: 2752 ms, 2755 ms, 2766 ms
bcl2:
Index Scan using tagged_genes_mention_idx on tagged_genes
(cost=0.00..327.64 rows=81 width=67) (actual time=28.694..5544.779
rows=848 loops=1)
Index Cond: (mention = 'bcl2'::text)
Total runtime: 5545.434 ms
bcl2 trials: 5545 ms, 5492 ms, 5493 ms
cyp3a4:
Index Scan using tagged_genes_mention_idx on tagged_genes
(cost=0.00..7867.60 rows=1958 width=67) (actual time=58.138..9602.558
rows=1985 loops=1)
Index Cond: (mention = 'cyp3a4'::text)
Total runtime: 9603.733 ms
cyp3a4 trials: 9604 ms, 11872 ms, 9970 ms
tp53:
Index Scan using tagged_genes_mention_idx on tagged_genes
(cost=0.00..327.64 rows=81 width=67) (actual time=28.505..8064.808
rows=1484 loops=1)
Index Cond: (mention = 'tp53'::text)
Total runtime: 8065.791 ms
P.S. I've heard that OS X sucks as a server OS. I don't know if that's
true, but I'm interested to install a PPC Linux and give that a go for
the sake of comparison.
Don't know where that would leave my storage hardware, though.
---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend