Hi, I have quite a simple query but a lot of data, and the SELECT query is too slow. I would be really grateful for any advice on this.

--------------------------------------------------
Background info:

I work on a speech search engine, which differs from text search in that there can be several words (hypotheses) at the same position, and each hypothesis has a weight (probability) of occurrence. When a word 'hello' appears in a document 'lecture_1', there is a row in the table hyps (see below) which contains an array of all positions of the word 'hello' in the document 'lecture_1', and for each position it contains a weight as well. I need the positions to be able to search for phrases. However, here I have simplified the query as much as I could without a significant reduction in speed.

I know there is the tsearch extension, which might be more appropriate for this, but I haven't tried it yet. The size of my data would stay the same, and that seems to be the issue in my case. But maybe I am wrong and with tsearch it would be much faster. What do you think?

--------------------------------------------------
Preconditions:

First I cleared the disk cache:

  sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'

Then I started the PostgreSQL daemon and connected to my database with the psql client. The first thing I did was execute the SELECT query described below. It took about 4.5 seconds. If I rerun it, it takes less than 2 milliseconds, but that is because of the cache. I need to optimize the first run.

--------------------------------------------------
Hardware:

ASUS laptop, dual-core CPU T2300 1.66 GHz, 1.5 GB RAM

--------------------------------------------------
Version:

PostgreSQL 8.4.4 on i686-pc-linux-gnu, compiled by GCC gcc (Ubuntu 4.4.1-4ubuntu9) 4.4.1, 32-bit
Compiled from sources, with only the --prefix=... argument given to ./configure.

--------------------------------------------------
Schema:

CREATE TABLE hyps (
  docid        INT,
  wordid       INT,
  positions    INT[],
  weights      REAL[],
  length       INT,
  total_weight REAL
);

COPY hyps FROM '/home/miso/exp/speech_search/postgresql/sqlcopy/all_weights_clustered.sqlcopy';

CREATE INDEX hyps_wordid_index ON hyps USING hash (wordid);
CREATE INDEX hyps_docid_index ON hyps USING hash (docid);

shared_buffers = 300MB   ...this is the only thing I changed in the config

I also tried btree indexes instead of hash, and surprisingly the SELECT query was a bit faster; I would have expected hash to be faster. The index on the 'docid' column is there because I also need to be able to search within a particular document or a set of documents.

--------------------------------------------------
Table info:

- rows = 5490156
- average length of positions vectors = 19.5
- total number of items in positions vectors = 107444304
- positions and weights in one row have the same number of items, but the number may differ from row to row
- table data are loaded only once (using COPY) and are not modified afterwards
- there are 369 distinct docid values and 161460 distinct wordid values
- VACUUM was executed after the COPY of the data

--------------------------------------------------
Query:

EXPLAIN ANALYZE SELECT h1.docid FROM hyps AS h1 WHERE h1.wordid=65658;

Bitmap Heap Scan on hyps h1  (cost=10.97..677.09 rows=171 width=4) (actual time=62.106..4416.864 rows=343 loops=1)
  Recheck Cond: (wordid = 65658)
  ->  Bitmap Index Scan on hyps_wordid_index  (cost=0.00..10.92 rows=171 width=0) (actual time=42.969..42.969 rows=343 loops=1)
        Index Cond: (wordid = 65658)
Total runtime: 4432.015 ms

The result has 343 rows, and there are 9294 items in the positions vectors in total.
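For reference, the per-query and whole-table counts quoted above can be reproduced with queries along these lines (just a sketch; it assumes the 'length' column always equals the number of items in the positions array, as described in the table info):

-- matching rows and total items in the positions vectors for the test word
SELECT count(*)    AS matching_rows,
       sum(length) AS position_items
FROM hyps
WHERE wordid = 65658;

-- whole-table statistics (row count, average/total positions length, distinct ids)
SELECT count(*)               AS row_count,
       avg(length)            AS avg_positions_length,
       sum(length)            AS total_position_items,
       count(DISTINCT docid)  AS distinct_docids,
       count(DISTINCT wordid) AS distinct_wordids
FROM hyps;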
--------------------------------------------------
Comparison with Lucene:

If I run the same query in the Lucene search engine, it takes 0.105 seconds on the same data, which is quite a huge difference.

--------------------------------------------------
Synthetic data set:

If you want to try it yourself, here is a script which generates data for the COPY command. I don't know whether it is possible to send attachments here, so I put the script inline. Just save it as create_synthetic_data.pl and run it with 'perl create_synthetic_data.pl'. With this synthetic data the SELECT query times are around 2.5 seconds. You can try the SELECT query with 'wordid' equal to 1, 2, 3, ... 10000.

#!/usr/bin/perl
# Create synthetic data for PostgreSQL COPY.
$rows  = 5490156;
$docs  = 369;
$words = 161460;
$docid  = 0;
$wordid = 0;
for ($row=0; $row<$rows; $row++) {
    my $sep = "";
    my $positions = "";
    my $weights = "";
    my $total_weight = 0;
    my $items = int(rand(39)) + 1;
    if ($row % int($rows/$docs) == 0) {
        $docid++;
        $wordid = 0;
    }
    $wordid++;
    for ($i=0; $i<$items; $i++) {
        $position = int(rand(20000));
        $weight = rand(1);
        $positions .= $sep.$position;
        $weights   .= $sep.sprintf("%.3f", $weight);
        $total_weight += $weight;
        $sep = ",";
    }
    print "$docid\t$wordid\t{$positions}\t{$weights}\t$items\t$total_weight\n";
}

If you need any other info, I will gladly provide it. Thank you for your time.

Miso Fapso
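P.S. For completeness, this is roughly how I load the generated synthetic data (the output file path below is just an example; adjust it to your setup):

perl create_synthetic_data.pl > /tmp/synthetic_hyps.sqlcopy

Then in psql, after creating the hyps table from the schema above:

COPY hyps FROM '/tmp/synthetic_hyps.sqlcopy';  -- example path, adjust to your setup
CREATE INDEX hyps_wordid_index ON hyps USING hash (wordid);
CREATE INDEX hyps_docid_index ON hyps USING hash (docid);
VACUUM hyps;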