I forgot to mention one thing. If you want to generate data using the
perl script, do this:

  perl create_synthetic_data.pl > synthetic_data.sqlcopy

and then, after you create the 'hyps' table, use the COPY command with
the generated file:

  COPY hyps FROM '/the/full/path/synthetic_data.sqlcopy';

Best regards,
Miso Fapso

On 2 July 2010 00:34, Michal Fapso <michal.fapso@xxxxxxxxx> wrote:
> Hi,
>
> I have quite a simple query but a lot of data, and the SELECT query is
> too slow. I would be really grateful for any advice on this.
>
> --------------------------------------------------
> The background info:
>
> I work on a speech search engine, which differs from text search in
> having several words (hypotheses) at the same position, where each
> hypothesis has some weight (probability) of occurrence.
>
> When a word 'hello' appears in a document 'lecture_1', there is a row
> in the table hyps (see below) which contains an array of all positions
> of the word 'hello' in the document 'lecture_1', and for each position
> it contains a weight as well.
>
> I need the positions to be able to search for phrases. However, here I
> have simplified the query as much as I could without a significant
> reduction in speed.
>
> I know there is the tsearch extension, which could be more appropriate
> for this, but I haven't tried it yet. The size of my data would stay
> the same, and that seems to be the issue in my case. But maybe I am
> wrong and with tsearch it will be much faster. What do you think?
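> If I were to try it, I imagine the tsearch variant would look roughly
> like this (only a sketch; the table, column and index names are made
> up, and I don't yet know how the per-position real-valued weights
> would fit into a tsvector):
>
>   -- one row per document; 'body' would be filled with
>   -- to_tsvector('english', document_text)
>   CREATE TABLE docs (
>       docid INT,
>       body  tsvector
>   );
>   CREATE INDEX docs_body_index ON docs USING gin (body);
>
>   -- a single-word lookup, analogous to my wordid query below:
>   SELECT docid FROM docs WHERE body @@ to_tsquery('english', 'hello');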
> --------------------------------------------------
> Preconditions:
>
> First I cleared the disk cache:
>
>   sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
>
> Then I started the postgresql daemon and connected to my database with
> the psql client. The first thing I did was execute the SELECT query
> described below. It took about 4.5 seconds. If I rerun it, it takes
> less than 2 milliseconds, but that is because of the cache. I need to
> optimize the first run.
>
> --------------------------------------------------
> Hardware:
>
> ASUS laptop, dual core CPU T2300 1.66GHz, 1.5G RAM
>
> --------------------------------------------------
> Version:
>
> PostgreSQL 8.4.4 on i686-pc-linux-gnu, compiled by GCC gcc (Ubuntu
> 4.4.1-4ubuntu9) 4.4.1, 32-bit
>
> compiled from sources, with only the --prefix=... argument given to
> ./configure
>
> --------------------------------------------------
> Schema:
>
> CREATE TABLE hyps (
>     docid        INT,
>     wordid       INT,
>     positions    INT[],
>     weights      REAL[],
>     length       INT,
>     total_weight REAL
> );
> COPY hyps FROM '/home/miso/exp/speech_search/postgresql/sqlcopy/all_weights_clustered.sqlcopy';
> CREATE INDEX hyps_wordid_index ON hyps USING hash (wordid);
> CREATE INDEX hyps_docid_index ON hyps USING hash (docid);
>
> shared_buffers = 300MB   ...this is the only thing I changed in the config
>
> I also tried btree indices instead of hash, and surprisingly the
> SELECT query was a bit faster. I would have expected hash to be faster.
>
> The index on the 'docid' column is there because I also need to be
> able to search in a particular document or in a set of documents.
>
> --------------------------------------------------
> Table info:
>
> - rows = 5490156
> - average length of positions vectors = 19.5
> - total number of items in positions vectors = 107444304
> - positions and weights in one row have the same number of items, but
>   the number may differ from row to row
> - table data are loaded only once (using COPY) and are not modified
>   afterwards
> - there are 369 distinct docid and 161460 distinct wordid values
> - VACUUM was executed after the COPY of data
>
> --------------------------------------------------
> Query:
>
> EXPLAIN ANALYZE SELECT h1.docid
> FROM hyps AS h1
> WHERE h1.wordid=65658;
>
> Bitmap Heap Scan on hyps h1 (cost=10.97..677.09 rows=171 width=4)
> (actual time=62.106..4416.864 rows=343 loops=1)
>   Recheck Cond: (wordid = 65658)
>   -> Bitmap Index Scan on hyps_wordid_index (cost=0.00..10.92
> rows=171 width=0) (actual time=42.969..42.969 rows=343 loops=1)
>       Index Cond: (wordid = 65658)
> Total runtime: 4432.015 ms
>
> The result has 343 rows, and there are 9294 items in the positions
> vectors in total.
>
> --------------------------------------------------
> Comparison with Lucene:
>
> If I run the same query in the Lucene search engine, it takes 0.105
> seconds on the same data, which is quite a huge difference.
>
> --------------------------------------------------
> Synthetic data set:
>
> If you want to try it yourself, here is a script which generates the
> data for the COPY command. I don't know whether it is possible to send
> attachments here, so I put the script inline. Just save it as
> create_synthetic_data.pl and run it with 'perl
> create_synthetic_data.pl'. With this synthetic data the SELECT query
> times are around 2.5 seconds. You can try the SELECT query with
> 'wordid' equal to 1, 2, 3, ... 10000.
>
>
> #!/usr/bin/perl
> # Create synthetic data for PostgreSQL COPY.
> use strict;
> use warnings;
>
> my $rows   = 5490156;
> my $docs   = 369;
> my $words  = 161460;   # for reference only; wordid is reset per document
> my $docid  = 0;
> my $wordid = 0;
>
> for (my $row = 0; $row < $rows; $row++) {
>
>     my $sep          = "";
>     my $positions    = "";
>     my $weights      = "";
>     my $total_weight = 0;
>     my $items        = int(rand(39)) + 1;   # 1..39 positions per row
>
>     # start a new document every $rows/$docs rows
>     if ($row % int($rows / $docs) == 0) {
>         $docid++;
>         $wordid = 0;
>     }
>     $wordid++;
>
>     for (my $i = 0; $i < $items; $i++) {
>         my $position = int(rand(20000));
>         my $weight   = rand(1);
>         $positions .= $sep . $position;
>         $weights   .= $sep . sprintf("%.3f", $weight);
>         $total_weight += $weight;
>         $sep = ",";
>     }
>     # one tab-separated line per row, in COPY's default text format
>     print "$docid\t$wordid\t{$positions}\t{$weights}\t$items\t$total_weight\n";
> }
>
>
> If you need any other info, I will gladly provide it.
>
> Thank you for your time.
> Miso Fapso
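P.S. Putting the follow-up and the original mail together, the complete
sequence for reproducing the synthetic test case is roughly this (the
COPY path is just wherever you redirected the script's output):

  -- same schema and indexes as in the original mail:
  CREATE TABLE hyps (
      docid        INT,
      wordid       INT,
      positions    INT[],
      weights      REAL[],
      length       INT,
      total_weight REAL
  );
  COPY hyps FROM '/the/full/path/synthetic_data.sqlcopy';
  CREATE INDEX hyps_wordid_index ON hyps USING hash (wordid);
  CREATE INDEX hyps_docid_index ON hyps USING hash (docid);
  VACUUM hyps;

  -- the first run after clearing the disk cache is the interesting one:
  EXPLAIN ANALYZE SELECT h1.docid FROM hyps AS h1 WHERE h1.wordid = 1;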