big data - slow select (speech search)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi,

I have quite a simple query but a lot of data and the SELECT query is
too slow. I will be really grateful for any advice on this.

--------------------------------------------------
The background info:

I work on a speech search engine which differs from text search in
having more words (hypotheses) on the same position and each
hypothesis has some weight (probability) of occurrence.

When a word 'hello' appears in a document 'lecture_1', there is a row
in the table hyps (see below) which contains an array of all positions
of word 'hello' in the document 'lecture_1' and for each position it
contains a weight as well.

I need the positions to be able to search for phrases. However, here I
simplified the query as much as I could without a significant
reduction in speed.

I know there is tsearch extension which could be more appropriate for
this but I didn't try that yet. The size of my data will be the same
which seems to be the issue in my case. But maybe I am wrong and with
tsearch it will be much faster. What do you think?

--------------------------------------------------
Preconditions:

First I cleared the disk cache:
  sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'

Then run the postgresql deamon and with psql client I connected to my
database. The first thing I did then was executing the SELECT query
described below. It took about 4.5 seconds. If I rerun it, it takes
less than 2 miliseconds, but it is because of the cache. I need to
optimize the first-run.

--------------------------------------------------
Hardware:

laptop ASUS, CPU dual core T2300 1.66GHz, 1.5G RAM

--------------------------------------------------
Version:

PostgreSQL 8.4.4 on i686-pc-linux-gnu, compiled by GCC gcc (Ubuntu
4.4.1-4ubuntu9) 4.4.1, 32-bit

compiled from sources, only --prefix=... argument given to ./configure

--------------------------------------------------
Schema:

CREATE TABLE hyps (
	docid INT,
	wordid INT,
	positions INT[],
	weights REAL[],
	length INT,
	total_weight REAL
);
COPY hyps FROM '/home/miso/exp/speech_search/postgresql/sqlcopy/all_weights_clustered.sqlcopy';
CREATE INDEX hyps_wordid_index ON hyps USING hash (wordid);
CREATE INDEX hyps_docid_index ON hyps USING hash (docid);

shared_buffers = 300MB ...this is the only thing I changed in the config

I tried that also with btree indices instead of hash and surprisingly
the SELECT query was a bit faster. I would expect hash to be faster.

The index on 'docid' column is there because I need to be able to
search also in a particular document or in a set of documents.
--------------------------------------------------
Table info:

- rows = 5490156
- average length of positions vectors = 19.5
- total number of items in positions vectors = 107444304
- positions and weights in one row have the same number of items, but
for each row the number may differ.
- table data are loaded only once (using COPY) and are not modified anymore
- there are 369 various docid and 161460 various wordid
- VACUUM was executed after COPY of data

--------------------------------------------------
Query:

EXPLAIN ANALYZE SELECT h1.docid
FROM hyps AS h1
WHERE h1.wordid=65658;

 Bitmap Heap Scan on hyps h1  (cost=10.97..677.09 rows=171 width=4)
(actual time=62.106..4416.864 rows=343 loops=1)
   Recheck Cond: (wordid = 65658)
   ->  Bitmap Index Scan on hyps_wordid_index  (cost=0.00..10.92
rows=171 width=0) (actual time=42.969..42.969 rows=343 loops=1)
         Index Cond: (wordid = 65658)
 Total runtime: 4432.015 ms

The result has 343 rows and there are 9294 items in positions vectors in total.

--------------------------------------------------
Comparison with Lucene:

If I run the same query in Lucene search engine, it takes 0.105
seconds on the same data which is quite a huge difference.

--------------------------------------------------
Synthetic data set:

If you want to try it yourself, here is a script which generates the
data for COPY command. I don't know whether it is possible to send
attachments here, so I put the script inline. Just save it as
create_synthetic_data.pl and run it by 'perl
create_synthetic_data.pl'. With these synthetic data the SELECT query
times are around 2.5 seconds. You can try the SELECT query with
'wordid' equal 1, 2, 3, ...10000.


#!/usr/bin/perl
# Create synthetic data for PostgreSQL COPY.

$rows = 5490156;
$docs = 369;
$words = 161460;
$docid = 0;
$wordid = 0;

for ($row=0; $row<$rows; $row++) {

	my $sep          = "";
	my $positions    = "";
	my $weights      = "";
	my $total_weight = 0;
	my $items        = int(rand(39))+1;

	if ($row % int($rows/$docs) == 0) {
		$docid++;
		$wordid = 0;
	}
	$wordid++;

	for ($i=0; $i<$items; $i++) {
		$position      = int(rand(20000));
		$weight        = rand(1);
		$positions    .= $sep.$position;
		$weights      .= $sep.sprintf("%.3f", $weight);
		$total_weight += $weight;
		$sep           = ",";
	}
	print "$docid\t$wordid\t{$positions}\t{$weights}\t$items\t$total_weight\n";
}


If you need any other info, I will gladly provide it.

Thank You for Your time.
Miso Fapso

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[Postgresql General]     [Postgresql PHP]     [PHP Users]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Yosemite]

  Powered by Linux