Hi Craig,

Thank you for your answer. Here are my test table and index definitions:

-- document id to category id
create table doc_to_cat (
    doc_id integer not null,
    cat_id integer not null
) with (oids=false);

-- Total 1m rows. 500k unique document ids. 20k unique category ids.
-- Each doc_id refers to exactly two cat_ids.
insert into doc_to_cat
select i/50*25 + i%50 as doc_id, i/50 as cat_id
from generate_series(0, 1000000) i;

create index doc_to_cat__doc_id on doc_to_cat using btree (doc_id);
create index doc_to_cat__cat_id on doc_to_cat using btree (cat_id);

-- document id to array of category ids
create table doc_to_cat_arr (
    doc_id integer not null,
    cat_arr integer[] not null
) with (oids=false);

insert into doc_to_cat_arr
select doc_id, int_array_aggregate(cat_id) as cat_arr
from doc_to_cat
group by doc_id;

create index doc_to_cat_arr__doc_id on doc_to_cat_arr using btree (doc_id);

-- category id to array of document ids
create table cat_to_doc_arr (
    cat_id integer not null,
    doc_arr integer[] not null
) with (oids=false);

insert into cat_to_doc_arr
select cat_id, int_array_aggregate(doc_id) as doc_arr
from doc_to_cat
group by cat_id;

create index cat_to_doc_arr__doc_arr__gin on cat_to_doc_arr using gin (doc_arr gin__int_ops);

-- Query ids: 200k test ids to query with
create table query_ids (
    doc_id integer not null
) with (oids=false);

insert into query_ids
select generate_series(100000, 300000);

create index query_ids__doc_id on query_ids using btree (doc_id);

Now the query plans. We are looking for the cat_ids related to 200k doc ids:

explain analyze
select distinct cat_id
from doc_to_cat
join query_ids using (doc_id);

"Unique  (cost=19303.84..19602.03 rows=20544 width=4) (actual time=1006.745..1190.472 rows=8002 loops=1)"
"  ->  Sort  (cost=19303.84..19452.93 rows=372735 width=4) (actual time=1006.743..1094.908 rows=400002 loops=1)"
"        Sort Key: doc_to_cat.cat_id"
"        Sort Method: quicksort  Memory: 31039kB"
"        ->  Merge Join  (cost=2972.22..13785.04 rows=372735 width=4) (actual time=167.591..750.177 rows=400002 loops=1)"
"              Merge Cond: (query_ids.doc_id = doc_to_cat.doc_id)"
"              ->  Index Scan using query_ids_doc_id on query_ids  (cost=0.00..2912.05 rows=200001 width=4) (actual time=0.021..81.291 rows=200001 loops=1)"
"              ->  Index Scan using doc_to_cat_doc_id on doc_to_cat  (cost=0.00..14543.09 rows=1000001 width=8) (actual time=0.017..281.769 rows=599978 loops=1)"
"Total runtime: 1195.397 ms"

explain analyze
select distinct int_array_enum(cat_arr)
from doc_to_cat_arr
join query_ids using (doc_id);

"Unique  (cost=13676.57..13836.57 rows=19732 width=29) (actual time=1061.490..1246.595 rows=8002 loops=1)"
"  ->  Sort  (cost=13676.57..13756.57 rows=200001 width=29) (actual time=1061.488..1150.451 rows=400002 loops=1)"
"        Sort Key: (int_array_enum(doc_to_cat_arr.cat_arr))"
"        Sort Method: quicksort  Memory: 31039kB"
"        ->  Merge Join  (cost=2318.98..10859.01 rows=200001 width=29) (actual time=163.840..816.697 rows=400002 loops=1)"
"              Merge Cond: (doc_to_cat_arr.doc_id = query_ids.doc_id)"
"              ->  Index Scan using doc_to_cat_arr_doc_id on doc_to_cat_arr  (cost=0.00..11311.10 rows=500025 width=33) (actual time=0.022..359.673 rows=300002 loops=1)"
"              ->  Index Scan using query_ids_doc_id on query_ids  (cost=0.00..2912.05 rows=200001 width=4) (actual time=0.016..81.370 rows=200001 loops=1)"
"Total runtime: 1251.613 ms"

explain analyze
select cat_id
from cat_to_doc_arr
where doc_arr && (select int_array_aggregate(q.doc_id)
                  from (select doc_id from query_ids limit 20000) as q);

This query should never be run - it is too slow even with a limit of 20k input ids.
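For comparison, the same lookup can also be written as an IN subquery - a sketch only, not benchmarked here; the planner may well choose the same merge join + sort plan as the first query above:

-- Untested variant: lets the planner consider a semi-join instead of a plain
-- join; the result set is identical because of the DISTINCT.
explain analyze
select distinct cat_id
from doc_to_cat
where doc_id in (select doc_id from query_ids);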
So the final best result is more than 1 second (on a fast machine) for a test dataset 5 times smaller than needed, which means I am still far from good results. I have to ask again: has anyone faced a similar situation, and is there any way to achieve performance closer to optimal using postgresql functionality and extensibility?

Chavdar Kopoev

-----Original Message-----
From: Craig Ringer [mailto:craig@xxxxxxxxxxxxxxxxxxxxx]
Sent: 2008-11-26, 19:40:47
To: Chavdar Kopoev [mailto:chavdar@xxxxxxxxxx]
Subject: Re: many to many performance

Chavdar Kopoev wrote:

> I want to use postgresql as the data storage. I tried several data structures, testing btree, gin, and gist indexes over them, but the best achieved performance for a 10 times smaller dataset (10k cat ids, 100k doc ids, 1m relations) is still more than 5 times too slow.

Can you post your queries and table definitions so people trying to help you know what you did / tried to do? A downloadable self-contained example might also be useful.

Please also post the output of `EXPLAIN' on your queries, eg:

EXPLAIN SELECT blah, ... FROM blah;

> I read about postgresql bitmap indexes and "data lookup" when scanning indexes to get a value for the current transaction. I downloaded the latest v8.4 snapshot and compiled it, but as I see there is no bitmap index in it. Maybe if I download the HEAD revision I will find them there, I don't know.

Bitmap index scans are an internal function that's used to combine two indexes on the fly during a query (or at least use both of them in one table scan). You don't make a bitmap index; you just make two ordinary btree indexes and let Pg take care of this for you.

If you query on the columns of interest a lot, you might want to use a multi-column index instead.

--
Craig Ringer
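To make the multi-column index suggestion above concrete against the test schema - a sketch only; the index name is invented here, and whether the planner actually uses it (or a BitmapAnd of the two existing single-column indexes) would need to be confirmed with EXPLAIN ANALYZE:

-- Hypothetical combined index for queries that filter on both columns:
create index doc_to_cat__doc_id_cat_id on doc_to_cat using btree (doc_id, cat_id);

-- Example filter on both columns (values are arbitrary); Pg may use the
-- combined index or bitmap-AND the two existing single-column btree indexes:
explain analyze
select * from doc_to_cat where doc_id = 100000 and cat_id = 4000;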