Hi Craig,

Thank you for your answer. Here are my test table and index definitions:

-- document id to category id
create table doc_to_cat (
    doc_id integer not null,
    cat_id integer not null
) with (oids=false);

-- Total 1m rows. 500k unique document ids. 20k unique category ids.
-- Each doc_id refers to exactly two cat_ids.
insert into doc_to_cat
select i/50*25 + i%50 as doc_id, i/50 as cat_id
from generate_series(0, 1000000) i;

create index doc_to_cat__doc_id on doc_to_cat using btree (doc_id);
create index doc_to_cat__cat_id on doc_to_cat using btree (cat_id);

-- document id to array of category ids
create table doc_to_cat_arr (
    doc_id integer not null,
    cat_arr integer[] not null
) with (oids=false);

insert into doc_to_cat_arr
select doc_id, int_array_aggregate(cat_id) as cat_arr
from doc_to_cat
group by doc_id;

create index doc_to_cat_arr__doc_id on doc_to_cat_arr using btree (doc_id);

-- category id to array of document ids
create table cat_to_doc_arr (
    cat_id integer not null,
    doc_arr integer[] not null
) with (oids=false);

insert into cat_to_doc_arr
select cat_id, int_array_aggregate(doc_id) as doc_arr
from doc_to_cat
group by cat_id;

create index cat_to_doc_arr__doc_arr__gin on cat_to_doc_arr using gin (doc_arr gin__int_ops);

-- Query ids: 200k test ids to query with
create table query_ids (
    doc_id integer not null
) with (oids=false);

insert into query_ids
select generate_series(100000, 300000);

create index query_ids__doc_id on query_ids using btree (doc_id);

Now the query plans. We are looking for the cat_ids related to 200k doc ids:

explain analyze
select distinct cat_id
from doc_to_cat
join query_ids using (doc_id);

"Unique  (cost=19303.84..19602.03 rows=20544 width=4) (actual time=1006.745..1190.472 rows=8002 loops=1)"
"  ->  Sort  (cost=19303.84..19452.93 rows=372735 width=4) (actual time=1006.743..1094.908 rows=400002 loops=1)"
"        Sort Key: doc_to_cat.cat_id"
"        Sort Method: quicksort  Memory: 31039kB"
"        ->  Merge Join  (cost=2972.22..13785.04 rows=372735 width=4) (actual time=167.591..750.177 rows=400002 loops=1)"
"              Merge Cond: (query_ids.doc_id = doc_to_cat.doc_id)"
"              ->  Index Scan using query_ids_doc_id on query_ids  (cost=0.00..2912.05 rows=200001 width=4) (actual time=0.021..81.291 rows=200001 loops=1)"
"              ->  Index Scan using doc_to_cat_doc_id on doc_to_cat  (cost=0.00..14543.09 rows=1000001 width=8) (actual time=0.017..281.769 rows=599978 loops=1)"
"Total runtime: 1195.397 ms"

explain analyze
select distinct int_array_enum(cat_arr)
from doc_to_cat_arr
join query_ids using (doc_id);

"Unique  (cost=13676.57..13836.57 rows=19732 width=29) (actual time=1061.490..1246.595 rows=8002 loops=1)"
"  ->  Sort  (cost=13676.57..13756.57 rows=200001 width=29) (actual time=1061.488..1150.451 rows=400002 loops=1)"
"        Sort Key: (int_array_enum(doc_to_cat_arr.cat_arr))"
"        Sort Method: quicksort  Memory: 31039kB"
"        ->  Merge Join  (cost=2318.98..10859.01 rows=200001 width=29) (actual time=163.840..816.697 rows=400002 loops=1)"
"              Merge Cond: (doc_to_cat_arr.doc_id = query_ids.doc_id)"
"              ->  Index Scan using doc_to_cat_arr_doc_id on doc_to_cat_arr  (cost=0.00..11311.10 rows=500025 width=33) (actual time=0.022..359.673 rows=300002 loops=1)"
"              ->  Index Scan using query_ids_doc_id on query_ids  (cost=0.00..2912.05 rows=200001 width=4) (actual time=0.016..81.370 rows=200001 loops=1)"
"Total runtime: 1251.613 ms"

explain analyze
select cat_id
from cat_to_doc_arr
where doc_arr && (select int_array_aggregate(q.doc_id)
                  from (select doc_id from query_ids limit 20000) as q);

This query should never be run - it is too slow even with a limit of 20k input ids.
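For comparison, the same lookup can also be written as an IN subquery - a sketch only, not benchmarked here; the planner may well choose the same merge join + sort plan as the first query above:

-- Untested variant: lets the planner consider a semi-join instead of a plain
-- join; the result set is identical because of the DISTINCT.
explain analyze
select distinct cat_id
from doc_to_cat
where doc_id in (select doc_id from query_ids);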
So the final best result is more than 1 second (on a fast machine) for a test dataset 5 times smaller than needed, which means I am still far from good results. I have to ask again: has anyone faced a similar situation, and is there any way to achieve performance closer to optimal using postgresql functionality and extensibility?

Chavdar Kopoev

-----Original Message-----
From: Craig Ringer [mailto:craig@xxxxxxxxxxxxxxxxxxxxx]
Sent: 2008-11-26, 19:40:47
To: Chavdar Kopoev [mailto:chavdar@xxxxxxxxxx]
Subject: Re: many to many performance

Chavdar Kopoev wrote:

> I want to use postgresql as the data storage. I tried several data structures, testing btree, gin, and gist indexes over them, but the best achieved performance for a 10 times smaller dataset (10k cat ids, 100k doc ids, 1m relations) is still more than 5 times too slow.

Can you post your queries and table definitions so people trying to help you know what you did / tried to do? A downloadable self-contained example might also be useful.

Please also post the output of `EXPLAIN' on your queries, eg:

EXPLAIN SELECT blah, ... FROM blah;

> I read about postgresql bitmap indexes and "data lookup" when scanning indexes to get a value for the current transaction. I downloaded the latest v8.4 snapshot and compiled it, but as I see there is no bitmap index in it. Maybe if I download the HEAD revision I will find them there, I don't know.

Bitmap index scans are an internal function that's used to combine two indexes on the fly during a query (or at least use both of them in one table scan). You don't make a bitmap index; you just make two ordinary btree indexes and let Pg take care of this for you.

If you query on the columns of interest a lot, you might want to use a multi-column index instead.

--
Craig Ringer
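To make the multi-column index suggestion above concrete against the test schema - a sketch only; the index name is invented here, and whether the planner actually uses it (or a BitmapAnd of the two existing single-column indexes) would need to be confirmed with EXPLAIN ANALYZE:

-- Hypothetical combined index for queries that filter on both columns:
create index doc_to_cat__doc_id_cat_id on doc_to_cat using btree (doc_id, cat_id);

-- Example filter on both columns (values are arbitrary); Pg may use the
-- combined index or bitmap-AND the two existing single-column btree indexes:
explain analyze
select * from doc_to_cat where doc_id = 100000 and cat_id = 4000;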