Query performance help with 'shadow table' approach.

Brian Fehrle <brianf@xxxxxxxxxxxxxxxxxxx> · Tue, 13 Sep 2011 15:44:52 -0600

Hi all,
    I've got a large table that has 15 million + rows in it, and a set 
of queries I've been trying to speed up. The table has a primary key 
column, and a couple hundred other columns.

These queries basically do a 'select max(primary_key_column) from table 
group by column1, column2." Because of the group by, we would result in 
a sequential scan of the entire table which proves to be costly.

Since the table has a ton of columns, I set up a smaller table that will 
house a copy of some of the data that the query uses, the Primary Key 
colum, and the two columns I do my 'group by' on. My application is 
smart enough to update this 'shadow' table whenever the main table is 
updated, so it will accurately mirror the other table. This shadow table 
will also only contain one row for every column1 and column2 combination 
(due to the group by), and for those rows, will have the max of the 
primary key. Even with this, the 'shadow' table will have about 14 
million rows, compared to the 15 million in the main table.

Here is an example query that I'm working with:
postgres=# explain select T2.pkey_sid, T2.column54, T2.column44. 
T2.column67 FROM
public.mytable AS T2
JOIN public.mytable_shadow AS T3
        ON (T2.pkey_sid = T3.pkey_sid)
WHERE T3.column1 >= 1072310434 AND T3.column1 <= 1074124834;
                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Hash Join  (cost=118310.65..2250928.27 rows=409600 width=8)
   Hash Cond: (t2.pkey_sid = t3.pkey_sid)
   ->  Seq Scan on mytable t2  (cost=0.00..2075725.51 rows=15394251 
width=8)
   ->  Hash  (cost=113190.65..113190.65 rows=409600 width=8)
         ->  Bitmap Heap Scan on mytable_shadow t3 
(cost=12473.65..113190.65 rows=409600 width=8)
               Recheck Cond: ((1072310434 <= column1) AND (column1 <= 
1074124834))
               ->  Bitmap Index Scan on mytable_shadow_pkey  
(cost=0.00..12371.25 rows=409600 width=0)
                     Index Cond: ((1072310434 <= column1) AND (column1 
<= 1074124834))
(8 rows)

So the issue here comes in retrieving the needed data from my main 
table. The resulting rows is estimated to be 409,600, and the retrieving 
of the primary key's that are associated with those rows is actually 
really easy. However, when we take those 409,600 rows back to the main 
table to retrieve the other columns I need, the planner is just doing a 
sequential scan as it's most likely going to be faster than hitting the 
index then retrieving the columns I need for all 400K+ rows.

Things to note:
1. If I reduce my where clause's range, then the sequential scan turns 
into an index scan, but sadly this can't always be done.
2. I have appropriate indexes where they need to be. The issue is in the 
query planner not using them due to it (i assume) just being faster to 
scan the whole table when the data set it needs is as large as it is.
3. Without this shadow table, my query would look _something_ like this 
(The idea being, retrieve a certain set of columns from the rows with 
the max(primary key) based on my group by):
select pkey_sid, column54, column44, column47\\67 from public.mytable 
where pkey_sid in (select max(pkey_sid) from public.mytable group by 
column1, column2);

So I need to see how I can speed this up. Is my approach misguided, or 
are there other ways I can go about it? Any thoughts, suggestions, or 
info would be greatly appreciated. And I tried to explain it all easily, 
if I can be more clear let me know.

Thanks,
- Brian F

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general