Re: Hash join on int takes 8..114 seconds

2. Run the following commands periodically in this order:

VACUUM FULL;
vacuum full pg_shdepend;
CLUSTER rid on (toode);
CLUSTER dok  on (kuupaev);
REINDEX DATABASE mydb;
REINDEX SYSTEM mydb;
ANALYZE;

Are all those commands required, or can something be left out?

Running CLUSTER after VACUUM FULL is just a waste of time. In my experience CLUSTER is actually faster on such heavily bloated tables - I think that's because it builds the indexes from scratch instead of updating them as it goes (which is what VACUUM FULL does).

And since CLUSTER effectively performs a REINDEX as well, I'd just run

CLUSTER rid ON rid_pkey;
CLUSTER dok ON dok_pkey;
ANALYZE rid;
ANALYZE dok;

Clustering by other indexes might give better performance - using the primary keys is just a safe guess here. This should improve the performance of your query, and these two tables seem to be the most bloated ones anyway.

I wouldn't do the same maintenance on the other tables now - it's just a waste of time.
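
If you later want to check whether other tables are getting bloated and deserve the same treatment, the statistics collector gives a rough idea. A sketch (the dead-tuple counters are only maintained when autovacuum or ANALYZE has run recently, so treat the numbers as approximate):

```sql
-- Rough bloat overview: tables with many dead tuples relative to
-- live ones are the best candidates for CLUSTER.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       pg_size_pretty(pg_relation_size(relid)) AS table_size
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```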


Several other things to consider:

1) Regarding the toode column - why are you using CHAR(20) when the values
are actually shorter? This may significantly increase the amount of space
required.

There may be some products whose codes are up to 20 characters long.
PostgreSQL does not store trailing spaces in the database, so this does *not*
affect the space used.

OK, I haven't realized this. You're right.

2) I've noticed the CPU used is a Celeron, which may negatively affect the
speed of the hash computation. I'd try to replace the join key with something
faster to hash - say an INTEGER used as an artificial primary key of the
"toode" table and as a FK in the other tables. This might improve the
"Bitmap Heap Scan on rid" part, but yes - it's just a minor improvement
compared to the "Hash Join" part of the query.

The natural key toode CHAR(20) is used widely in different queries. Replacing it with an INT surrogate key would require a major application rewrite.

Should I add surrogate INT key columns to the toode and rid tables and
measure the test query speed in that case?

Test it. Create tables with fake data, and compare the performance with and without the surrogate keys. Using a simple data type instead of text gave me a huge performance boost. For example, one of my colleagues used VARCHAR(15) to store IP addresses and then used them to join tables (and suffered from the poor performance). Replacing them with INET improved the speed by several orders of magnitude.
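
Such a test could look like this - a sketch with made-up table names and row counts (adjust them to match your real schema and data volumes):

```sql
-- Hypothetical benchmark: join on CHAR(20) vs. join on an INT surrogate key.
CREATE TABLE toode_test (
    toode    CHAR(20) PRIMARY KEY,
    toode_id INTEGER UNIQUE          -- candidate surrogate key
);

CREATE TABLE rid_test (
    toode    CHAR(20),
    toode_id INTEGER
);

-- 100k products, 1M detail rows referencing them.
INSERT INTO toode_test
SELECT 'P' || i, i FROM generate_series(1, 100000) AS s(i);

INSERT INTO rid_test
SELECT 'P' || (1 + i % 100000), 1 + i % 100000
FROM generate_series(1, 1000000) AS s(i);

ANALYZE toode_test;
ANALYZE rid_test;

-- Run both with EXPLAIN ANALYZE (or \timing in psql) and compare:
-- SELECT count(*) FROM rid_test JOIN toode_test USING (toode);
-- SELECT count(*) FROM rid_test JOIN toode_test USING (toode_id);
```

If the INT join is clearly faster at realistic data sizes, the surrogate-key rewrite may be worth the effort.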

Materialized views seem like a good idea to me, but maybe I'm not seeing
something. What do you mean by "reports are different"? If there is a lot
of rows for a given product / day, then creating an aggregated table with
(product code / day) as a primary key is quite simple. It may require a
lot of disk space, but it'll remove the hash join overhead. But if the
queries are very different, then it may be difficult to build such
materialized view(s).

The log file shows that mostly just queries of this form are slow:

SELECT ...
  FROM dok JOIN rid USING (dokumnr)
  JOIN ProductId USING (ProductId)
  WHERE rid.ProductId LIKE :p1 || '%' AND dok.SaleDate>=:p2

:p1 and :p2 are parameters different for different queries.

dok contains several years of data. :p2 usually covers only the last few
months, or at most a year back.
The SELECT column list contains a fixed list of known columns from all the tables.

How can I create an index or a materialized view to optimize these types of queries?

Well, it's difficult to answer without detailed information about the queries you want to run, the aggregated values, etc. Materialized views are a world of their own, and the best solution depends on (for example):

1) what aggregated values you are interested in (additive values are the
   most primitive ones, while VARIANCE etc. make things difficult)

2) do you need current data, or is it OK that today's data are not
   available till midnight (for example)?

Another thing you have to consider is whether you want to fill the materialized view with final or intermediate data and then compute the final data from it somehow (for example monthly totals from daily totals).

The most primitive (but often sufficient) solution is recreating the materialized view periodically (for example every midnight). In your case it'd mean running something like

CREATE TABLE materialized_view AS SELECT ... your query here ...
GROUP BY productId, saleDate

This gives you daily totals for each product - the clients can then run another query to compute the final data.
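
Assuming the materialized view has columns like productId, saleDate and a precomputed total (the names are just an illustration), such a client query could be as simple as:

```sql
-- Monthly totals per product, computed from the daily aggregates.
SELECT productId, sum(total) AS month_total
FROM materialized_view
WHERE saleDate >= date_trunc('month', current_date)
GROUP BY productId;
```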

But of course, if you need to keep the data 'current', you may create a set of triggers to update the materialized view - either after each modification or (more sophisticated) only when it's needed.
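
A minimal trigger-based version might look like this - a sketch assuming a rid row carries productId, saleDate and an amount column (all hypothetical names), handling only INSERT; UPDATE and DELETE would need similar branches:

```sql
CREATE OR REPLACE FUNCTION update_materialized_view() RETURNS trigger AS $$
BEGIN
    -- Try to add the new amount to the existing daily total.
    UPDATE materialized_view
       SET total = total + NEW.amount
     WHERE productId = NEW.productId AND saleDate = NEW.saleDate;

    -- If there was no row for this product/day yet, create one.
    IF NOT FOUND THEN
        INSERT INTO materialized_view (productId, saleDate, total)
        VALUES (NEW.productId, NEW.saleDate, NEW.amount);
    END IF;

    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER rid_update_mv
    AFTER INSERT ON rid
    FOR EACH ROW EXECUTE PROCEDURE update_materialized_view();
```

Note that this naive upsert can race under concurrent inserts - in practice you'd lock the target row or catch the unique violation and retry.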

See for example a great presentation from this year's PGCon:

http://www.pgcon.org/2008/schedule/events/69.en.html


regards
Tomas

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
