2. Run the following commands periodically in this order:
VACUUM FULL;
VACUUM FULL pg_shdepend;
CLUSTER rid ON (toode);
CLUSTER dok ON (kuupaev);
REINDEX DATABASE mydb;
REINDEX SYSTEM mydb;
ANALYZE;
Are all those commands required, or can something be left out?
Running VACUUM FULL and then CLUSTER is just a waste of time - CLUSTER
rewrites the table anyway. In my experience CLUSTER is actually faster on
such heavily bloated tables - I think this is because it builds the
indexes from scratch instead of updating them row by row (as VACUUM FULL
does).
And since CLUSTER effectively performs a REINDEX as well, I'd just run:
CLUSTER rid USING rid_pkey;
CLUSTER dok USING dok_pkey;
ANALYZE rid;
ANALYZE dok;
Clustering by other indexes might give better performance; using the
primary keys is just a safe guess here. This should improve the
performance of your query, and these two tables seem to be the most
bloated ones.
I wouldn't do the same maintenance on the other tables now - it's just a
waste of time.
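If you want to double-check which tables are actually worth the trouble,
pg_total_relation_size() (which includes indexes and TOAST data) makes a
quick sanity check - this isn't from your log, just a generic query:

SELECT relname,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relkind = 'r'          -- ordinary tables only
ORDER BY pg_total_relation_size(oid) DESC
LIMIT 10;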
Several other things to consider:
1) Regarding the toode column - why are you using CHAR(20) when the
values are actually shorter? This may significantly increase the amount
of space required.
There may be some products whose codes are up to 20 characters.
PostgreSQL does not store trailing spaces in the db, so this does *not*
affect the space required.
OK, I hadn't realized this. You're right.
2) I've noticed the CPU is a Celeron, which may negatively affect the
speed of hash computation. I'd try to replace the key with something
faster to hash - say an INTEGER as an artificial primary key of the
"toode" table, used as a FK in the other tables. This might improve the
"Bitmap Heap Scan on rid" part, but yes - it's just a minor improvement
compared to the "Hash Join" part of the query.
The natural key toode CHAR(20) is used widely in different queries.
Replacing it with an INT surrogate key would require a major application
rewrite. Should I instead add surrogate INT columns to the toode and rid
tables and measure the test query speed in that case?
Test it. Create tables with fake data, and compare the performance with
and without the surrogate keys. Using a simple data type instead of text
gave me a huge performance boost. For example, one of my colleagues used
VARCHAR(15) to store IP addresses and then used them to join tables (and
suffered from the poor performance). Replacing them with INET improved
the speed by several orders of magnitude.
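Something along these lines, for example - the table and column names
are made up for the experiment, so adjust the row counts and key format
to match your real data:

-- throwaway copies of the two tables, with both key variants
CREATE TABLE toode_test (
    toode    char(20) PRIMARY KEY,
    toode_id integer  UNIQUE          -- candidate surrogate key
);

INSERT INTO toode_test
SELECT 'P' || lpad(i::text, 10, '0'), i
FROM generate_series(1, 10000) i;

CREATE TABLE rid_test (
    dokumnr  integer,
    toode    char(20) REFERENCES toode_test (toode),
    toode_id integer  REFERENCES toode_test (toode_id)
);

INSERT INTO rid_test
SELECT i, 'P' || lpad((1 + i % 10000)::text, 10, '0'), 1 + i % 10000
FROM generate_series(1, 1000000) i;

ANALYZE toode_test;
ANALYZE rid_test;

-- compare the plans and timings of the same join on each key
EXPLAIN ANALYZE
SELECT count(*) FROM rid_test JOIN toode_test USING (toode);

EXPLAIN ANALYZE
SELECT count(*) FROM rid_test JOIN toode_test USING (toode_id);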
Materialized views seem like a good idea to me, but maybe I'm not seeing
something. What do you mean by "reports are different"? If there are a
lot of rows for a given product / day, then creating an aggregated table
with
(product code / day) as a primary key is quite simple. It may require a
lot of disk space, but it'll remove the hash join overhead. But if the
queries are very different, then it may be difficult to build such
materialized view(s).
The log file shows that mostly only these queries are slow:
SELECT ...
FROM dok JOIN rid USING (dokumnr)
JOIN ProductId USING (ProductId)
WHERE rid.ProductId LIKE :p1 || '%' AND dok.SaleDate >= :p2
:p1 and :p2 are parameters that differ from query to query.
dok contains several years of data; :p2 usually restricts it to only the
last few months, or at most the last year.
The SELECT column list is a fixed list of known columns from all the
tables. How can I create an index or a materialized view to optimize
these types of queries?
Well, it's difficult to answer without detailed information about the
queries you want to run, the aggregated values, etc. Materialized views
are a world of their own, and the best solution depends on (for example):
1) which aggregated values you are interested in (additive values are
the most primitive ones, while VARIANCE etc. make it difficult)
2) do you need current data, or is it OK that today's data are not
available till midnight (for example)?
Another thing to consider is whether you want to build the materialized
view with final or intermediate data and then compute the final data
from it somehow (for example monthly totals from daily totals).
The most primitive (but often sufficient) solution is recreating the
materialized view periodically (for example every midnight). In your
case it'd mean running something like
CREATE TABLE materialized_view AS SELECT ... your query here ...
GROUP BY productId, saleDate
This gives you daily totals for each product - the clients can then run
another query on top of it to compute the final data.
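For example, a concrete sketch using the tables from your query - the
sum(amount) measure and the table name are made up, so substitute
whatever your reports actually aggregate:

CREATE TABLE sales_by_product_day AS
SELECT rid.ProductId,
       dok.SaleDate,
       count(*)        AS row_cnt,
       sum(rid.amount) AS total_amount   -- hypothetical measure column
FROM dok
JOIN rid USING (dokumnr)
GROUP BY rid.ProductId, dok.SaleDate;

CREATE INDEX sales_by_product_day_idx
    ON sales_by_product_day (ProductId, SaleDate);

Monthly totals then become a cheap rollup of the daily rows:

SELECT ProductId,
       date_trunc('month', SaleDate) AS month,
       sum(total_amount) AS monthly_amount
FROM sales_by_product_day
GROUP BY ProductId, date_trunc('month', SaleDate);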
But of course, if you need to maintain 'current' data, you may create a
set of triggers to update the materialized view - either after each
modification or (more sophisticated) only when it's needed.
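A minimal sketch of the trigger approach for inserts into rid, assuming
the sales_by_product_day table above (UPDATEs and DELETEs on rid, and
date changes on dok, would need similar handling, and concurrent inserts
would need some locking or retry logic):

CREATE OR REPLACE FUNCTION sales_mv_insert() RETURNS trigger AS $$
BEGIN
    -- try to bump an existing daily row first
    UPDATE sales_by_product_day
       SET row_cnt      = row_cnt + 1,
           total_amount = total_amount + NEW.amount   -- hypothetical column
     WHERE ProductId = NEW.ProductId
       AND SaleDate  = (SELECT SaleDate FROM dok WHERE dokumnr = NEW.dokumnr);
    -- no daily row yet, so create one
    IF NOT FOUND THEN
        INSERT INTO sales_by_product_day (ProductId, SaleDate, row_cnt, total_amount)
        SELECT NEW.ProductId, d.SaleDate, 1, NEW.amount
          FROM dok d
         WHERE d.dokumnr = NEW.dokumnr;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER rid_sales_mv
AFTER INSERT ON rid
FOR EACH ROW EXECUTE PROCEDURE sales_mv_insert();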
See for example a great presentation from this year's PGCon:
http://www.pgcon.org/2008/schedule/events/69.en.html
regards
Tomas