Re: Hash join on int takes 8..114 seconds

2. Run the following commands periodically in this order:

VACUUM FULL;
vacuum full pg_shdepend;
CLUSTER rid on (toode);
CLUSTER dok  on (kuupaev);
REINDEX DATABASE mydb;
REINDEX SYSTEM mydb;
ANALYZE;

Are all those commands required, or can something be left out?

Running CLUSTER after VACUUM FULL is just a waste of time. In my experience CLUSTER is actually faster on such heavily bloated tables - I think that's because it builds the indexes from scratch instead of updating them as it goes (which is what VACUUM FULL does).

And since CLUSTER effectively performs a REINDEX as well, I'd just run

CLUSTER rid ON rid_pkey;
CLUSTER dok ON dok_pkey;
ANALYZE rid;
ANALYZE dok;

Clustering by other indexes might give better performance - using the primary keys is just a safe guess here. This should improve the performance of your query, and these two tables seem to be the most bloated ones anyway.

I wouldn't do the same maintenance on the other tables now - it's just a waste of time.
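
If you later want to check whether other tables are getting bloated and deserve the same treatment, the statistics collector gives a rough idea. A sketch (the dead-tuple counters are only maintained when autovacuum or ANALYZE has run recently, so treat the numbers as approximate):

```sql
-- Rough bloat overview: tables with many dead tuples relative to
-- live ones are the best candidates for CLUSTER.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       pg_size_pretty(pg_relation_size(relid)) AS table_size
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```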


Several other things to consider:

1) Regarding the toode column - why are you using CHAR(20) when the values
are actually shorter? This may significantly increase the amount of space
required.

There may be some products whose codes are up to 20 characters long.
PostgreSQL does not store trailing spaces in the database, so this does *not*
affect the space used.

OK, I haven't realized this. You're right.

2) I've noticed the CPU used is a Celeron, which may negatively affect the
speed of the hash computation. I'd try to replace the join key with something
faster to hash - say an INTEGER used as an artificial primary key of the
"toode" table and as a FK in the other tables. This might improve the
"Bitmap Heap Scan on rid" part, but yes - it's just a minor improvement
compared to the "Hash Join" part of the query.

The natural key toode CHAR(20) is used widely in different queries. Replacing it with an INT surrogate key would require a major application rewrite.

Should I add surrogate INT key columns to the toode and rid tables and
measure the test query speed in that case?

Test it. Create tables with fake data, and compare the performance with and without the surrogate keys. Using a simple data type instead of text gave me a huge performance boost. For example, one of my colleagues used VARCHAR(15) to store IP addresses and then used them to join tables (and suffered from the poor performance). Replacing them with INET improved the speed by several orders of magnitude.
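
Such a test could look like this - a sketch with made-up table names and row counts (adjust them to match your real schema and data volumes):

```sql
-- Hypothetical benchmark: join on CHAR(20) vs. join on an INT surrogate key.
CREATE TABLE toode_test (
    toode    CHAR(20) PRIMARY KEY,
    toode_id INTEGER UNIQUE          -- candidate surrogate key
);

CREATE TABLE rid_test (
    toode    CHAR(20),
    toode_id INTEGER
);

-- 100k products, 1M detail rows referencing them.
INSERT INTO toode_test
SELECT 'P' || i, i FROM generate_series(1, 100000) AS s(i);

INSERT INTO rid_test
SELECT 'P' || (1 + i % 100000), 1 + i % 100000
FROM generate_series(1, 1000000) AS s(i);

ANALYZE toode_test;
ANALYZE rid_test;

-- Run both with EXPLAIN ANALYZE (or \timing in psql) and compare:
-- SELECT count(*) FROM rid_test JOIN toode_test USING (toode);
-- SELECT count(*) FROM rid_test JOIN toode_test USING (toode_id);
```

If the INT join is clearly faster at realistic data sizes, the surrogate-key rewrite may be worth the effort.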

Materialized views seem like a good idea to me, but maybe I'm not seeing
something. What do you mean by "reports are different"? If there is a lot
of rows for a given product / day, then creating an aggregated table with
(product code / day) as a primary key is quite simple. It may require a
lot of disk space, but it'll remove the hash join overhead. But if the
queries are very different, then it may be difficult to build such
materialized view(s).

The log file shows that mostly just queries of this form are slow:

SELECT ...
  FROM dok JOIN rid USING (dokumnr)
  JOIN ProductId USING (ProductId)
  WHERE rid.ProductId LIKE :p1 || '%' AND dok.SaleDate>=:p2

:p1 and :p2 are parameters different for different queries.

dok contains several years of data. :p2 usually covers only the last few
months, or at most a year back.
The SELECT column list contains a fixed list of known columns from all the tables.

How can I create an index or a materialized view to optimize these types of queries?

Well, it's difficult to answer without detailed information about the queries you want to run, the aggregated values, etc. Materialized views are a world of their own, and the best solution depends on (for example):

1) what aggregated values you are interested in (additive values are the
   most primitive ones, while VARIANCE etc. make things difficult)

2) do you need current data, or is it OK that today's data are not
   available till midnight (for example)?

Another thing you have to consider is whether you want to fill the materialized view with final or intermediate data and then compute the final data from it somehow (for example monthly totals from daily totals).

The most primitive (but often sufficient) solution is recreating the materialized view periodically (for example every midnight). In your case it'd mean running something like

CREATE TABLE materialized_view AS SELECT ... your query here ...
GROUP BY productId, saleDate

This gives you daily totals for each product - the clients can then run another query to compute the final data.
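
Assuming the materialized view has columns like productId, saleDate and a precomputed total (the names are just an illustration), such a client query could be as simple as:

```sql
-- Monthly totals per product, computed from the daily aggregates.
SELECT productId, sum(total) AS month_total
FROM materialized_view
WHERE saleDate >= date_trunc('month', current_date)
GROUP BY productId;
```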

But of course, if you need to keep the data 'current', you may create a set of triggers to update the materialized view - either after each modification or (more sophisticated) only when it's needed.
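
A minimal trigger-based version might look like this - a sketch assuming a rid row carries productId, saleDate and an amount column (all hypothetical names), handling only INSERT; UPDATE and DELETE would need similar branches:

```sql
CREATE OR REPLACE FUNCTION update_materialized_view() RETURNS trigger AS $$
BEGIN
    -- Try to add the new amount to the existing daily total.
    UPDATE materialized_view
       SET total = total + NEW.amount
     WHERE productId = NEW.productId AND saleDate = NEW.saleDate;

    -- If there was no row for this product/day yet, create one.
    IF NOT FOUND THEN
        INSERT INTO materialized_view (productId, saleDate, total)
        VALUES (NEW.productId, NEW.saleDate, NEW.amount);
    END IF;

    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER rid_update_mv
    AFTER INSERT ON rid
    FOR EACH ROW EXECUTE PROCEDURE update_materialized_view();
```

Note that this naive upsert can race under concurrent inserts - in practice you'd lock the target row or catch the unique violation and retry.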

See for example a great presentation from this year's PGCon:

http://www.pgcon.org/2008/schedule/events/69.en.html


regards
Tomas

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
