Re: Hash join on int takes 8..114 seconds

"Andrus" <kobruleht2@xxxxxx> · Fri, 21 Nov 2008 21:00:09 +0200

Thomas,

Thank you.

Just the most important points:

1) "dok" table contains 1235086 row versions in 171641 pages (with 8kB
pages this means 1.4GB MB of data), but there are 1834279 unused item
pointers (i.e. about 60% of the space is wasted)

2) "rid" table contains 3275189 roiws in 165282 (with 8kB pages this means
about 1.3GB of data), but there are 1878923 unused item pointers (i.e.
about 30% of the space is wasted)

3) don't forget to execute analyze after vacuuming (or vacuum analyze)

autovacuum is running.
So if I understand properly, I must ran
VACUUM FULL ANALYZE dok;
VACUUM FULL ANALYZE rid;

Those commands cause server probably to stop responding to other client like
vacuum full pg_shdepend
did.

Should vacuum_cost_delay = 2000 allow other users to work when running those
commands ?

4) I'm not sure why the sizes reported by you (for example 2.3GB vs 1.5GB
for "doc" table) - the difference seems too large for me.

I used pg_total_relation_size(). So 2.3 GB includes indexes also:

  8        44286 dok_tasudok_idx                        245 MB
    10        44283 dok_klient_idx                         142 MB
   18        44288 dok_tasumata_idx                       91 MB
   19        44289 dok_tellimus_idx                       89 MB
    20        44284dok_krdokumnr_idx                      89 MB
    21        44285 dok_kuupaev_idx                        84 MB
    22        43531 makse_pkey                             77 MB
    23        43479 dok_pkey                               74 MB
    24        44282 dok_dokumnr_idx                        74 MB
    26     18663923 dok_yksus_pattern_idx                  43 MB
    27     18801591 dok_sihtyksus_pattern_idx              42 MB

Anyway the amount of wasted rows seems significant to me - I'd try to
solve this first. Either by VACUUM FULL or by CLUSTER. The CLUSTER will
lock the table exclusively, but the results may be better (when sorting by
a well chosen index). Don't forget to run ANALYZE afterwards.

How to invoke those commands so that other clients can continue work?
I'm using 8.1.4.
Log files show that autovacuum is running.

I'm planning the following solution:

1. Set

vacuum_cost_delay=2000

2. Run the following commands periodically in this order:

VACUUM FULL;
vacuum full pg_shdepend;
CLUSTER rid on (toode);
CLUSTER dok  on (kuupaev);
REINDEX DATABASE mydb;
REINDEX SYSTEM mydb;
ANALYZE;

Are all those command required or can something leaved out ?

Several other things to consider:

1) Regarding the toode column - why are you using CHAR(20) when the values
are actually shorter? This may significantly increase the amount of space
required.

There may be some products whose codes may be up to 20 characters.
PostgreSQL does not hold trailing spaces in db, so this does *not* affect to
space.

2) I've noticed the CPU used is Celeron, which may negatively affect the
speed of hash computation. I'd try to replace it by something faster - say
INTEGER as an artificial primary key of the "toode" table and using it as
a FK in other tables.  This might improve the "Bitmap Heap Scan on rid"
part, but yes - it's just a minor improvement compared to the "Hash Join"
part of the query.

Natural key Toode CHAR(20) is used widely in different queries. Replacing it
with
INT surrogate key requires major application rewrite.

Should I add surrogate index INT columns to toode and rid table and measure
test query speed in this case?

Materialized views seem like a good idea to me, but maybe I'm not seeing
something. What do you mean by "reports are different"? If there is a lot
of rows for a given product / day, then creating an aggregated table with
(product code / day) as a primary key is quite simple. It may require a
lot of disk space, but it'll remove the hash join overhead. But if the
queries are very different, then it may be difficult to build such
materialized view(s).

log file seems that mostly only those queries are slow:

SELECT ...
  FROM dok JOIN rid USING (dokumnr)
  JOIN ProductId USING (ProductId)
  WHERE rid.ProductId LIKE :p1 || '%' AND dok.SaleDate>=:p2

:p1 and :p2 are parameters different for different queries.

dok contains several years of data. :p2 is usually only few previous months
or last year ago.
SELECT column list contains fixed list of known columns from all tables.

How to create index or materialized view to optimize this types of queries ?

Andrus.

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance