Slow query on CLUTER -ed tables

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Given two tables:

CREATE TABLE product_price_history
(
  hid bigint NOT NULL,
  hdate timestamp without time zone NOT NULL,
  id bigint NOT NULL,
  product_id bigint NOT NULL,
.... more columns here
  CONSTRAINT pk_product_price_history PRIMARY KEY (hid);

CREATE INDEX idx_product_price_history_id_hdate
  ON product_price_history
  USING btree
  (id, hdate);


CREATE TABLE product_price_offer_history
(
  hid bigint NOT NULL,
  product_price_id bigint NOT NULL,
  isfeatured smallint NOT NULL,
  price double precision NOT NULL,
  shipping double precision NOT NULL,
    .... some more coumns here
  CONSTRAINT pk_product_price_offer_history PRIMARY KEY (hid, offerno)
);

Stats:

product_price_history - tablesize=23GB, indexes size=4GB, row count = 87 million product_price_offer_history - tablesize=24GB, indexes size=7GB, row count = 235 million


These tables store historical data of some million products from the last year.
The following commands are executed on them daily:

CLUSTER idx_product_price_history_id_hdate on product_price_history;
CLUSTER pk_product_price_offer_history on product_price_offer_history;

Here is a query:

select
  date_part('epoch', min(pph.hdate) )  as hdate_ticks,
  min(ppoh.price+ppoh.shipping) as price_plus_shipping
from
  product_price_history pph
  inner join product_price_offer_history ppoh on ppoh.hid = pph.hid
where pph.id = 37632081
 and ppoh.isfeatured=1
group by ppoh.merchantid,pph.hid,pph.hdate
order by pph.hid asc


I think that the query plan is correct:


"GroupAggregate  (cost=5553554.25..5644888.17 rows=2283348 width=50)"
"  ->  Sort  (cost=5553554.25..5559262.62 rows=2283348 width=50)"
"        Sort Key: pph.hid, ppoh.merchantid, pph.hdate"
"        ->  Nested Loop  (cost=0.00..5312401.66 rows=2283348 width=50)"
" -> Index Scan using idx_product_price_history_id_hdate on product_price_history pph (cost=0.00..8279.80 rows=4588 width=16)"
"                    Index Cond: (id = 37632081)"
" -> Index Scan using pk_product_price_offer_history on product_price_offer_history ppoh (cost=0.00..1149.86 rows=498 width=42)"
"                    Index Cond: (ppoh.hid = pph.hid)"
"                    Filter: (ppoh.isfeatured = 1)"

So it uses two index scans on the indexes we CLUSTER the tables on. Number of rows returned is usually between 100 and 20 000.


Here is the problem. When I first open this query for a given identifier, it runs for 100 seconds. When I try to run it again for the same identifier it returns the same rows within one second!

The indexes are very well conditioned: from the 235 million rows, any id given occurs at most 20 000 times. It is a btree index, so it should already be stored sorted, and the 20 000 rows to be returned should fit into a few database pages. Even if they are not in the cache, PostgreSQL should be able to read the required pages within a second.

I understand that for an index scan, PostgreSQL also needs to read the rows from the table. But since these tables are CLUSTER-ed on those specific indexes, all the data needed shoud fit on a few database pages and PostgreSQL should be able to read them within a second.

Then why it is taking 100 seconds to do the query for the first time and why it is just one sec for the second time? Probably my thinking is wrong, but I suppose it means that the data is spread on thousands of pages on the disk.

How is that possible? What am I doing wrong?

Thanks,

   Laszlo


--
This message has been scanned for viruses and
dangerous content by MailScanner, and is
believed to be clean.


--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[Postgresql General]     [Postgresql PHP]     [PHP Users]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Yosemite]

  Powered by Linux