Re: Performance problem with low correlation data

m_lists@xxxxxxxx · Thu, 9 Jul 2009 07:38:56 +0000 (GMT)

Since noone replied to http://www.mail-archive.com/pgsql-general@xxxxxxxxxxxxxx/msg133360.html, I tried another approach:

I can't cluster the whole table every day; it would take too much (as I said, table as 60M rows, and I have hundreds of them).
Plus, it wouldn't really make much sense: the only portion of table to be clustered is the one written after the last "cluster" command (since no row is deleted/updated, only inserted each 15 minutes).
So I thought: I'll "cluster" only the part that has been written every day:

begin; 
lock table testinsert in ACCESS EXCLUSIVE MODE;
insert into testinsert select ne_id+100000, t, v from testinsert where t between '2009-08-01 00:00:00' and '2009-08-02 00:00:00'  order by ne_id,t;
DELETE from testinsert where t between '2009-08-01 00:00:00' and '2009-08-02 00:00:00' and ne_id<100000;
update testinsert set ne_id = ne_id - 100000 where t between '2009-08-01 00:00:00' and '2009-08-02 00:00:00';
commit;

this would run after midnight of 2009-08-02. Next day would have different time values.
What I'm trying to do here is cluster on ne_id,t the portion of table written every day.
Well, I guess the table is layed out as expected, but in pg_stats correlation for the ne_id col is still VERY low:
select attname,n_distinct,correlation from pg_stats where tablename='testinsert3'; 
attname | n_distinct | correlation
---------+------------+-------------
ne_id   |      20000 |    0.111041  <---- low value
t       |        864 |    0.987778
v       |          1 |           1

this leads the planner to sequence scans of the table as soon as 10% of the table has to be read: 
explain  select * FROM idtable as g      inner  join testinsert on id=ne_id where  groupid between 1 and 4 and  t between '2009-08-01 00:00:00' and '2009-08-09 00:00:00' 
Hash Join  (cost=134.45..2127071.28 rows=614331 width=244)
   Hash Cond: (testinsert3.ne_id = g.id)
   ->  Seq Scan on testinsert  (cost=0.00..2063200.00 rows=15358272 width=236)
               Filter: ((t >= '2009-08-01 00:00:00'::timestamp without time zone) AND (t <= '2009-08-09 00:00:00'::timestamp without time zone))
   ->  Hash  (cost=124.45..124.45 rows=800 width=8)
               ->  Bitmap Heap Scan on idtable g  (cost=24.45..124.45 rows=800 width=8)
                          Recheck Cond: ((groupid >= 1) AND (groupid <= 4))
                               ->  Bitmap Index Scan on idtable_pk  (cost=0.00..24.25 rows=800 width=0)
                                             Index Cond: ((groupid >= 1) AND (groupid <= 4))

Which is a terrible plan! 
testinsert contains t values between '2009-08-01' and '2009-08-09', and ne_id from 1 to 20000. But only 800 out of 20000 ne_id have to be read; there's no need for a table scan!
I guess this is a reflection of the poor "correlation" on ne_id; but, as I said, I don't really think ne_id is so bad correlated.
In fact, doing a "select ne_id, t from testinsert limit 100000"  I can see that data is laid out pretty much by "ne_id, t", grouped by day (that is, same ne_id for one day, then next ne_id and so on until next day).
How is the "correlation" calculated? Can someone explain to me why, after the procedure above,correlation is so low???

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general