Search Postgresql Archives

Re: select distinct and index usage

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Apr 7, 2008, at 9:47 AM, David Wilson wrote:
On Mon, Apr 7, 2008 at 2:05 AM, Alban Hertroys
<dalroi@xxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
The databases estimates seem consistent with yours, so why is it doing this? Could you provide an EXPLAIN ANALYSE? It shows the actual numbers next
to the estimates, although I figure that query might take a while...

explain analyze select distinct datestamp from vals;
                                                               QUERY
PLAN
---------------------------------------------------------------------- -------------------------------------------------------------------
 Unique  (cost=15003047.47..15380004.83 rows=4263 width=4) (actual
time=649599.159..721671.595 rows=4252 loops=1)
   ->  Sort  (cost=15003047.47..15191526.15 rows=75391472 width=4)
(actual time=649599.157..694392.602 rows=75391476 loops=1)
         Sort Key: datestamp
         Sort Method:  external merge  Disk: 1178592kB
         ->  Seq Scan on vals  (cost=0.00..1531261.72 rows=75391472
width=4) (actual time=9.104..93130.468 rows=75391476 loops=1)
 Total runtime: 722379.434 ms

Wow, great estimates! The planner obviously knows how your data is structured. So much for the bad planner estimation scenario...

I haven't seen this "external merge Disk"-sort method before, maybe it's new in 8.3, but it doesn't look promising for query performance. Considering it's using 1.1GB it seems the planner may have chosen for the least memory exhaustive method; I have to admit I don't know the planner in that much detail. Take this with a grain of salt, but my guess is that as the index is even bigger, the planner figures this approach would involve the least disk i/o and will therefore be faster.

Have you tried this query with enable_seqscan=off? If my guess is right (and the planners, in that case) it'd be even slower.

Something that might help you, but I'm not sure whether it might hurt the performance of other queries, is to cluster that table on val_datestamp_idx. That way the records are already (mostly) sorted on disk in the order of the datestamps, which seems to be the brunt of above query plan.

Pg estimates the costs quite high too. It's almost as if there isn't an index on that column and it has no other way then doing a sequential scan... Could you show us the table definition and its indexes? What version of Pg
is this?

Pg is 8.3.1

Table definition:
CREATE TABLE vals (
    sid integer NOT NULL,
    eid integer NOT NULL,
    datestamp integer NOT NULL,
    val_dur integer NOT NULL,
    acc real NOT NULL,
    yld real NOT NULL,
    rt real NOT NULL,
    ydev real NOT NULL,
    vydev real NOT NULL,
    adev real NOT NULL,
    achange real NOT NULL,
    ychange real NOT NULL,
    arsi real NOT NULL,
    yrsi real NOT NULL,
    UNIQUE (sid,eid,val_dur,datestamp),
    FOREIGN KEY (sid,eid,datestamp) REFERENCES preds
(sid,eid,datestamp) ON DELETE CASCADE
);
create index val_datestamp_idx on vals(datestamp);
create index val_evaluator_idx on vals(eid);
create index val_search_key on vals(val_dur,eid,datestamp);
create index val_vd_idx on vals(val_dur,datestamp);

There seems to be quite a bit of overlap in your index definitions. From my experience this can confuse the planner.

I suggest you combine them, but not knowing your data... Maybe rewriting your UNIQUE constraint to (val_dur, datestamp, eid, sid) would be enough to replace all those other indexes. If not, it's probably better to have one index per column, so that the planner is free to combine them as it sees fit. That'd result in a bitmap index scan, btw.

(The various indices are for a variety of common queries into the table)

It may be that your index on vals.datestamp doesn't fit into memory; what
are the relevant configuration parameters for your database?

That's a very good question. I recently had to rebuild this particular
database and haven't played with the configuration parameters as much
as I'd like- what parameters would be most relevant here? I hadn't
realized that an index needed to fit into memory.

Well, it doesn't _need_ to fit into memory, but if the database needs to fetch different parts of it from disk or swap, the costs of using the index will sear up. Especially random access would be bad.

Anything that fits entirely into memory will be faster than having to fetch it from disk, as long as it doesn't mean other things will have to come from disk instead.

I'm not a postgresql tuning expert (I know my way around though), other people can explain you way better than I can. Bruce Momjian for example: http://www.linuxjournal.com/article/4791

pg_total_relation_size('vals') - pg_relation_size('vals') gives 11gb.
All indexed columns are integers. My guess is that this means that
it's likely the index doesn't fit into memory.

That calculation doesn't look familiar to me, I'm more used to:
 select pg_size_pretty(pg_relation_size('...'));

You can put the name of any relation in there, be it tables, indexes, etc.

11GB is pretty large for an index on an integer column, especially with only 75M rows: that's 146 bytes/row in your index. Maybe your index got bloated somehow? I think it should be about a tenth of that.

Regards,
Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.


!DSPAM:737,47fa5cf0927661607113844!




[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]
  Powered by Linux