Re: Query planner from 6 seconds TO DAYS

Böckler Andreas wrote:
> On 25.10.2012 at 20:22, Kevin Grittner wrote:

>> The idea is to model actual costs on your system. You don't show
>> your configuration or describe your hardware, but you show an
>> estimate of retrieving over 4000 rows through an index and
>> describe a response time of 4 seconds, so you must have some
>> significant part of the data cached.
> Sure, my effective_cache_size is 10 GB.
> But my right table is 1.2 TB (yes, tera) in size at the moment
> (partitioned into 40 GB slices) and has 3 * 10^9 records.

You're getting up to a third of the size of what I've managed, so
we're in the same ballpark. I've gone to hundreds of millions of rows
in a single table, without partitioning, with good performance. I
realize that in raw rowcount for one table you're at almost ten times
what I've run that way, but I have no reason to think that it falls
over between those points. There are situations where partitioning
helps, but I have not found raw rowcount to be a very good basis for
making the call. What are your reasons for going that way?

> My left table is only 227 MB and 1 million records.
> Peanuts.

Absolutely.

>> I would see how the workload behaves with the following settings:
>> 
>> effective_cache_size = <your shared_buffers setting plus what the
>> OS shows as cached pages>
>> seq_page_cost = 1
>> random_page_cost = 2
>> cpu_tuple_cost = 0.05
>> 
>> You can set these in a session and check the plan with EXPLAIN.
>> Try various other important queries with these settings and
>> variations on them. Once you hit the right factors to model your
>> actual costs, the optimizer will make better choices without
>> needing to tinker with it each time.
> 
>  I've played with that already …
> 
> NESTED LOOP -> GOOD
> SEQSCAN -> VERY BAD
> 
> SET random_page_cost = 4;
> '2012-08-14' AND '2012-08-30' -> NESTED LOOP
> '2012-08-13' AND '2012-08-30' -> SEQSCAN
> SET random_page_cost = 2;
> '2012-08-14' AND '2012-08-30' -> NESTED LOOP
> '2012-08-07' AND '2012-08-30' -> NESTED LOOP
> '2012-08-06' AND '2012-08-30' -> SEQSCAN
> SET random_page_cost = 1;
> '2012-08-14' AND '2012-08-30' -> NESTED LOOP
> '2012-08-07' AND '2012-08-30' -> NESTED LOOP
> '2012-07-07' AND '2012-08-30' -> NESTED LOOP
> '2012-07-06' AND '2012-08-30' -> SEQSCAN

What impact did setting cpu_tuple_cost have?
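For reference, the kind of per-session experiment I mean looks like
this (the table, column, and join names below are placeholders, not
your actual schema -- substitute your real query):

```sql
-- Session-local only: these settings do not persist and do not
-- affect other backends.
SET seq_page_cost = 1;
SET random_page_cost = 2;
SET cpu_tuple_cost = 0.05;

-- Hypothetical query shape standing in for your real join.
EXPLAIN
SELECT *
FROM small_table s
JOIN big_partitioned_table b ON b.small_id = s.id
WHERE b.ts BETWEEN '2012-08-06' AND '2012-08-30';
```

Sweeping the date range while varying one cost factor at a time, as
you did for random_page_cost, shows where each factor moves the
switchover point.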

> The thing is ..
> - You can alter whatever you want; the planner will still switch at
>   some width of the time range.
> - There is not one case where the seqscan method will be better.
>   It's not possible.

I would be interested to see what you consider to be the proof of
that. In most benchmarks where people have actually measured it, the
seqscan becomes faster once you are selecting more than about 10% of
a table: the index scan has to jump all over the disk to read the
index pages and the matching rows in the heap, while a seqscan can
take advantage of the OS's readahead. (Perhaps you need to tweak that
OS setting?) Of course, the more of the data is cached, the smaller
the penalty for random access and the less attractive seqscans
become.
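If you want to turn "it's not possible" into numbers, you can steer
the planner away from seqscans for one session and time both plans
directly (enable_seqscan only discourages seqscans, and the query
shape here is again a placeholder):

```sql
-- Discourage seqscans for this session only, then time the
-- index-driven plan.
SET enable_seqscan = off;
EXPLAIN ANALYZE
SELECT * FROM big_partitioned_table
WHERE ts BETWEEN '2012-07-06' AND '2012-08-30';

-- Back to normal; time whatever plan the planner picks on its own.
RESET enable_seqscan;
EXPLAIN ANALYZE
SELECT * FROM big_partitioned_table
WHERE ts BETWEEN '2012-07-06' AND '2012-08-30';
```

Comparing the actual runtimes at several range widths would show
whether a crossover point exists on your hardware.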

Any attempt to force the planner to always ignore sequential scans
is sure to make some types of queries slower -- sometimes much
slower. Hints or other ways to force a plan are far inferior to
modelling your costs better. You might want to give that a try.
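Once you find settings that model your hardware, you can also persist
them more selectively than in postgresql.conf (the tablespace and
database names here are hypothetical):

```sql
-- Cheaper random I/O for tables on a fast or well-cached tablespace:
ALTER TABLESPACE fast_storage
  SET (random_page_cost = 2, seq_page_cost = 1);

-- Or apply a cost factor to every new session on one database:
ALTER DATABASE mydb SET cpu_tuple_cost = 0.05;
```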

-Kevin


-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

