Re: What is the right way to deal with a table with rows that are not in a random order?

Scott Marlowe wrote:

> Douglas Alan wrote:

>> Okay -- no problem:
>>
>>    set enable_seqscan = on;
>>    explain analyze select * from maindb_astobject
>>    where survey_id = 2
>>    limit 1000;
>>
>>    "Limit  (cost=0.00..48.03 rows=1000 width=78) (actual
>> time=84837.835..265938.258 rows=1000 loops=1)"
>>    "  ->  Seq Scan on maindb_astobject  (cost=0.00..3538556.10
>> rows=73675167 width=78) (actual time=84837.825..265932.121 rows=1000
>> loops=1)"
>>    "        Filter: (survey_id = 2)"
>>    "Total runtime: 265942.416 ms"
>>
>>    set enable_seqscan = off;
>>    explain analyze select * from maindb_astobject
>>    where survey_id = 2
>>    limit 1000;
>>
>>    "Limit  (cost=0.00..67.37 rows=1000 width=78) (actual
>> time=172.248..225.219 rows=1000 loops=1)"
>>    "  ->  Index Scan using maindb_astobject_survey_id on
>> maindb_astobject  (cost=0.00..4963500.87 rows=73675167 width=78)
>> (actual time=172.240..221.078 rows=1000 loops=1)"
>>    "        Index Cond: (survey_id = 2)"
>>    "Total runtime: 227.412 ms"

> What was the random_page_cost during these two queries?

4

> Assuming seq_page_cost is 1,

Yes, it is.
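
For the record, both are easy to confirm per session:

    show random_page_cost;   -- 4
    show seq_page_cost;      -- 1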

> and random_page_cost was 4 or something, lowering it should
> force the move to an index scan.

I just tried changing random_page_cost to 1, but the query still
does a seq scan.
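
In full, the attempt looked roughly like this:

    set random_page_cost = 1;
    explain analyze select * from maindb_astobject
    where survey_id = 2
    limit 1000;
    -- plan is still the Seq Scan on maindb_astobject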

> If you were already at 1.0 or so, then yeah, the cost
> estimation is off.  Since index scans cost CPU (at least I
> think they do), you might try lowering your cpu_* costs to see
> if that helps

How would lowering random_page_cost and all the cpu_* costs differ
from just increasing seq_page_cost?

I have to raise seq_page_cost from 1 to 34 to force an index
scan.  I can't imagine that changing this value so radically
would be a good idea.
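
That is, roughly:

    set seq_page_cost = 34;  -- the value I had to reach to force the index scan
    explain analyze select * from maindb_astobject
    where survey_id = 2
    limit 1000;
    -- now: Index Scan using maindb_astobject_survey_id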

Alternatively, if I set random_page_cost to 1, and
cpu_tuple_cost, cpu_index_tuple_cost, and cpu_operator_cost all
to 0, this still doesn't cause Postgres to do an index scan for
this query.
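
I.e., along these lines:

    set random_page_cost = 1;
    set cpu_tuple_cost = 0;
    set cpu_index_tuple_cost = 0;
    set cpu_operator_cost = 0;
    explain analyze select * from maindb_astobject
    where survey_id = 2
    limit 1000;
    -- still a Seq Scan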

>> P.S. Here are the stats on the column. It appears that my recollection
>> of 25% of the table matching was a bit off. It's actually 98.5%!  That
>> might help explain why Postgres wants to do a sequential scan. The
>> problem is that this still means it has to scan a million rows
>> sequentially before it finds a single matching row, as the matching
>> rows are at the end of the table.

> Yeah, that's a really skewed distribution.  Partitioning may work out,
> especially if you often select on that one field.
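
If I went the partitioning route, I assume it would be the usual
inheritance-plus-CHECK-constraint setup (child table name here is
made up):

    create table maindb_astobject_survey2 (
        check (survey_id = 2)
    ) inherits (maindb_astobject);

    -- with constraint exclusion enabled, a query filtering on
    -- survey_id = 2 should touch only the child table:
    set constraint_exclusion = on;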

Is there a way for me to alter the statistics table?  I tried
changing the values in pg_stats, but that table is just a view,
so Postgres won't let me do it.  pg_statistic, on the other hand,
is rather opaque.
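
The failed attempt was along these lines (using the pg_stats
column names):

    update pg_stats
       set most_common_vals = null,
           most_common_freqs = null
     where tablename = 'maindb_astobject'
       and attname = 'survey_id';
    -- rejected: pg_stats is a read-only view over pg_statistic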

Alternatively, can I delete the statistics for the column?  It's
the statistics that are hurting me here.
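
Presumably the direct route, as superuser, would be something like
this (untested sketch; hacking the catalog directly makes me
nervous):

    delete from pg_statistic
     where starelid = 'maindb_astobject'::regclass
       and staattnum = (select attnum
                          from pg_attribute
                         where attrelid = 'maindb_astobject'::regclass
                           and attname = 'survey_id');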

To delete the statistics, I tried setting statistics for the
column to 0 and analyzing the column, but that just left the
current statistics in place.  So I tried setting statistics to 1,
but that's one value too many to eliminate this problem!
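
In SQL, the two attempts were roughly:

    alter table maindb_astobject alter column survey_id set statistics 0;
    analyze maindb_astobject (survey_id);
    -- old stats for the column stay in pg_statistic

    alter table maindb_astobject alter column survey_id set statistics 1;
    analyze maindb_astobject (survey_id);
    -- one most-common value gets recorded, and one is all it takes
    -- for the planner to expect 98.5% of the table to match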

|>ouglas

