On Mon, 2016-03-28 at 20:23 +0100, Geoff Winkless wrote:
> So I accept that when using MIN(sc_id) against scdate it makes
> statistical sense to use the sc_id index for a reasonable percentage
> of the full range of scdate, unless we know in advance that scdate is
> closely correlated to sc_id (because using MIN means we can stop
> immediately we hit a value).
>
> However I'm now finding a similar problem when using a multi-table
> DELETE, where the same obviously can't apply.
>
> This query:
>
>   DELETE FROM pa USING legs
>   WHERE legs.scdate BETWEEN 20160220 AND 20160222
>     AND legs.sc_id = pa.sc_id;
>
> does what one would hope:
>
>   Delete on pa  (cost=0.99..705406.24 rows=36570 width=12)
>     ->  Nested Loop  (cost=0.99..705406.24 rows=36570 width=12)
>           ->  Index Scan using legs_scdate_idx on legs  (cost=0.43..1171.88 rows=14458 width=10)
>                 Index Cond: ((scdate >= 20160220) AND (scdate <= 20160222))
>           ->  Index Scan using pa_pkey on pa  (cost=0.56..48.33 rows=38 width=10)
>                 Index Cond: (sc_id = legs.sc_id)
>
> However as soon as I add an extra test for field1 IS NULL, it
> apparently goes insane:
>
>   Delete on pa  (cost=577260.90..626296.46 rows=23732 width=12) (actual time=41870.770..41870.770 rows=0 loops=1)
>     ->  Hash Join  (cost=577260.90..626296.46 rows=23732 width=12) (actual time=37886.396..41315.668 rows=44960 loops=1)
>           Hash Cond: (legs.sc_id = pa.sc_id)
>           ->  Index Scan using legs_scdate_idx on legs  (cost=0.43..1171.88 rows=14458 width=10) (actual time=0.030..13.667 rows=21281 loops=1)
>                 Index Cond: ((scdate >= 20160220) AND (scdate <= 20160222))
>           ->  Hash  (cost=481691.12..481691.12 rows=5497868 width=10) (actual time=37805.756..37805.756 rows=4875870 loops=1)
>                 Buckets: 131072  Batches: 64  Memory Usage: 4311kB
>                 ->  Seq Scan on pa  (cost=0.00..481691.12 rows=5497868 width=10) (actual time=0.008..35869.304 rows=4875870 loops=1)
>                       Filter: (field1 IS NULL)
>                       Rows Removed by Filter: 2688634
>   Planning time: 0.447 ms
>   Execution time: 41870.832 ms
>
> Running ANALYZE makes no difference.
>
> Table pa has 7522676 rows, 4834042 of which have field1 NULL, so it's
> absolutely not reasonable to expect this to be an optimal strategy.
>
> Any suggestions as to how I can improve this query?
>
> Thanks :)
>
> Geoff

What cost does:

  DELETE FROM pa
  WHERE pa.field1 IS NULL
    AND pa.sc_id IN (SELECT legs.sc_id
                     FROM legs
                     WHERE legs.scdate BETWEEN 20160220 AND 20160222);

give when you run EXPLAIN ANALYZE over it?

HTH

Rob

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
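[Editor's note: alongside the IN rewrite above, a correlated EXISTS form is another shape sometimes worth comparing under EXPLAIN ANALYZE. This is only a sketch using the table and column names from the thread (pa, legs, sc_id, scdate, field1); it has not been run against the poster's schema, and modern PostgreSQL planners often convert both IN and EXISTS subqueries into the same semi-join plan anyway.]

```sql
-- Hypothetical variant, not tested against the original tables:
-- delete the NULL-field1 rows of pa for which a matching legs row
-- exists in the given scdate range.
DELETE FROM pa
WHERE pa.field1 IS NULL
  AND EXISTS (SELECT 1
              FROM legs
              WHERE legs.sc_id = pa.sc_id
                AND legs.scdate BETWEEN 20160220 AND 20160222);
```

If the planner still prefers a seq scan on pa, a partial index such as `CREATE INDEX ON pa (sc_id) WHERE field1 IS NULL` is one assumption-laden option to experiment with, since roughly two thirds of pa satisfies the NULL predicate.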