Search Postgresql Archives

Re: Query planner riddle (array-related?)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Tom,

On Fri, May 04, 2018 at 09:32:08AM -0400, Tom Lane wrote:
> Markus <m@xxxxxxx> writes:
> > I'm running Postgresql 9.6.7 on Debian stretch -- and I'm trying to
> > understand a query plan, with any hint where to gain further insight
> > welcome.
> 
> Well, you say
> 
> >   select count(*) from gaia.dr2light where parallax>50;
> > gives 5400 rows in no time.
> 
> but the planner thinks there are 12991627 such rows:

Ah... yeah, the parallax distribution is fairly sharply peaked around
0, so >50 might be severely off.

So, I've run

  alter table gaia.dr2light alter parallax set statistics 1000;
  analyze gaia.dr2light;

and lo and behold, the both queries become a good deal faster (a
couple of seconds).  That's essentially enough to make me happy --
thanks!

> Also, this sort of thing is usually much easier to diagnose from
> EXPLAIN ANALYZE output.  All we can see from these queries is that
> the planner picked what it thought was the lowest-cost plan.  Without
> actual rowcounts it's very hard to guess why the estimates were wrong.
> You happened to provide one actual-rowcount number that maybe was
> enough to diagnose the issue; but if the above doesn't do the trick,
> we're going to need to see EXPLAIN ANALYZE to guess what else is up.

With this, the query plans converge to trivial variations of

 Hash Join  (cost=253856.23..4775113.84 rows=422 width=1647) (actual time=1967.095..2733.109 rows=18 loops=1)
   Hash Cond: (dr2light.source_id = dr2epochflux.source_id)
   ->  Bitmap Heap Scan on dr2light  (cost=24286.88..4385601.28 rows=1297329 width=132) (actual time=3.113..19.346 rows=5400 loops=1)
         Recheck Cond: (parallax > '50'::double precision)
         Heap Blocks: exact=5184
         ->  Bitmap Index Scan on dr2light_parallax  (cost=0.00..23962.54 rows=1297329 width=0) (actual time=1.721..1.721 rows=5400 loops=1)
               Index Cond: (parallax > '50'::double precision)
   ->  Hash  (cost=118285.38..118285.38 rows=551038 width=1523) (actual time=1885.177..1885.177 rows=550737 loops=1)
         Buckets: 65536  Batches: 16  Memory Usage: 53292kB
         ->  Seq Scan on dr2epochflux  (cost=0.00..118285.38 rows=551038 width=1523) (actual time=0.008..430.692 rows=550737 loops=1)
 Planning time: 6.504 ms
 Execution time: 2733.653 ms

(with 10% or so jitter in the actual times, obviously); this one is
for

  SELECT *
  FROM gaia.dr2epochflux
  JOIN gaia.dr2light
  USING (source_id)
  WHERE parallax>50


While that's a reasonable plan and fast enough, I'd still like to
keep the box from seqscanning dr2epochflux with its large arrays and
use that table's index on source_id.  If I read the query plan right,
part of the problem is that it still massively overestimates the
result of parallax>50 (1297329 rather than 5400).  Is there anything
I can do to improve that estimate?

But even with that suboptimal estimate, postgres, under the
assumption of something not too far from a uniform distribution on
source_id, should end up estimating the cardinality of the end result
as something like

(selectivity on dr2light)*(cardinality of dr2epochflux),

and hence roughly (1297329/1.6e9*5e5)=405 rows to be drawn from
dr2epochflux.  It would seem a lot smarter to just pull these few 1e2
rows using the source_id index on dr2epochflux than seqscanning that
table, no?

Btw., I've raised the statistics target on dr2light to 1000 for
source_id; so, postgres should know source_id isn't uniformly
distributed, but it should also see it's not *that* far away from it.


           -- Markus





[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]

  Powered by Linux