Hi Tom, On Fri, May 04, 2018 at 09:32:08AM -0400, Tom Lane wrote: > Markus <m@xxxxxxx> writes: > > I'm running Postgresql 9.6.7 on Debian stretch -- and I'm trying to > > understand a query plan, with any hint where to gain further insight > > welcome. > > Well, you say > > > select count(*) from gaia.dr2light where parallax>50; > > gives 5400 rows in no time. > > but the planner thinks there are 12991627 such rows: Ah... yeah, the parallax distribution is fairly sharply peaked around 0, so >50 might be severely off. So, I've run alter table gaia.dr2light alter parallax set statistics 1000; analyze gaia.dr2light; and lo and behold, the both queries become a good deal faster (a couple of seconds). That's essentially enough to make me happy -- thanks! > Also, this sort of thing is usually much easier to diagnose from > EXPLAIN ANALYZE output. All we can see from these queries is that > the planner picked what it thought was the lowest-cost plan. Without > actual rowcounts it's very hard to guess why the estimates were wrong. > You happened to provide one actual-rowcount number that maybe was > enough to diagnose the issue; but if the above doesn't do the trick, > we're going to need to see EXPLAIN ANALYZE to guess what else is up. With this, the query plans converge to trivial variations of Hash Join (cost=253856.23..4775113.84 rows=422 width=1647) (actual time=1967.095..2733.109 rows=18 loops=1) Hash Cond: (dr2light.source_id = dr2epochflux.source_id) -> Bitmap Heap Scan on dr2light (cost=24286.88..4385601.28 rows=1297329 width=132) (actual time=3.113..19.346 rows=5400 loops=1) Recheck Cond: (parallax > '50'::double precision) Heap Blocks: exact=5184 -> Bitmap Index Scan on dr2light_parallax (cost=0.00..23962.54 rows=1297329 width=0) (actual time=1.721..1.721 rows=5400 loops=1) Index Cond: (parallax > '50'::double precision) -> Hash (cost=118285.38..118285.38 rows=551038 width=1523) (actual time=1885.177..1885.177 rows=550737 loops=1) Buckets: 65536 Batches: 16 Memory Usage: 53292kB -> Seq Scan on dr2epochflux (cost=0.00..118285.38 rows=551038 width=1523) (actual time=0.008..430.692 rows=550737 loops=1) Planning time: 6.504 ms Execution time: 2733.653 ms (with 10% or so jitter in the actual times, obviously); this one is for SELECT * FROM gaia.dr2epochflux JOIN gaia.dr2light USING (source_id) WHERE parallax>50 While that's a reasonable plan and fast enough, I'd still like to keep the box from seqscanning dr2epochflux with its large arrays and use that table's index on source_id. If I read the query plan right, part of the problem is that it still massively overestimates the result of parallax>50 (1297329 rather than 5400). Is there anything I can do to improve that estimate? But even with that suboptimal estimate, postgres, under the assumption of something not too far from a uniform distribution on source_id, should end up estimating the cardinality of the end result as something like (selectivity on dr2light)*(cardinality of dr2epochflux), and hence roughly (1297329/1.6e9*5e5)=405 rows to be drawn from dr2epochflux. It would seem a lot smarter to just pull these few 1e2 rows using the source_id index on dr2epochflux than seqscanning that table, no? Btw., I've raised the statistics target on dr2light to 1000 for source_id; so, postgres should know source_id isn't uniformly distributed, but it should also see it's not *that* far away from it. -- Markus