Markus <m@xxxxxxx> writes: > Ah... yeah, the parallax distribution is fairly sharply peaked around > 0, so >50 might be severely off. > So, I've run > alter table gaia.dr2light alter parallax set statistics 1000; > analyze gaia.dr2light; > With this, the query plans converge to trivial variations of > Hash Join (cost=253856.23..4775113.84 rows=422 width=1647) (actual time=1967.095..2733.109 rows=18 loops=1) > Hash Cond: (dr2light.source_id = dr2epochflux.source_id) > -> Bitmap Heap Scan on dr2light (cost=24286.88..4385601.28 rows=1297329 width=132) (actual time=3.113..19.346 rows=5400 loops=1) > -> Hash (cost=118285.38..118285.38 rows=551038 width=1523) (actual time=1885.177..1885.177 rows=550737 loops=1) > While that's a reasonable plan and fast enough, I'd still like to > keep the box from seqscanning dr2epochflux with its large arrays and > use that table's index on source_id. If I read the query plan right, > part of the problem is that it still massively overestimates the > result of parallax>50 (1297329 rather than 5400). Is there anything > I can do to improve that estimate? Raise the parallax stats target still higher, perhaps. I think we let you set it as high as 10000. > But even with that suboptimal estimate, postgres, under the > assumption of something not too far from a uniform distribution on > source_id, should end up estimating the cardinality of the end result > as something like > (selectivity on dr2light)*(cardinality of dr2epochflux), > and hence roughly (1297329/1.6e9*5e5)=405 rows to be drawn from > dr2epochflux. It would seem a lot smarter to just pull these few 1e2 > rows using the source_id index on dr2epochflux than seqscanning that > table, no? No. Given the above estimates, it would have to do 1297329 index lookups in dr2epochflux, which is not going to be a win compared to 1297329 probes into an in-memory hash table. Even with a dead-accurate estimate of 5400 dr2light rows to be joined, I don't think an inner indexscan is necessarily a better plan than a hash. It's the number of probes that matter, not the number of successful probes. (It's not clear to me why so few of the dr2light rows have join partners, but the planner does seem to understand that most of them don't.) You say you're worried about "large arrays" in dr2epochflux; but if they are large enough to be toasted out-of-line, it's really a nonissue. Only the toast pointers would be read during the seqscan or stored in the hash. regards, tom lane