Re: query not scaling

Rob Sargent <robjsargent@xxxxxxxxx> · Tue, 31 Oct 2017 10:16:14 -0600

On 10/31/2017 03:12 AM, Laurenz Albe wrote:
Rob Sargent wrote:
I think your biggest problem is the join condition
     on m.basepos between s.startbase and s.endbase

That forces a nested loop join, which cannot be performed efficiently.
Agree! 800,000 * 4,000 = 3,200,000,000.  It's just that I thought I had
corralled that problem which indexing but apparently not.  I was hoping
some kind soul might point out a missing index or similar correction. I
have completely reworked the process, but not sure yet if it's correct.
(The slow answer is correct, once it comes in.)
You can create indexes that are useful for this query:

    ON sui.segment(chrom, markerset_id)
    ON sui.probandset(people_id)

But that probably won't make a big difference, because the sequential
scans take only a small fraction of your query time.

A little less than half of the query time is spent in the nested loop
join, and a little more than half of the time is spent doing the
GROUP BY.

Perhaps the biggest improvement you can easily make would be to
get rid of "numeric" for the computation.  I suspect that this is
where a lot of time is spent, since the hash aggregate is over
less than 15000 rows.

Unless you really need the precision of "numeric", try

CREATE OR REPLACE FUNCTION pv(l bigint, e bigint, g bigint, o int)
    RETURNS double precision LANGUAGE sql AS
$$SELECT (g+e+o)::double precision / (l+e+g+o)::double precision$$;

Yours,
Laurenz Albe
In practice markersets are always aligned with one chromosome so I would 
not expect this to have an effect.  There's no constraint on this 
however, and there can be more than one markerset per chromosome.  I 
have played with indexing on segment.markerset_id.

In all the data sets used in the examples (runtimes, explains etc) there 
has been a in single people_id across the existing segment data.  Down 
the road this of course will not be the case and I can see the value of 
an index on probandset.people_id eventually. I can certainly add it now 
for a test.  I'm currently writing a probandset loader hoping to get a 
test case for the problem with gin indexing mentioned up-thread.

I think I'm most surprise at the notion that the arithmetic is the 
problem and will happily test your suggestion to force floating point 
values.  The value can get small (10^-12 on a good day!) but we don't 
need many digits of precision.

Thanks

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general