Sorry for the delay; back on this, and thanks for the response.

On Mar 11, 2014, at 6:23 PM, Tom Lane <tgl@xxxxxxxxxxxxx> wrote:

> "Murphy, Kevin" <MURPHYKE@xxxxxxxxxxxxxx> writes:
>> Synopsis: 8-table join with one "WHERE foo IN (...)" condition; works OK with fewer
>> than 5 items in the IN list, but at N=5, the planner starts using a compound index
>> for the first time that completely kills performance (5-6 minutes versus 0-12 seconds).
>> […]
>
> FWIW, I think the right question here is not "why is the slow query
> slow?", but "why is the fast query fast?".
> […]
> It looks like the trouble spot is this intermediate nested loop:
>
>>   -> Nested Loop  (cost=4.32..283545.98 rows=80929 width=12) (actual time=163.609..571.237 rows=102 loops=1)
>>        Buffers: shared hit=419 read=63
>>        -> Nested Loop  (cost=4.32..3426.09 rows=471 width=4) (actual time=93.595..112.404 rows=85 loops=1)
>>           ...
>>        -> Index Scan using sample_result_variant_id on sample_result  (cost=0.00..593.01 rows=172 width=8) (actual time=5.147..5.397 rows=1 loops=85)
>>              Index Cond: (variant_id = variant_effect.variant_id)
>>              Buffers: shared hit=400 read=42
>
> which is creating the bulk of the estimated cost for the whole plan,
> but execution is actually pretty cheap.  There seem to be two components
> to the misestimation: one is that the sub-nested loop is producing about a
> fifth as many rows as expected,

This may be because 3 out of the 4 user-supplied gene symbols were not present in the gene table at all. Restricting the query to valid gene symbols beforehand is probably a good idea (a rough sketch of what I mean is at the end of this message).

> and the other is that the probes into
> sample_result are producing (on average) 1 row, not the 172 rows the
> planner expects.  If you could get the latter estimate to be even within
> one order of magnitude of reality, the planner would certainly see this
> plan as way cheaper than the other.

I'm not sure how to improve that estimate. This data doesn't change very often (it hasn't in many weeks). There are 6M variants, but only 7.5% of them map to the sample_result table; presumably the planner knows this because of the n_distinct value on sample_result.variant_id? Each variant maps to zero or more sample_result records, but often very few, and never more than the number of samples (currently 1129).

> So I'm wondering if the stats on sample_result and variant_effect are up
> to date.  If they are, you might try increasing the stats targets for the
> variant_id columns.

The stats were up to date and were at 5K globally. I tried increasing the stats targets on the foreign key columns involved to 10K (and analyzing), but the same costs and plan are in play. I know the stats are current now because I dumped and restored onto new hardware and ran a vacuum analyze. As I mentioned earlier, some of the vanilla n_distinct values were way off for the (790M-row) sample_result table, so I have taken to coercing n_distinct using negative multipliers (commands sketched at the end of this message).
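
For concreteness, the kind of statements I mean are sketched below. The columns are the variant_id join keys from the plan above, and the 10000 target is the one I tried; the n_distinct setting is only an illustration worked from the counts I gave (roughly 6M * 7.5% ≈ 450K distinct variant_ids over ~790M rows), not necessarily the exact value I set:

    -- Raise the per-column statistics target for the join keys (10000 is the maximum):
    ALTER TABLE sample_result  ALTER COLUMN variant_id SET STATISTICS 10000;
    ALTER TABLE variant_effect ALTER COLUMN variant_id SET STATISTICS 10000;

    -- Coerce n_distinct with a negative multiplier: a negative value is read as a
    -- fraction of the table's row count, so it scales as the table grows.
    -- Illustrative only: ~450K distinct variant_ids / ~790M rows ≈ 0.00057.
    ALTER TABLE sample_result ALTER COLUMN variant_id SET (n_distinct = -0.00057);

    -- Re-analyze so the new targets and the n_distinct override take effect:
    ANALYZE sample_result;
    ANALYZE variant_effect;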
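
And for the gene filtering, something along these lines is what I had in mind, assuming the gene table's symbol column is named "symbol" (the symbols shown are made-up placeholders); the surviving rows would then feed the IN list of the big query:

    -- Keep only the user-supplied symbols that actually exist in the gene table:
    SELECT g.symbol
    FROM gene AS g
    WHERE g.symbol IN ('SYMBOL1', 'SYMBOL2', 'SYMBOL3', 'SYMBOL4');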
Regards,
Kevin