Re: Slow query and wrong row estimates for CTE

Dane Foster <studdugie@xxxxxxxxx> · Wed, 17 Feb 2021 15:32:23 -0500

On Wed, Feb 17, 2021 at 1:37 PM Yoan SULTAN <yeyeyo@xxxxxxxxx> wrote:
Hi all, 

This is my first post on this mailing list, I really enjoy it.
I wanted to add some details and answers to this disccusion.
I'm happy you've decided to join the conversation and about the fact that you've opened up an entirely new avenue for me to investigate and learn from. I feel like I'm about to level up my SQL-fu! 😊

 17 févr. 2021 à 17:52, Dane Foster <studdugie@xxxxxxxxx> a écrit :

A small update (see below/inline).

On Tue, Feb 16, 2021 at 2:11 PM Dane Foster <studdugie@xxxxxxxxx> wrote:
Short conclusion:
Switching from CTEs to temporary tables and analyzing reduced the runtime from 15 minutes to about 1.5 minutes.

The attempt_scores table is pretty big and it is called 3 times; try to rewrite the query in order to reduce call to this table. for example :
EXPLAIN (ANALYZE, BUFFERS)
WITH reports AS (
  SELECT student_id, assignment_id, max(score_value) FILTER (WHERE score_name = 'student_performance_index'),
                                    max(attempt_report_id) maxid, 
                                    max(score_value) spi
  FROM attempt_scores
  GROUP BY student_id, assignment_id
  HAVING max(score_value) > 0 AND max(score_value) FILTER (WHERE score_name = 'student_performance_index') = max(score_value)
)
SELECT
  avg(spi) spi,
  avg(CASE score_name WHEN 'digital_clinical_experience' THEN score_value END) dce,
  avg(CASE score_name WHEN 'tier1_subjective_data_collection' THEN score_value END) sdc
FROM attempt_scores JOIN reports ON reports.maxid=attempt_scores.attempt_report_id;
Given: HAVING max(score_value) > 0 AND max(score_value) FILTER (WHERE score_name = 'student_performance_index') = max(score_value)

Why: max(score_value) FILTER (WHERE score_name = 'student_performance_index') but no FILTER clause on: max(attempt_report_id)?

Some context for my question. I'm new to aggregate expressions therefore I don't have a strong mental model for what's happening. So let me tell you what I think is happening and you can correct me.

 The new HAVING clause that you've added ensures that for each student/assignment pair/group that we are selecting the max spi value (i.e., score_name = 'student_performance_index'). Therefore, isn't the FILTER clause in the SELECT section redundant? And if it's not redundant then why isn't it necessary for: max(attempt_report_id)?

Also, I would continue to increase work_mem to 200MB until the external merge is not required.
SET WORK_MEM='200MB'; -- to change only at session level

Longer conclusion:

@Justin Pryzby
I experimented w/ materializing the CTEs and it helped at the margins but did not significantly contribute to a reduction in runtime.
No clustering was required because once I switched to temporary tables the new plan no longer used the for_upsert index.
@Michael Lewis
Increasing work_mem to 100MB (up from 4MB) helped at the margins (i.e., some 100's of millisecond improvement) but did not represent a significant reduction in the runtime.
It wasn't obvious to me which window function would be appropriate for the problem I was trying to solve therefore I didn't experiment w/ that approach.
I want to update/correct this statement: 
The selectivity of score_name='student_performance_index' was not enough for the planner to choose an index over doing a FTS.
I added a partial index (WHERE score_name='student_performance_index'::citext) and that had a dramatic impact. That part of the query went from ~12 seconds to ~1 second.

Another way to generate perf. gains on this query,  CREATE HASH INDEX ON attempt_scores(score_name); --since score_name doesn't seem to have a big cardinality

Finally, thank you both for helping me bring this poor performing query to heel. Your insights were helpful and greatly appreciated.

Sincerely,

Dane

On Tue, Feb 16, 2021 at 10:25 AM Dane Foster <studdugie@xxxxxxxxx> wrote:

On Tue, Feb 16, 2021 at 10:13 AM Michael Lewis <mlewis@xxxxxxxxxxx> wrote:
                                                   Sort Method: external merge  Disk: 30760kB
                                                   Worker 0:  Sort Method: external merge  Disk: 30760kB
                                                   Worker 1:  Sort Method: external merge  Disk: 30760kB

If you can increase work_mem, even setting it temporarily higher for the session or transaction, that may dramatically change the plan.
I will try increasing work_mem for the session later today.
 The advice given by Justin particularly about row estimates  would be wise to pursue.

 I'd wonder how selective that condition of score_name = 'student_performance_index' is in filtering out many of the 9.3 million tuples in that table and if an index with that as the leading column, or just an index on that column would be helpful.
There are 1,206,355 rows where score_name='student_performance_idex'.
 You'd need to look at pg_stats for the table and see how many distinct values, and if student_performance_index is relatively high or low (or not present) in the MCVs list.
I will look into that.

I am not sure if your query does what you want it to do as I admit I didn't follow your explanation of the desired behavior. My hunch is that you want to make use of a window function and get rid of one of the CTEs.
If you could tell me what part(s) are unclear I would appreciate it so that I can write a better comment. 

Thank you sooo much for all the feedback. It is greatly appreciated!
Sincerely,

Dane

-- 
Regards,
Yo.
Again, thanks for joining the conversation. I look forward to hearing from you.

Sincerely,

Dane