RE: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4

"ldh@xxxxxxxxxxxxxxxxxx" <ldh@xxxxxxxxxxxxxxxxxx> · Sat, 21 Aug 2021 08:55:37 +0000

OK… I apologize for the long email before. Right after I sent it, I thought of a much simpler use-case to illustrate the issue which doesn’t depend on any special data I have access o and complex pivoting. It’s as raw as I can make it.

I create a table with 1M rows and 2 columns. Column “a” is a random string, while column “b” is a random integer as a string. Then I use a UDF that converts strings to floats and handles an exception if the incoming string is not parsable
 as a float. Then I do a simple select of each column. In the “a” case, the UDF throws and catches lots of exceptions. In the “b” case, the conversion is clean and exceptions are not thrown.

create table sampletest (a varchar, b varchar);

insert into sampletest (a, b)
select substr(md5(random()::text), 0, 15), (100000000*random())::integer::varchar
from generate_series(1,1000000);

CREATE OR REPLACE FUNCTION toFloat(str varchar, val real)
RETURNS real AS $$
BEGIN
  RETURN case when str is null then val else str::real end;
EXCEPTION WHEN OTHERS THEN
  RETURN val;
END;
$$ LANGUAGE plpgsql COST 1 IMMUTABLE;

select MAX(toFloat(a, null)) as "a" from sampletest;

select MAX(toFloat(b, null)) as "b" from sampletest;

On purpose, I am doing a max(toFloat) instead of toFloat(max) to exercise the UDF 1M times.

V13.4 “a” scenario (exceptions)
-------------------------------------------------------------
Aggregate  (cost=14778.40..14778.41 rows=1 width=4) (actual time=774098.537..774098.538 rows=1 loops=1)
  Buffers: shared hit=6373
  ->  Seq Scan on sampletest  (cost=0.00..11975.60 rows=560560 width=32) (actual time=0.011..285.458 rows=1000000 loops=1)
        Buffers: shared hit=6370
Planning Time: 0.066 ms
Execution Time: 774,098.563 ms

V13.4 “b” scenario (no exceptions)
-------------------------------------------------------------
Aggregate  (cost=14778.40..14778.41 rows=1 width=4) (actual time=1510.200..1510.201 rows=1 loops=1)
  Buffers: shared hit=6385
  ->  Seq Scan on sampletest  (cost=0.00..11975.60 rows=560560 width=32) (actual time=0.024..115.196 rows=1000000 loops=1)
        Buffers: shared hit=6370
Planning:
  Buffers: shared hit=26
Planning Time: 0.361 ms
Execution Time: 1,530.659 ms

V11.2 “a” scenario (exceptions)
-------------------------------------------------------------
Aggregate  (cost=21658.00..21658.01 rows=1 width=4) (actual time=26528.286..26528.286 rows=1 loops=1)
  Buffers: shared hit=6393
  ->  Seq Scan on sampletest  (cost=0.00..16562.00 rows=1019200 width=15) (actual time=0.037..190.633 rows=1000000 loops=1)
        Buffers: shared hit=6370
Planning Time: 1.182 ms
Execution Time: 26,530.492 ms

V11.2 “b” scenario (no exceptions)
-------------------------------------------------------------
Aggregate  (cost=21658.00..21658.01 rows=1 width=4) (actual time=1856.116..1856.116 rows=1 loops=1)
  Buffers: shared hit=6370
  ->  Seq Scan on sampletest  (cost=0.00..16562.00 rows=1019200 width=8) (actual time=0.014..88.152 rows=1000000 loops=1)
        Buffers: shared hit=6370
Planning Time: 0.098 ms
Execution Time: 1,856.152 ms

Summary:

Scenario V11.2/a: 26.6s
Scenario V11.2/b: 1.9s
Scenario V13.4/a: 774.1s
Scenario V13.4/b: 1.5s

Conclusion:

The no-exception scenario performs 20% better on 13.4 vs 11.2 (nice for a straight scan!)
On 11.2, exceptions add an overhead of over 14x (1.9s vs 26.6s). I did not expect exceptions to add such a large overhead. Why is that?
Between 11.2 and 13.4, the no-exceptions scenario “b” performs 30x slower (26.6s vs 774.1s).

Thank you!
Laurent Hasson.

From: ldh@xxxxxxxxxxxxxxxxxx <ldh@xxxxxxxxxxxxxxxxxx>

Sent: Saturday, August 21, 2021 03:57

To: pgsql-performance@xxxxxxxxxxxxxx

Subject: Big Performance drop of Exceptions in UDFs between V11.2 and 13.4

Hello all,

I think I have identified a major performance issue between V11.2 and 13.4 with respect to exception handling in UDFs. I have the following simplified query that pivots data and makes use of a UDF to convert data
 to a specific type, in this case, float:

select "iccqa_iccassmt_fk" 
     , Tilda.toFloat(MAX("iccqar_ans_val") filter (where "iccqar_ques_code"= 'DEPTH (CM)') ,null) as "iccqa_DEPTH_CM"
     , Tilda.toFloat(MAX("iccqar_ans_val") filter (where "iccqar_ques_code"= 'LENGTH (CM)') ,null) as "iccqa_LENGTH_CM"
     , Tilda.toFloat(MAX("iccqar_ans_val") filter (where "iccqar_ques_code"= 'WIDTH (CM)') ,null) as "iccqa_WIDTH_CM"
     , Tilda.toFloat(MAX("iccqar_ans_val") filter (where "iccqar_ques_code"= 'DRAIN PRESENT') ,null) as "iccqa_DRAIN_PRESENT"
     , Tilda.toFloat(MAX("iccqar_ans_val") filter (where "iccqar_ques_code"= 'MEASUREMENTS TAKEN') ,null) as "iccqa_MEASUREMENTS_TAKEN"
     , Tilda.toFloat(MAX("iccqar_ans_val") filter (where "iccqar_ques_code"= 'SIGNS AND SYMPTOMS OF INFECTION') ,null) as "iccqa_SIGNS_AND_SYMPTOMS_OF_INFECTION"
from  (
-- 'A pivoted view of ICC QA assessments'
select VNAHGEDW_FACTS.AssessmentICCQA_Raw."iccqar_iccassmt_fk" as "iccqa_iccassmt_fk" -- The key identifying an ICC assessment.
     , VNAHGEDW_FACTS.AssessmentICCQA_Raw."iccqar_ques_code" as "iccqar_ques_code" -- The question long code from the meta-data.
     , max(VNAHGEDW_FACTS.AssessmentICCQA_Raw."iccqar_ans_val") as "iccqar_ans_val" -- The official answer, if applicable) from the meta-data.
  from VNAHGEDW_FACTS.AssessmentICCQA_Raw
where VNAHGEDW_FACTS.AssessmentICCQA_Raw."iccqar_ques_code" in ('DEPTH (CM)', 'LENGTH (CM)', 'WIDTH (CM)'
                                                               , 'DRAIN PRESENT', 'MEASUREMENTS TAKEN', 'SIGNS AND SYMPTOMS OF INFECTION'
                                                               ) group by 1, 2
) T
     group by 1
;

The UDF is simple as follows:

CREATE OR REPLACE FUNCTION TILDA.toFloat(str varchar, val real)
RETURNS real AS $$
BEGIN
  RETURN case when str is null then val else str::real end;
EXCEPTION WHEN OTHERS THEN
  RETURN val;
END;
$$ LANGUAGE plpgsql COST 1 IMMUTABLE;

It works as a coalesce but with a conversion. I think I have identified some large performance difference with the exception handling. It so happens that with the last 3 columns ('DRAIN PRESENT', 'MEASUREMENTS TAKEN'
 and 'SIGNS AND SYMPTOMS OF INFECTION'), the data is VERY dirty. There is a mix of 0/1, YES/NO, and other mistyped stuff. This means these 3 columns throw lots of exceptions in the UDF. To illustrate, I simply break this into 2 queries.

select "iccqa_iccassmt_fk" 
     , (Tilda.toFloat(MAX("iccqar_ans_val") filter (where "iccqar_ques_code"= 'DEPTH (CM)') ,null))::real as "iccqa_DEPTH_CM"
     , (Tilda.toFloat(MAX("iccqar_ans_val") filter (where "iccqar_ques_code"= 'LENGTH (CM)') ,null))::real as "iccqa_LENGTH_CM"
     , (Tilda.toFloat(MAX("iccqar_ans_val") filter (where "iccqar_ques_code"= 'WIDTH (CM)') ,null))::real as "iccqa_WIDTH_CM"
--     , (Tilda.toFloat(MAX("iccqar_ans_val") filter (where "iccqar_ques_code"= 'DRAIN PRESENT') ,null))::real as "iccqa_DRAIN_PRESENT"
--     , (Tilda.toFloat(MAX("iccqar_ans_val") filter (where "iccqar_ques_code"= 'MEASUREMENTS TAKEN') ,null))::real as "iccqa_MEASUREMENTS_TAKEN"
--     , (Tilda.toFloat(MAX("iccqar_ans_val") filter (where "iccqar_ques_code"= 'SIGNS AND SYMPTOMS OF INFECTION') ,null))::real as "iccqa_SIGNS_AND_SYMPTOMS_OF_INFECTION"
from  (
-- 'A pivoted view of ICC QA assessments'
select VNAHGEDW_FACTS.AssessmentICCQA_Raw."iccqar_iccassmt_fk" as "iccqa_iccassmt_fk" -- The key identifying an ICC assessment.
     , VNAHGEDW_FACTS.AssessmentICCQA_Raw."iccqar_ques_code" as "iccqar_ques_code" -- The question long code from the meta-data.
     , max(VNAHGEDW_FACTS.AssessmentICCQA_Raw."iccqar_ans_val") as "iccqar_ans_val" -- The official answer, if applicable) from the meta-data.
  from VNAHGEDW_FACTS.AssessmentICCQA_Raw
where VNAHGEDW_FACTS.AssessmentICCQA_Raw."iccqar_ques_code" in ('DEPTH (CM)', 'LENGTH (CM)', 'WIDTH (CM)'
                                                               , 'DRAIN PRESENT', 'MEASUREMENTS TAKEN', 'SIGNS AND SYMPTOMS OF INFECTION'
                                                               )
group by 1, 2
) T
     group by 1
;

The performance is as expected.

HashAggregate  (cost=448463.70..448467.20 rows=200 width=16) (actual time=6760.797..9585.397 rows=677899 loops=1)
  Group Key: assessmenticcqa_raw.iccqar_iccassmt_fk
  Batches: 1  Memory Usage: 147489kB
  Buffers: shared hit=158815
  ->  HashAggregate  (cost=405997.87..417322.09 rows=1132422 width=56) (actual time=4576.514..5460.770 rows=2374628 loops=1)
        Group Key: assessmenticcqa_raw.iccqar_iccassmt_fk, assessmenticcqa_raw.iccqar_ques_code
        Batches: 1  Memory Usage: 368657kB
        Buffers: shared hit=158815
        ->  Seq Scan on assessmenticcqa_raw  (cost=0.00..388224.53 rows=2369779 width=38) (actual time=0.033..3298.544 rows=2374628 loops=1)
              Filter: ((iccqar_ques_code)::text = ANY ('{"DEPTH (CM)","LENGTH (CM)","WIDTH (CM)","DRAIN PRESENT","MEASUREMENTS TAKEN","SIGNS AND SYMPTOMS OF INFECTION"}'::text[]))
              Rows Removed by Filter: 10734488
              Buffers: shared hit=158815
Planning:
  Buffers: shared hit=3
Planning Time: 0.198 ms
Execution Time: 9678.120 ms

However, once we switch with the three “bad” columns, the results fall apart.

select "iccqa_iccassmt_fk" 
--     , (Tilda.toFloat(MAX("iccqar_ans_val") filter (where "iccqar_ques_code"= 'DEPTH (CM)') ,null))::real as "iccqa_DEPTH_CM"
--     , (Tilda.toFloat(MAX("iccqar_ans_val") filter (where "iccqar_ques_code"= 'LENGTH (CM)') ,null))::real as "iccqa_LENGTH_CM"
--     , (Tilda.toFloat(MAX("iccqar_ans_val") filter (where "iccqar_ques_code"= 'WIDTH (CM)') ,null))::real as "iccqa_WIDTH_CM"
     , (Tilda.toFloat(MAX("iccqar_ans_val") filter (where "iccqar_ques_code"= 'DRAIN PRESENT') ,null))::real as "iccqa_DRAIN_PRESENT"
     , (Tilda.toFloat(MAX("iccqar_ans_val") filter (where "iccqar_ques_code"= 'MEASUREMENTS TAKEN') ,null))::real as "iccqa_MEASUREMENTS_TAKEN"
     , (Tilda.toFloat(MAX("iccqar_ans_val") filter (where "iccqar_ques_code"= 'SIGNS AND SYMPTOMS OF INFECTION') ,null))::real as "iccqa_SIGNS_AND_SYMPTOMS_OF_INFECTION"
from  (
-- 'A pivoted view of ICC QA assessments'
select VNAHGEDW_FACTS.AssessmentICCQA_Raw."iccqar_iccassmt_fk" as "iccqa_iccassmt_fk" -- The key identifying an ICC assessment.
     , VNAHGEDW_FACTS.AssessmentICCQA_Raw."iccqar_ques_code" as "iccqar_ques_code" -- The question long code from the meta-data.
     , max(VNAHGEDW_FACTS.AssessmentICCQA_Raw."iccqar_ans_val") as "iccqar_ans_val" -- The official answer, if applicable) from the meta-data.
  from VNAHGEDW_FACTS.AssessmentICCQA_Raw
where VNAHGEDW_FACTS.AssessmentICCQA_Raw."iccqar_ques_code" in ('DEPTH (CM)', 'LENGTH (CM)', 'WIDTH (CM)'
                                                               , 'DRAIN PRESENT', 'MEASUREMENTS TAKEN', 'SIGNS AND SYMPTOMS OF INFECTION'
                                                               )
group by 1, 2
) T
     group by 1
;

The performance falls apart. It is a huge performance difference from ~10s to ~11mn and the only difference that I can think of is that the data is dirty which causes the exception path to be taken. The explain
 is:

HashAggregate  (cost=448463.70..448467.20 rows=200 width=16) (actual time=6672.921..696753.080 rows=677899 loops=1)
  Group Key: assessmenticcqa_raw.iccqar_iccassmt_fk
  Batches: 1  Memory Usage: 131105kB
  Buffers: shared hit=158815
  ->  HashAggregate  (cost=405997.87..417322.09 rows=1132422 width=56) (actual time=4574.918..5446.022 rows=2374628 loops=1)
        Group Key: assessmenticcqa_raw.iccqar_iccassmt_fk, assessmenticcqa_raw.iccqar_ques_code
        Batches: 1  Memory Usage: 368657kB
        Buffers: shared hit=158815
        ->  Seq Scan on assessmenticcqa_raw  (cost=0.00..388224.53 rows=2369779 width=38) (actual time=0.032..3300.616 rows=2374628 loops=1)
              Filter: ((iccqar_ques_code)::text = ANY ('{"DEPTH (CM)","LENGTH (CM)","WIDTH (CM)","DRAIN PRESENT","MEASUREMENTS TAKEN","SIGNS AND SYMPTOMS OF INFECTION"}'::text[]))
              Rows Removed by Filter: 10734488
              Buffers: shared hit=158815
Planning:
  Buffers: shared hit=3
Planning Time: 0.201 ms
Execution Time: 696868.845 ms

Now, on V11.2, the explain is:

HashAggregate  (cost=492171.36..492174.86 rows=200 width=16) (actual time=19322.522..50556.738 rows=743723 loops=1)
  Group Key: assessmenticcqa_raw.iccqar_iccassmt_fk
  Buffers: shared hit=11 read=174155 dirtied=13
  ->  HashAggregate  (cost=445458.43..457915.21 rows=1245678 width=56) (actual time=16260.015..17575.088 rows=2601088 loops=1)
        Group Key: assessmenticcqa_raw.iccqar_iccassmt_fk, assessmenticcqa_raw.iccqar_ques_code
        Buffers: shared read=174155 dirtied=13
        ->  Seq Scan on assessmenticcqa_raw  (cost=0.00..425803.93 rows=2620600 width=38) (actual time=0.126..14425.239 rows=2601088 loops=1)
              Filter: ((iccqar_ques_code)::text = ANY ('{"DEPTH (CM)","LENGTH (CM)","WIDTH (CM)","DRAIN PRESENT","MEASUREMENTS TAKEN","SIGNS AND SYMPTOMS OF INFECTION"}'::text[]))
              Rows Removed by Filter: 11778360
              Buffers: shared read=174155 dirtied=13
Planning Time: 36.121 ms
Execution Time: 50730.255 ms

So, we are seeing two issues:

-         
I think exception handling is significantly slower between V11.2 and v13.4. I see almost a 14x difference from 50s to 700s.

-         
Comparing the two queries on V11.2, the difference is 13s vs 50s. So even on V11.2, the exception handling adds a significant overhead which I was not expecting.

I’ll be happy to update my test cases and share additional info if needed.

Thank you,
Laurent Hasson.