Re: AWS RDS PostgreSQL CPU Spiking to 100%

Jeff Janes <jeff.janes@xxxxxxxxx> · Tue, 8 Sep 2020 12:05:24 -0400

On Tue, Sep 8, 2020 at 9:33 AM aditya desai <admad123@xxxxxxxxx> wrote:
Hi,We have an application where one of the APIs calling queries(attached) is spiking the CPU to 100% during load testing.
However, queries are making use of indexes(Bitmap Index and Bitmap Heap scan though). 

The CPU is there to be used.  Anything will use 100% of the CPU unless it runs into some other bottleneck first.

These queries are being called thousands of times. 

Over what time period?  At what concurrency level?

Application team says they have handled connection pooling from the Application side.

Did they do it correctly?  Are you seeing a lot of connections churning through?

1. Does DB need more CPU considering this kind of load? 

Is it currently running fast enough, or does it need to be faster?

2. Can the query be tuned further?

The query you show can't possibly generate the plan you show, so there is no way to know that.

3. Will connection pooling resolve the CPU Spike issues?

Not if the app-side pooling was done correctly.

Also pasting Query and plans below.

----------------------exampleCount 1. Without internalexamplecode-----------------------

lmp_examples=> explain analyze with exampleCount as ( select examplestatuscode from example j where 1=1 and j.countrycode = 'AD'   and j.facilitycode in ('ABCD') and j.internalexamplecode in ('005','006','007','005') and ((j.examplestartdatetime  between '2020-05-18 00:00:00' and '2020-08-19 00:00:00' ) or j.examplestartdatetime IS NULL )  group by j.examplestatuscode)
lmp_examples-> select js.examplestatuscode,COALESCE(count(*),0) stat_count from exampleCount jc right outer join examplestatus js on jc.examplestatuscode=js.examplestatuscode group by js.examplestatuscode ;
                                                                                                                                                                                 QUERY PLAN                                                                                                                                                     
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=79353.80..79353.89 rows=9 width=12) (actual time=88.847..88.850 rows=9 loops=1)
   Group Key: js.examplestatuscode
   CTE examplecount
     ->  HashAggregate  (cost=79352.42..79352.46 rows=4 width=4) (actual time=88.803..88.805 rows=5 loops=1)
           Group Key: j.examplestatuscode
           ->  Bitmap Heap Scan on example j  (cost=1547.81..79251.08 rows=40538 width=4) (actual time=18.424..69.658 rows=62851 loops=1)
                 Recheck Cond: ((((countrycode)::text = 'AD'::text) AND ((facilitycode)::text = 'ABCD'::text) AND ((internalexamplecode)::text = ANY ('{005,006,007,005}'::text[])) AND (examplestartdatetime >= '2020-05-18 00:00:00'::timestamp without time zone) AND (examplestartdatetime <= '2020-08-19 00:00:00'::timestamp without time zone)) OR (examplestartdatetime IS NULL))

Note that the parenthesization of the OR condition is different between the recheck, and the query itself.  So I think that either the query or the plan has not been presented accurately.  Please double check them.

Also, what version of PostgreSQL are you using?  In v12, the CTE gets optimized away entirely.

Cheers,

Jeff