Re: How to improve the performance of my SQL query?

Thank you very much for taking the time to reply to my question. 


> The problem is that none of the explains you sent match the description above. In the last one, where you forced the optimizer to use an index scan (SET enable_seqscan TO off;), the chosen index seems to be one composed of ms_cd and cd_ate (following your two-character column naming convention). There may be a couple of explanations for this:

> - One is that the index may not be exactly the same as described above;

The primary key of the tbl_sha table consists of several fields, and ms_cd is just one of them. I only described the definitions of the fields that are used in the query.


> - Another is the column order in the index. Because you have a composite index, the order of the columns matters, and it seems the order is (ms_cd, et_cd, etrys). I wonder if you could recreate this index with the order (ms_cd, etrys, et_cd) and run the same query;

The indexes of the TBL_SHA table are defined as follows.


CREATE INDEX index_search_01 ON mdb.TBL_SHA USING btree (MS_CD, ETRYS)

CREATE INDEX index_search_02 ON mdb.TBL_SHA USING btree (ET_CD, ETRYS)

CREATE INDEX index_search_03 ON mdb.TBL_SHA USING btree (MS_CD, ET_DAY, BK_CD, FR_CD, RM_CD)
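
(For reference, the index definitions can also be confirmed straight from the system catalog with something like the query below; pg_indexes is the standard catalog view, and the schema/table names are the ones used above.)

SELECT indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'mdb'
  AND tablename = 'tbl_sha';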


> You can also try to trick the optimizer. For example, what is the result (and explain) of the query below?

> WITH tbi (ry_cd) AS (

>     SELECT tbl_inf.ry_cd

>     FROM tbl_inf tbi

>     WHERE tbi.ms_cd = 'MLD009'

>         AND tbl_inf.ry_cd = '00000001'

> ) SELECT COUNT(et_cd) FROM tbl_sha tbs

>     JOIN tbi ON tbi.ry_cd = tbs.etrys

>     WHERE tbs.ms_cd = 'MLD009';

The SQL returned an error (the CTE body mixes the alias tbi with the table name tbl_inf), so I made the following modification and ran it again. Please refer to the execution plan below.

Changed:
     WHERE tbi.ms_cd = 'MLD009'
to:
     WHERE tbl_inf.ms_cd = 'MLD009'


QUERY PLAN

Limit  (cost=2668811.76..2668811.77 rows=1 width=8) (actual time=133555.074..133557.729 rows=1 loops=1)
  ->  Aggregate  (cost=2668811.76..2668811.77 rows=1 width=8) (actual time=133555.072..133557.726 rows=1 loops=1)
        ->  Nested Loop  (cost=1000.29..2664512.83 rows=1719572 width=9) (actual time=29657.638..133341.053 rows=2113500 loops=1)
              ->  Index Only Scan using tbl_inf_pkc on tbl_inf  (cost=0.29..8.31 rows=1 width=9) (actual time=1.316..1.321 rows=1 loops=1)
                    Index Cond: ((ms_cd = 'MLD009'::bpchar) AND (ry_cd = '00000001'::bpchar))
                    Heap Fetches: 1
              ->  Gather  (cost=1000.00..2647308.80 rows=1719572 width=18) (actual time=29656.318..132969.910 rows=2113500 loops=1)
                    Workers Planned: 2
                    Workers Launched: 2
                    ->  Parallel Seq Scan on tbl_sha tbs  (cost=0.00..2474351.60 rows=716488 width=18) (actual time=29654.184..132876.292 rows=704500 loops=3)
                          Filter: ((ms_cd = 'MLD009'::bpchar) AND (etrys = '00000001'::bpchar))
                          Rows Removed by Filter: 14678996
Planning Time: 0.164 ms
Execution Time: 133557.767 ms


> Well, adding more resources tends to improve performance, but it's usually not linear, and the improvement may not be as large as you want for the extra price you are paying. I would first try to understand the performance problem, because the "add more resources" approach may just delay it, and it tends to get worse over time as the dataset grows.

I strongly agree with your viewpoint, but I currently don't have a solution in mind for the problem.






At 2023-07-28 04:38:39, "Charly" <carlbsb@xxxxxxxxx> wrote:

Hi "gzh",

Based on the info you provided, I'm assuming you are trying to use the TBL_SHA primary key to do an index-only scan, since you mentioned above that you have:
> TBL_SHA
> ms_cd character(6) NOT NULL       -- PRIMARY KEY
> et_cd character(8)
> etrys character(8)
I'm assuming a composite index on those 3 columns here.

> TBL_INF
> ms_cd character(6) NOT NULL       -- PRIMARY KEY
> ry_cd character(8) NOT NULL       -- PRIMARY KEY
Here it's clearer that there is a composite index on those 2 columns.
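
For reference, a minimal sketch of what I'm assuming the TBL_INF definition looks like (the constraint name tbl_inf_pkc matches the primary-key index that shows up in the execution plan above; the rest is inferred from the column list):

CREATE TABLE tbl_inf (
    ms_cd character(6) NOT NULL,   -- part of the composite primary key
    ry_cd character(8) NOT NULL,   -- part of the composite primary key
    CONSTRAINT tbl_inf_pkc PRIMARY KEY (ms_cd, ry_cd)
);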

The problem is that none of the explains you sent match the description above. In the last one, where you forced the optimizer to use an index scan (SET enable_seqscan TO off;), the chosen index seems to be one composed of ms_cd and cd_ate (following your two-character column naming convention). There may be a couple of explanations for this:
 - One is that the index may not be exactly the same as described above;
 - Another is the column order in the index. Because you have a composite index, the order of the columns matters, and it seems the order is (ms_cd, et_cd, etrys). I wonder if you could recreate this index with the order (ms_cd, etrys, et_cd) and run the same query (a sketch follows right after this list);
 There may be other problems happening there, but those are the ones most evident from the description of the problem.
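
For example, a minimal sketch of that reordered index (the index name is just a placeholder; CONCURRENTLY lets the build run without blocking writes):

-- hypothetical name; reordered so the query's equality columns lead the index
CREATE INDEX CONCURRENTLY tbl_sha_ms_cd_etrys_et_cd_idx
    ON tbl_sha USING btree (ms_cd, etrys, et_cd);

With ms_cd as the leading column and etrys right after it, both equality predicates in your query can be used to descend the btree instead of being applied as filters afterwards.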

Taking a closer look at your query, I didn't really understand the reasoning for that subselect: it seems to exist only to check that the two relations have common items, or to validate a parent relationship. In that case you can probably use a join to get the same result:
 
SELECT COUNT(et_cd) FROM tbl_sha tbs
    JOIN tbl_inf tbi ON tbi.ry_cd = tbs.etrys AND tbi.ms_cd = tbs.ms_cd
WHERE tbi.ms_cd = 'MLD009'
    AND tbi.ry_cd = '00000001';
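
To compare it against the original query, the same statement can be wrapped in EXPLAIN to see the actual plan and buffer usage, e.g.:

-- same query as above, just prefixed to capture the actual plan
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(et_cd) FROM tbl_sha tbs
    JOIN tbl_inf tbi ON tbi.ry_cd = tbs.etrys AND tbi.ms_cd = tbs.ms_cd
WHERE tbi.ms_cd = 'MLD009'
    AND tbi.ry_cd = '00000001';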

You can also try to trick the optimizer. For example, what is the result (and explain) of the query below?

WITH tbi (ry_cd) AS (
    SELECT tbl_inf.ry_cd
    FROM tbl_inf tbi
    WHERE tbi.ms_cd = 'MLD009'
        AND tbl_inf.ry_cd = '00000001'
) SELECT COUNT(et_cd) FROM tbl_sha tbs
    JOIN tbi ON tbi.ry_cd = tbs.etrys
    WHERE tbs.ms_cd = 'MLD009';

You can alternatively try the CTE using both columns in the JOIN clause, for example:
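
A sketch of that variant, selecting both columns in the CTE and joining on both (same literals as above):

WITH tbi (ms_cd, ry_cd) AS (
    SELECT tbl_inf.ms_cd, tbl_inf.ry_cd
    FROM tbl_inf
    WHERE tbl_inf.ms_cd = 'MLD009'
        AND tbl_inf.ry_cd = '00000001'
) SELECT COUNT(et_cd) FROM tbl_sha tbs
    JOIN tbi ON tbi.ry_cd = tbs.etrys
        AND tbi.ms_cd = tbs.ms_cd;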

On Thu, 27 Jul 2023 at 05:10, gzh <gzhcoder@xxxxxxx> wrote:
Thank you for your suggestion. 
The database is hosted on AWS RDS, with the instance having vCPU=2 and RAM=4GB. 
Could the low performance be due to the low configuration of AWS RDS? 
We are considering trying a higher configuration instance.

Well, adding more resources tends to improve performance, but it's usually not linear, and the improvement may not be as large as you want for the extra price you are paying. I would first try to understand the performance problem, because the "add more resources" approach may just delay it, and it tends to get worse over time as the dataset grows.

I hope it helps.

--
Regards,

Charly Batista
Shanghai, China
Linux user #391083

“If you have an apple and I have an apple and we exchange these apples then you and I will still each have one apple. But if you have an idea and I have an idea and we exchange these ideas, then each of us will have two ideas."
      George Bernard Shaw (1856-1950)
