On Thu, Feb 14, 2019 at 12:43 PM Michael Lewis <mlewis@xxxxxxxxxxx> wrote:
How many total rows in these tables? I am assuming these are partitions and those 100 million rows got distributed. If the data difference is significant, then you aren't guaranteed similar performance. You may want to follow more of the suggested steps on.
Michael - Yes, correct, the 100 million rows of data are distributed across all the partitions.
Finally, I feel we have reached a conclusion: after we changed random_page_cost from 4 to 2 on the Non prod instance, we see improvements in the query, which went from taking about 1 minute down to 1-2 seconds.
That's correct, we have around 490 million rows in a few of our partition tables. A new partition table is created every 7 days.
We ran our API test, which hits the backend Aurora RDS PostgreSQL database, and our query response times and requests/sec are better than before. Do you recommend this setting on a Production instance? Right now we are planning to implement this change of random_page_cost = 2. (We know it also works if we turn off hash joins via enable_hashjoin, but we would rather set random_page_cost = 2 than make that change.)
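For reference, a session-level comparison along these lines is enough to reproduce what we are seeing before the value is changed in the DB cluster parameter group (this is only a sketch: the filter column and value in the query are placeholders, not our real predicate):

    SET random_page_cost = 2;
    EXPLAIN ANALYZE
    SELECT *
    FROM asset_info_2019_2_part4
    WHERE asset_id = 12345;      -- placeholder predicate, substitute the real filter
    RESET random_page_cost;      -- back to the parameter-group default (4)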
Questions.
1) What is your recommendation on this? Can we apply this change on the Prod instance, which is performing better today, or should we keep this change only on the Non prod instance? (Actually we want to implement this change on the Non Prod instance.)
2) Are there any downsides to changing this value?
3) Another question: we have 64 TB of storage and are running on an r4.16xlarge EC2 instance. A few of our partitions are bigger and a few of them are smaller. We have data from the last 9 months and we are planning to keep the data for close to 1 year, until May. Do you see any concerns with this? Eventually we are thinking of archiving this data in the next 2 months by dropping the older partitions (a sketch of what we have in mind is below these questions).
4) What could be the problems with keeping the data longer, given regular maintenance like VACUUM and other maintenance activities?
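To make questions 3 and 4 concrete, the cleanup and maintenance we have in mind per weekly partition looks roughly like this. This is only a sketch, assuming our weekly tables are inheritance children on 9.6; the parent table name location_data below is a placeholder, not necessarily our real parent name.

    -- Retire an expired weekly partition (parent name location_data is a placeholder)
    BEGIN;
    ALTER TABLE location_data_2018_12_part4 NO INHERIT location_data;
    DROP TABLE location_data_2018_12_part4;
    COMMIT;

    -- Routine per-partition maintenance (run against each active weekly table)
    VACUUM (VERBOSE, ANALYZE) asset_info_2019_2_part4;
    REINDEX TABLE asset_info_2019_2_part4;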
On Thu, Feb 14, 2019 at 8:48 AM github kran <githubkran@xxxxxxxxx> wrote:

On Wed, Feb 13, 2019 at 11:38 AM Michael Lewis <mlewis@xxxxxxxxxxx> wrote:

I didn't see your email yesterday, sorry about that. Index scans instead of sequential scans and nested loop instead of hash join means that you have bad row count estimates on the "Non prod Aurora RDS instance", as far as I can figure. Have you run commands like:
analyze asset_info_2019_2_part4;
analyze asset_info_2019_2_part2;
etc.? If the data are very similar, all indexes exist, and default_statistics_target is the same, then you should be getting the same plans.

1) Yes, we did VACUUM for all the tables like asset_info_2019_2_part1, part2, part4 and also for location_info_xxx to remove any dead tuples, and we also rebuilt the indexes.
2) REINDEX TABLE location_data_2018_12_part4; (likewise ran this for all the tables, and also VACUUM for all the tables).
3) The data in the Non prod instance is larger. One thing to mention here: when we built the Non prod instance we copied a SNAPSHOT from the Prod instance, inserted about 100 million rows on top of that, and then did VACUUM and re-indexed the tables.
I can't think of anything else we can do here, but let us know if you need any more details on this problem. I am happy to share more details.

On Wed, Feb 13, 2019 at 8:49 AM github kran <githubkran@xxxxxxxxx> wrote:

On Tue, Feb 12, 2019 at 12:55 PM github kran <githubkran@xxxxxxxxx> wrote:

On Tue, Feb 12, 2019 at 12:33 PM Michael Lewis <mlewis@xxxxxxxxxxx> wrote:

Did you update the stats by running ANALYZE on the tables involved, or perhaps the entire database, on the 'Non prod Aurora RDS instance'? Can you share the two execution plans?

Here is the plan for both of the DB instances.

Michael - Did you get any chance to look at this issue? Also, we see a sequential scan being done instead of an index scan.

On Tue, Feb 12, 2019 at 11:27 AM github kran <githubkran@xxxxxxxxx> wrote:

On Tue, Feb 12, 2019 at 7:07 AM github kran <githubkran@xxxxxxxxx> wrote:

On Mon, Feb 11, 2019 at 6:00 PM github kran <githubkran@xxxxxxxxx> wrote:

On Mon, Feb 11, 2019 at 3:29 PM Michael Lewis <mlewis@xxxxxxxxxxx> wrote:

Are the default statistics targets the same on both prod and AWS? Have you analyzed all tables being used in this query to ensure the stats are proper? If the optimizer is choosing a different plan, then the stats must be different, IMO.

Michael Lewis | Software Engineer
Entrata

Thanks for your reply. I have verified a few of the tables and their default statistics target, and they seem to be the same, but is there anything in particular you want me to look at to differentiate the Prod and Non prod databases? (Also, the DB instance size is the same, but there is a little more data in the Non prod Aurora RDS instance compared to the Prod instance.)

Query used:
select * from pg_stats where tablename = 'tableName';

On Mon, Feb 11, 2019 at 2:15 PM github kran <githubkran@xxxxxxxxx> wrote:

Hi Postgres Team,
We are using the AWS Aurora RDS Postgres DB 9.6.8 engine. We recently deleted a few million rows from the database and ran into an issue in one of our dev accounts where the DB was not normal after this deletion. We did re-indexing and vacuuming of the entire database, but we couldn't bring it to the same state as earlier. So as the next step we deleted the database and recreated the database by copying the snapshot from a production instance.
Further, we did vacuuming and re-indexing on the database. After this the dev database seems to be in a better state than earlier, but we are seeing a few of our DB calls take more than 1 minute when fetching data, and we observed this is because the query plan was executing a hash join as part of the query, whereas a similar query on the prod instance is not doing any hash join and is returning faster. Also, we did not want to experiment by modifying the DB settings, such as setting enable_hashjoin to off or random_page_cost to 1, as we don't have these settings on the Prod instance.

Note: The partition table sizes we have here are between 40 GB and 75 GB, which is our normal size range; we have a new partition table every 7 days.

Appreciate your ideas on what we could be missing and what we can correct here to reduce the query latency.

Thanks
githubKran
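P.S. For anyone comparing the two instances, a plan capture along these lines works on both sides (the query shape below is only illustrative: the join and filter columns are placeholders, not our real schema):

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT a.*, l.*
    FROM asset_info_2019_2_part4 a
    JOIN location_data_2018_12_part4 l
      ON l.asset_id = a.asset_id             -- placeholder join column
    WHERE a.created_date >= '2019-02-01'
      AND a.created_date <  '2019-02-08';    -- placeholder date range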