Thank you so much for the response.
Got your point; we will check whether we really need detailed or summarized historical data. It looks like we will need detailed transaction data for at least ~2 years, though.
My understanding was that AWS has two different offerings, and that Aurora PostgreSQL is more performant and costlier than RDS PostgreSQL. I will double-check this, though.
However, how do we prove that RDS/Aurora PostgreSQL will serve our OLTP requirement here, similar to the current on-premise Oracle Exadata? For the OLTP use case we expect ~15K TPS of writes and ~2K TPS of reads, and the UI queries are expected to respond in sub-second time. That said, as I mentioned, individual transactions will be batched before being written to the database, so this should reduce resource consumption and contention.
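To put those targets in concrete terms, here is a back-of-the-envelope calculation (a sketch only; the ~8 inserts per transaction and the ~1000-transaction commit batch are carried over from the current Exadata setup and are assumptions for the target system):

```python
# Rough throughput arithmetic for the stated OLTP targets.
# Assumptions (not measured): 15K write TPS, ~8 inserts per business
# transaction, and commit batches of ~1000 transactions as in the
# current on-premise system.
WRITE_TPS = 15_000       # target business transactions per second
INSERTS_PER_TX = 8       # row inserts per business transaction
BATCH_SIZE = 1_000       # transactions grouped per commit

rows_per_sec = WRITE_TPS * INSERTS_PER_TX   # raw row-ingest rate
commits_per_sec = WRITE_TPS / BATCH_SIZE    # actual commit rate

print(f"rows/sec:    {rows_per_sec:,}")     # 120,000
print(f"commits/sec: {commits_per_sec:g}")  # 15
```

So batching turns ~120K row inserts/sec into only ~15 commits/sec, which is why the batched design should be far gentler on WAL and lock contention than per-transaction commits.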
To test whether Aurora PostgreSQL can meet the above needs (the expected performance at a nominal cost), how should we approach it? Since we won't be able to test everything right away, should we benchmark basic read and write performance first, to gain some confidence before going ahead with development?
Say, for example, one transaction consists of ~8 inserts. We could create a sample target table on Aurora PostgreSQL with the required indexes/constraints, run those inserts from multiple threads (for concurrency) using BlazeMeter at a set TPS, and compare response time, CPU, I/O, and memory usage on the database. Similarly, for read performance, we could run multiple SELECT queries from BlazeMeter and compare response times. Is this the correct approach for validating the database here, or does a better approach exist?
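The harness logic I have in mind looks roughly like the sketch below (hypothetical: execute_transaction() is a stub standing in for the real ~8 INSERTs over a driver such as psycopg2; in practice BlazeMeter or pgbench would generate the load, and you would watch CloudWatch/Performance Insights for CPU, I/O, and memory at the same time):

```python
# Minimal sketch of a concurrent write-benchmark harness.
# The database call is stubbed out so the structure is runnable as-is.
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def execute_transaction():
    """Stub for one business transaction (~8 inserts).
    Replace with real database calls; here we only simulate latency."""
    time.sleep(0.001)  # placeholder for a round-trip to the database

def run_benchmark(threads=8, txs_per_thread=50):
    def worker(_):
        t0 = time.perf_counter()
        execute_transaction()
        return time.perf_counter() - t0

    with ThreadPoolExecutor(max_workers=threads) as pool:
        latencies = sorted(pool.map(worker, range(threads * txs_per_thread)))

    return {
        "count": len(latencies),
        "p50_ms": statistics.median(latencies) * 1000,
        "p99_ms": latencies[int(len(latencies) * 0.99) - 1] * 1000,
    }

print(run_benchmark())
```

Reporting p50/p99 latency rather than just averages matters for the sub-second UI requirement, since tail latency is what users actually see.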
And another question comes to mind: I have read in the past that VACUUM can be a problem in PostgreSQL. Is it going to cause trouble in Aurora PostgreSQL too, for such a highly transactional read/write system? How can we test/validate that?
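On the VACUUM point: Aurora PostgreSQL still depends on autovacuum, whose stock trigger rule is documented in community PostgreSQL (a table is autovacuumed once dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples; defaults 50 and 0.2). One way to validate during the load test is to watch n_dead_tup in pg_stat_user_tables. A sketch of the trigger rule:

```python
# Stock PostgreSQL autovacuum trigger rule (Aurora exposes the same
# parameters). Defaults: threshold = 50, scale_factor = 0.2.
def autovacuum_triggers(dead_tuples, reltuples,
                        threshold=50, scale_factor=0.2):
    """True once dead tuples exceed threshold + scale_factor * reltuples."""
    return dead_tuples > threshold + scale_factor * reltuples

# On a 100M-row table with defaults, autovacuum waits for ~20M dead
# tuples before kicking in -- often too lazy for high-churn OLTP tables,
# which is why per-table scale_factor tuning is common.
print(autovacuum_triggers(19_000_000, 100_000_000))  # False
print(autovacuum_triggers(21_000_000, 100_000_000))  # True
```

Daily range partitioning (as in the current Exadata system) also helps here, since dropping a partition avoids generating dead tuples at all.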
On Sun, 10 Dec 2023 at 01:29, Ron Johnson <ronljohnsonjr@xxxxxxxxx> wrote:
On Sat, Dec 9, 2023 at 2:13 PM veem v <veema0000@xxxxxxxxx> wrote:
Ron Johnson <ronljohnsonjr@xxxxxxxxx> wrote:

"OK" is relative, but it's what we did in a similar situation: two years of data on-line and 5 years of data in compressed files in S3. (We're required to keep data for 7 years, but they never ask for records more than 2 years old. If they ever do, we'll manually load the relevant data back into PG.)
(I can't imagine that querying billions of unindexed flat-file records via S3 would be fast.)
How often do end users look for data more than 90 days old? Two years old?
How quickly does the old data need to be available?
Maybe a four-tiered system of PG-Snowflake-S3-S3_Glacier would balance speed and cost (or maybe Snowflake would just drive up costs).

Thank you so much, Ron.
When you said "two years of data on-line and 5 years of data in compressed files in S3", do you mean two years of data in Aurora PostgreSQL, which will be the OLTP database, and the rest just dumped as-is into an S3 bucket? And when we need to query that older data, we simply read those S3 files back into Aurora PostgreSQL?

(RDS PostgreSQL, not Aurora, but that's beside the point.) Yes. But keep reading...

The currently running Oracle Exadata system holds six months of data (~200 TB), and the transaction tables are all range-partitioned on a daily basis. Of that, ~2 months of data gets queried frequently and the other ~4 months less frequently. However, in the target architecture we want to build on the cloud, the analytics/data science team has a requirement to query ~3 years of history. Beyond ~3 years we may need the data only rarely.
We were initially thinking of having just one database serve both OLTP and OLAP use cases (somewhat like Oracle Exadata currently does for us), but it seems we don't have such an option on AWS. PostgreSQL would serve the OLTP use case, whereas Snowflake would serve the OLAP use case.
So do you suggest having both databases in use: recent transaction data for the last 3 months streamed to Aurora PostgreSQL; data from 3 months to 3 years parked in Snowflake to serve the OLAP/analytics use case; and data from 3 years to 10 years kept in S3 (in Parquet or Iceberg format), so that Snowflake can query it directly when needed?
OR
Do you suggest keeping just the last ~3 months of data on Aurora PostgreSQL and everything else in Snowflake, considering that it stores data in a compressed format and that storage is cheap ($23 per TB per month)?
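For what it's worth, a rough cost sketch at that $23/TB/month rate (the 10-year raw volume is extrapolated from the ~200 TB per 6 months figure, and the compression ratio is purely an assumption; actual Snowflake compression varies widely by dataset):

```python
# Hypothetical storage-cost estimate; all inputs are extrapolations,
# not vendor quotes.
RAW_TB_PER_6_MONTHS = 200    # current on-premise footprint
YEARS_RETAINED = 10
COMPRESSION_RATIO = 4        # assumed; varies by data and encoding
PRICE_PER_TB_MONTH = 23      # USD, the quoted storage rate

raw_tb = RAW_TB_PER_6_MONTHS * 2 * YEARS_RETAINED  # 4000 TB uncompressed
stored_tb = raw_tb / COMPRESSION_RATIO             # 1000 TB compressed
monthly_cost = stored_tb * PRICE_PER_TB_MONTH      # USD per month

print(f"~{stored_tb:g} TB stored, ~${monthly_cost:,.0f}/month")
```

Even under a generous compression assumption, storage alone lands in the tens of thousands of dollars per month, before any compute cost, so the tiering decision is worth modeling carefully.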
A few colleagues are pointing to Databricks for the analytics use case. Is that a good option here?

I can't answer that without knowing what the end users actually need (details, or just summaries of historical data, in different tiers). You all will have to do the cost:benefit analysis of different architectures.

On Sat, 9 Dec 2023 at 16:43, veem v <veema0000@xxxxxxxxx> wrote:

Hello All,

Although this is not exactly related to open-source PostgreSQL, I want to ask the question here to get colleagues' views, given the decades of database experience on this list. We would like some guidance on whether the design below looks okay for our customer use case.
We currently have financial-system transaction data streaming to an on-premise Oracle Exadata (X9). This database supports processing of 400 million transactions per day. A single transaction for us is a combination of 7-8 inserts into different transaction tables, with indexes, unique constraints, etc. defined on them. Transactions are processed/committed in batches (~1000 batch size), and the system persists data for ~6 months. We also run all sorts of OLAP applications (daily/monthly batch reports) on the same database, along with some user-facing UI applications showing customer transactions. So it currently serves a hybrid workload and is a one-stop solution for all use cases.
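As a sanity check on that daily volume (simple arithmetic only; it assumes an even distribution over the day, which real traffic won't have, so peak rates will be several times higher):

```python
# Average transaction rate implied by 400 million transactions/day.
TX_PER_DAY = 400_000_000
SECONDS_PER_DAY = 86_400

avg_tps = TX_PER_DAY / SECONDS_PER_DAY
print(f"average: ~{avg_tps:,.0f} TPS")  # ~4,630 TPS on average
```

Any benchmark should therefore target well above the ~4.6K TPS average to leave headroom for peaks.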
Many of our applications are moving from on-premise to AWS as part of a modernization journey, with AWS as the chosen cloud partner. The product is also expected to expand into more regions, so this system must handle an increase in transaction volume. In addition, we have a requirement to persist transaction data for ~10 years, to have it available for analytics/data science use cases.
So the team is thinking of splitting it into two parts:
1) An OLTP-type use case, in which we persist/write the transaction data quickly and show it to the UI-related apps in near real time. This database would store at most 60-90 days of transaction data. We are not sure there is an Oracle Exadata equivalent on AWS, so the team is planning to use/experiment with Aurora PostgreSQL. Please correct me if there are other options we should consider.
2) Then move data older than ~90 days into another database or S3 object storage, which would keep it for ~10 years and make it queryable through the necessary APIs. That tier is supposed to cater to OLAP/analytics/data science use cases.
Is the above design okay? Also, regarding the second point, i.e., persisting the historical data in a queryable state: should we go for a database like Snowflake, or just keep the data on S3 as-is and make it queryable through APIs? Please advise.