Hello All,
It's related to data flow to OLTP database which is mostly going to be postgres.
We are designing a system which is going to move data from input files(in Avro format) to Goldengate to kafka topics to the database. Incoming files-->GGS--> KAFKA-->OLTP Database. This would be a heavy transactional system processing ~10K txn/second. The database is supposed to show the near real time transactions to the users. The transactions which come from kafka topics will be asynchronous in nature and also there are chances of duplicate data being ingested from kafka topics. So the data has to be stitched/ updated/deduplicated before showing it to the users a complete transaction or say before persisting it to the normalized data model which would be ready for the querying by the end users.
So where should we perform these stitching/update/deduplication stuff in this workflow? Should it happen inside the application somewhere in the kafka consumer(using poison pill concept) or should it happen in a stage schema in the database by persisting all the pieces of transaction as it is coming from kafka topics. Adding another stage layer within the database is going to add some more time to the data to be visible to the users and thus it may not be near real time. As it will take some more time to move the data to the main transaction/normalized tables from the stage tables.
Or should we persist the data as is in the stage area and show the data from stage itself if some users are okay with partial transaction data and showing the complete transaction data from the normalized table to other users who want to see it as a complete transaction but with some delay?
What is the appropriate design to address such use cases?
So where should we perform these stitching/update/deduplication stuff in this workflow? Should it happen inside the application somewhere in the kafka consumer(using poison pill concept) or should it happen in a stage schema in the database by persisting all the pieces of transaction as it is coming from kafka topics. Adding another stage layer within the database is going to add some more time to the data to be visible to the users and thus it may not be near real time. As it will take some more time to move the data to the main transaction/normalized tables from the stage tables.
Or should we persist the data as is in the stage area and show the data from stage itself if some users are okay with partial transaction data and showing the complete transaction data from the normalized table to other users who want to see it as a complete transaction but with some delay?
What is the appropriate design to address such use cases?
Regards
Yudhi