Hello Postgres community, Instructure is looking to commission a bit of open source software that manages the cost of a database cluster in a cloud environment. Specifically, we have hundreds of primary/secondary postgres clusters in AWS' EC2, which are busy during the day and not so busy at night. We would like to shrink the instance size of the secondary db once its load has reliably fallen below a threshold and its load has been shifted to the primary, and then undo all that before the morning rush. (We don't want to stop it, because, while that would save even more money, we need it to keep streaming.) The project is generally pretty easy, but there are some wrinkles that make it not quite as straightforward as it sounds: 1. We have lots of historical data to help us predict the time of day that these thresholds should be crossed and the appropriate action should be taken, but load patterns shift over time and the model will need to be continually updated. Also, there is always the chance some organic client load comes along which forces us to enlarge our secondary DB earlier than we otherwise would. Determining how wide the spike of unexpected client load would need to be to force such a premature embiggening of the secondary is going to need some user-adjustable knobs. Additionally, we run a lot of maintenance work at night (think pg_repack) and we would not want to bother enlarging our secondary for this kind of non-client load. So there will need to be some user-adjustable filters to try to ignore this expected load when determining if a premature embiggening should happen. 2. Occasionally we will have planned client workloads in our normally idle times. In those cases, we'd want to be able to easily schedule this system not to shrink anything for those windows. Sometimes we know about these events weeks in advance, so we'll want to be able to easily manage the list of upcoming exception windows, because if something is far enough into the future, we will doubtlessly wonder several times if we've remembered to take it into account. 3. When embiggening the secondary node back to its proper size, it might that AWS EC2 has no more instances of the desired type available. There are almost always multiple alternative instances sizes that would be sufficient, if not as cheap, but the system will need to have the smarts to consult a ranked table of choices when necessary. 4. Before shrinking a secondary db, we would want to alter a DNS cname to shift load from it to the primary, and not proceed until client load has been entirely migrated. When we inflate the secondary in the morning, we would need to change the cname back, but only after streaming replication has resumed. In our case, changing DNS is a simple command line invocation. 5. Most of our postgres clusters have one secondary, but some have multiple. There would need to be flexibility in how many secondaries each cluster has, and the proper calculations made to know how much load would be migrating to the primary come shrinkage time. 6. Long running transactions on the secondary will need to be addressed; fundamentally, they'll have to be cancelled, but ideally different kinds of queries might get different runways. https://gitlab.com/depesz/pg_terminator/-/tree/master provides a good possible framework. Required DeliverablesWe need software that does all of the above. It needs to run on Ubuntu Linux, and be written in Ruby, bash, perl, Python, or Go (the languages our team has the most knowledge in). We expect a unit test framework. We will be hosting the project in Instructure's Github account, and will be happy to accept patches from the public. As an open-source project, it is easy to conceive of a system that would do all this with a plugin architecture. Of course, the system we are commissioning would need to support Instructure's needs, but could easily be extended to also support AWS RDS databases, or MySQL databases, or Azure, or GCP, etc. It could have callbacks for arbitrary actions to take before shrinking or enlarging a secondary, or various ways to compute times to take action, or a variety of ways to manage exception windows to not take action, and so on. We are only interested in paying for a system that works for us, but if a little abstraction can be slipped in to make this system easily extensible for others, so much the better. Timeline
Required Elements of Proposal
Evaluation Criteria of Proposal
|