Greetings,

* Tiemen Ruiten (t.ruiten@xxxxxxxxxxx) wrote:
> On Fri, Jun 14, 2019 at 5:43 PM Stephen Frost <sfrost@xxxxxxxxxxx> wrote:
> > * Tiemen Ruiten (t.ruiten@xxxxxxxxxxx) wrote:
> > > checkpoint_timeout = 60min
> >
> > That seems like a pretty long timeout.
>
> My reasoning was that a longer recovery time to avoid writes would be
> acceptable because there are two more nodes in the cluster to fall back on
> in case of emergency.

Ok, so you want fewer checkpoints because you expect to fail over to a
replica rather than recover the primary on a failure.  If you're doing
synchronous replication, then that certainly makes sense.  If you
aren't, then you're deciding that you're alright with losing some number
of writes by failing over rather than recovering the primary, which can
also be acceptable but it's certainly much more questionable.

> > > My problem is that checkpoints are taking a long time. Even when I run a
> > > few manual checkpoints one after the other, they keep taking very long,
> > > up to 10 minutes:
> >
> > You haven't said *why* this is an issue...  Why are you concerned with
> > how long it takes to do a checkpoint?
>
> During normal operation I don't mind that it takes a long time, but when
> performing maintenance I want to be able to gracefully bring down the
> master without long delays to promote one of the standby's.

I'm getting the feeling that your replicas are async, but it sounds like
you'd be better off with having at least one sync replica, so that you
can flip to it quickly.

Alternatively, it would be kind of nice to have an easier way to make
the primary stop accepting new writes, flush everything to the replicas,
and report that it's completed doing so.  That would allow you to
promote a replica without losing anything, and *then* go through the
process of doing a checkpoint on the primary.

Then again, you run into the issue that if your async replicas are very
far behind then you're still going to have a long period of time between
the "stop accepting new writes" and "finished flushing everything to the
replicas".

> > The time information is all there and it tells you what it's doing and
> > how much had to be done...  If you're unhappy with how long it takes to
> > write out gigabytes of data and fsync hundreds of files, talk to your
> > storage people...
>
> I am the storage people too :)

Great!  Make it go faster. :)

Thanks,

Stephen
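
To put a rough number on the point above about async replicas that are far behind, here is a minimal back-of-the-envelope sketch. It assumes the replica replays WAL at a roughly constant rate once the primary stops accepting writes, which is a simplification. The function name `drain_seconds` and the figures are hypothetical. On PostgreSQL 10 or later the lag in bytes could be read with `pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)` against `pg_stat_replication`.

```python
def drain_seconds(lag_bytes: int, replay_bytes_per_sec: float) -> float:
    """Rough time for a replica to catch up after the primary stops taking writes.

    Assumes a constant replay rate, which real systems only approximate.
    """
    if replay_bytes_per_sec <= 0:
        raise ValueError("replay rate must be positive")
    return lag_bytes / replay_bytes_per_sec


# Hypothetical figures: replica is 8 GiB behind and replays 64 MiB/s.
lag = 8 * 1024**3
rate = 64 * 1024**2
print(f"{drain_seconds(lag, rate):.0f} s")  # prints "128 s"
```

If that estimate is longer than the maintenance window can tolerate, a sync replica or a smaller steady-state lag is probably the more useful thing to fix than the checkpoint itself.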