Re: 3.2.6 -> 3.6.0 replication, restarting after sync_shutdown_file

"AndrewHardy via Info" <info@xxxxxxxxxxxxxxxxxx> · Mon, 30 Jan 2023 19:56:56 +1300

Hi,

I tackle this problem in the following way:

My setup:
   1. One master site/datacentre, which handles all connections (IMAPS and web-based DAV stuff).
   2. Two standby/backup datacentres using sync_client replication.

On the master, in imapd.conf, I have per-datacentre replication configurations, basically ensuring a replica exists on both slave datacentres.

I have a custom Python script that regularly checks running processes to ensure 'sync_client -r -n {name defined in imapd.conf}' is running. (You could use a standard bash script, entirely up to you.)

If the per-datacentre process isn't running, the script starts it; if it's running, the script does nothing. The script itself is executed periodically from cron. By doing this, even if the master for some reason crashes the sync tasks, an independent script brings them back on whatever cron schedule you're comfortable with.
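
For anyone who wants a starting point, here is a minimal sketch of that kind of checker. It is not my production script. The binary path (a Debian-style location) and the channel names dc1/dc2 are assumptions, so substitute the channel names from your own imapd.conf. It matches on the 'sync_client -r -n <channel>' command line described above and only starts a client when none is found.

```python
import subprocess

# Assumed install path (Debian-style); adjust for your system.
SYNC_CLIENT = "/usr/lib/cyrus/bin/sync_client"
# Hypothetical channel names; use the ones defined in your imapd.conf.
CHANNELS = ["dc1", "dc2"]


def channel_is_running(ps_lines, channel):
    """Return True if any command line is a rolling sync_client for this channel."""
    for line in ps_lines:
        args = line.split()
        # Only consider processes whose executable is sync_client itself.
        if not args or not args[0].endswith("sync_client"):
            continue
        if "-r" in args and "-n" in args:
            idx = args.index("-n")
            if idx + 1 < len(args) and args[idx + 1] == channel:
                return True
    return False


def running_command_lines():
    """List the full command line of every process on the host (Linux ps)."""
    out = subprocess.run(
        ["ps", "-eo", "args="], capture_output=True, text=True, check=True
    )
    return out.stdout.splitlines()


def ensure_channels(channels=CHANNELS):
    """Start a rolling sync_client for each channel that has none running."""
    ps_lines = running_command_lines()
    started = []
    for channel in channels:
        if not channel_is_running(ps_lines, channel):
            # Detach so the client keeps running after this script exits.
            subprocess.Popen(
                [SYNC_CLIENT, "-r", "-n", channel], start_new_session=True
            )
            started.append(channel)
    return started
```

To use it, add a final line calling ensure_channels() and run the file from the crontab of the user Cyrus runs as (assumed here to be the cyrus user), at whatever interval suits you.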

I also take it one step further and use my ansible/automation node to sanity check that the sync client health checker script is running and executing successfully. It also does some magical stuff to ensure that the number of emails per mailbox on the master, compared to those on the slaves, is within a defined tolerance (check the cyr_adm options; it can output some cool metrics, and with some parsing you can do some cool sanity checking of consistency between master/slave nodes). The key here is tolerance/threshold, because with this approach the master and slaves will likely always be out by at least a few messages (depending on how busy the server is and the delay between running the health check on the master and running it on the slave). If the difference exceeds what I'd expect, I tell the playbook to email me (all the comparison stuff is done on the automation server). A watch-the-watcher of sorts. It's not the most graceful solution on the planet, but it has served me well over the past few years.
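
The tolerance comparison itself is simple once you have the per-mailbox counts. A rough sketch follows, assuming the counts have already been collected into dictionaries keyed by mailbox name. How you collect them is up to you and not shown. The function name, the sample mailbox names and the default tolerance of 5 are all made up for illustration.

```python
def out_of_tolerance(master_counts, replica_counts, tolerance=5):
    """Return {mailbox: (master, replica)} where counts differ by more than tolerance.

    A mailbox that is missing on the replica is treated as holding zero messages.
    """
    drift = {}
    for mailbox, master_n in master_counts.items():
        replica_n = replica_counts.get(mailbox, 0)
        if abs(master_n - replica_n) > tolerance:
            drift[mailbox] = (master_n, replica_n)
    return drift


# Hypothetical sample data: alice is within tolerance, bob and carol are not.
master = {"user.alice": 100, "user.bob": 50, "user.carol": 20}
replica = {"user.alice": 98, "user.bob": 30}
problems = out_of_tolerance(master, replica, tolerance=5)
```

If the returned dictionary is non-empty, that is the point where the playbook would send the alert email.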

This might be preferable if you don't want to cause outages for users and restarting the master feels a bit aggressive.

There may be better solutions but this is one that works for me.

-A 

On 30/01/2023, at 5:51 PM, Deborah Pickett via Info <info@xxxxxxxxxxxxxxxxxx> wrote:
> 
> Hi all,
> 
> A few random notes that might help future readers searching the list archives, and a question at the end:
> 
> After one struggle too many with the XBACKUP feature, I've bitten the bullet and switched to rolling replication to do my Cyrus backups. (So I won't be raising any more issues about XBACKUP; thanks Ellie for all the prior help with that.)
> 
> I'm now replicating from the main server (Debian 10 buster-backports, Cyrus 3.2.6) to a backup server at another site (Debian 11 bullseye-backports, Cyrus 3.6.0) within our VPN.
> 
> The Debian 10 buster-backports package is currently at 3.2.6, which would normally be not recent enough to safely replicate or upgrade to 3.6.0, but there's an explicit patch at https://sources.debian.org/src/cyrus-imapd/3.2.6-2%2Bdeb11u2/debian/patches/prepare-3.6-upgrade.patch/ which ensures that the Debian package version applies a uniqueid to every mailbox. I ran a manual check on the 3.2.6 server and confirmed that every folder has a uniqueid and the minor version is 16. Nice!
> 
> Replication over bare IMAP runs perfectly. I couldn't get replication to happen over IMAPS. I've got a Let's Encrypt certificate installed on the replica, and it's installed and working, tested with imtest. But even changing the sync_host configuration to "remote_host_fqdn:993/tls", which has been reported to work by some users over the years, produced a TLS library error ("Unable to get local issuer certificate") which I am guessing is because sync_client can't see the root CA file. I didn't try any harder to make this work; if I think that our VPN backbone is at risk then I'll put an SSH tunnel in.
> 
> My backup plan now is to shut down cyr_master on the replica periodically, take a filesystem snapshot offsite, and start it up again. The master (a live server connected to by users) will pause replication while the replica is offline, and resume when it comes back online. At least, that's the plan, but I've found that if I just `systemctl stop cyrus-imapd` on the replica, then sync_client on the master logs errors like:
>   cyrus/sync_client[35180]: Error in do_sync(): bailing out! Bad protocol
> and doesn't resume after I restart the replica. I end up having to `systemctl restart cyrus-imapd` on the master, which resumes synchronization but results in downtime for users.
> 
> So I've now got a line in /etc/imapd.conf:
>   MyChannelName_sync_shutdown_file: /var/lib/cyrus/sync/MyChannelName/shutdown
> and touching that file indeed causes sync_client to shut down gracefully, but I don't know how to inform cyr_master to restart sync_client again, short of `systemctl restart cyrus-imapd`, which again results in downtime for users. Sending a SIGHUP doesn't seem to do anything. My sync_client entry is in the cyrus.conf STARTUP section. If I moved it to the DAEMON section, it might start up again too soon, so I don't want that.
> 
> Does anyone have a replication-as-backup methodology that avoids sync_client crashing, keeps it offline while the replica is backing up, starts it up again when the backup completes, doesn't stop the master from processing user requests, and avoids race conditions? Thanks in advance.
> 
> --
> Deborah Pickett
> System Administrator
> Polyfoam Australia Pty Ltd

------------------------------------------
Cyrus: Info
Permalink: https://cyrus.topicbox.com/groups/info/T2846d85f9a3f91b8-Mc6bcc82c9748f8ddf9bbb1ec