Re: RBD-mirror instabilities

Dear Cephalopodians,

For those following along through the holiday season, here's my "quick hack" for now, since our rbd-mirror daemons keep going into a "blacklisted" state whenever a bunch of OSDs restart in the cluster.

For those not following along: happy holidays to you, and hopefully some calm days off :-).

To re-summarize: once our rbd-mirror daemons are in that "blacklisted" state, they do not recover by themselves, so I think what is missing is an auto-restart / reconnect after blacklisting
(and, of course, an idea of why the daemons' clients get blacklisted when OSDs restart). Let me know if I should open a tracker issue on that,
or whether I can provide more information (it happens every few nights for us).

Since I wanted to restart them only in case of failure, I came up with some lengthy commands.

I now have two cronjobs set up on the rbd-mirror daemon nodes. The first works "whatever happens", restarting an rbd-mirror daemon if any image sync is broken:

 rbd --id=rbd_mirror_backup mirror pool status \
     | grep -Eq 'unknown|stopped' \
     && systemctl -q is-active ceph-rbd-mirror@rbd_mirror_backup.service \
     && systemctl restart ceph-rbd-mirror@rbd_mirror_backup.service

I run this hourly. With multiple rbd-mirror daemons, this does not catch everything, though: if we enter the failure state (blacklisted rbd-mirror clients), it only ensures that at least one client recovers
and takes over the full load. To restart the other clients only if they are also blacklisted, I do:

 ceph daemon /var/run/ceph/ceph-client.rbd_mirror_backup.$(systemctl show --property MainPID ceph-rbd-mirror@rbd_mirror_backup.service | sed 's/MainPID=//').*.asok rbd mirror status \
     | grep -q Replaying \
     || (systemctl -q is-active ceph-rbd-mirror@rbd_mirror_backup.service \
         && systemctl restart ceph-rbd-mirror@rbd_mirror_backup.service)

This also runs hourly and queries the daemon state itself: if no image is in "Replaying" state, something is wrong and the daemon is restarted.
Technically, the latter cronjob should be sufficient; the first one is only there in case the daemons go completely awry (which I have not observed so far).
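In case it helps anyone adapting this: the moving parts of the two one-liners can be exercised offline against canned command output. The sample status text and PID below are made up; only the grep pattern and the MainPID extraction are taken from the cronjobs themselves:

```shell
#!/bin/sh
# Hypothetical sample of `rbd mirror pool status` output; the real text
# comes from the rbd CLI, only the grep pattern is from the first cronjob.
pool_status='health: WARNING
images: 5 total
    3 replaying
    2 unknown'

# Same test as the first cronjob: is any image in unknown/stopped state?
if printf '%s\n' "$pool_status" | grep -Eq 'unknown|stopped'; then
    echo "pool check: restart needed"
fi

# Same MainPID extraction as the second cronjob, applied to a canned
# `systemctl show --property MainPID` line (PID 12345 is made up).
mainpid_line='MainPID=12345'
pid=$(printf '%s\n' "$mainpid_line" | sed 's/MainPID=//')
echo "/var/run/ceph/ceph-client.rbd_mirror_backup.${pid}.*.asok"
```

The trailing `.*.asok` glob is left for the shell to expand on the real node, since the daemon appends a per-instance suffix to the socket name.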

I made two interesting observations, though:
- It seems the rbd-mirror log is sometimes not filled with errors at all. The cause seems to be that the "rbd-mirror" processes are not SIGHUPed by the logrotate rule shipped with ceph-base.
  I created a tracker issue here:
   https://tracker.ceph.com/issues/43428
- The output of the "rbd mirror status" command is not valid JSON, it is missing the trailing brace. 
  I created a tracker issue here:
   https://tracker.ceph.com/issues/43429
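Until that second issue is fixed, the truncated output can be worked around by appending the missing brace before parsing. The status snippet below is a made-up stand-in for the real admin-socket output; only the "append one brace" trick is the point (python3 is assumed to be available for validation):

```shell
#!/bin/sh
# Made-up, truncated status output: note the missing final '}'.
out='{"image_replayers": [{"name": "rbd/test", "state": "Replaying"}]'

# Work around https://tracker.ceph.com/issues/43429: add the trailing brace.
fixed="${out}}"

# Verify that the repaired string now parses as JSON.
printf '%s' "$fixed" | python3 -c 'import json,sys; json.load(sys.stdin); print("valid JSON")'
```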

Cheers,
	Oliver

Am 24.12.19 um 04:39 schrieb Oliver Freyermuth:
> Dear Cephalopodians,
> 
> running 13.2.6 on the source cluster and 14.2.5 on the rbd mirror nodes and the target cluster,
> I observe regular failures of rbd-mirror processes. 
> 
> By failures, I mean that traffic stops while the daemons are still listed as active rbd-mirror daemons in
> "ceph -s" and are still running. This coincides with a flood of the messages below in the mirror logs.
> 
> This happens "sometimes" when some OSDs go down and come back up in the target cluster (which happens each night, since the disks in that cluster
> briefly go offline during "online" SMART self-tests - that's a problem in itself, but it's a cluster built from hardware that would otherwise have been trashed).
> 
> The rbd-mirror daemons keep running in any case, but synchronization stops. If not all rbd-mirror daemons have failed (we run three, and it usually does not hit all of them),
> the "surviving" one(s) seem not to take over the images the other daemons had locked.
> 
> Right now, I am eyeing a "quick solution" of regularly restarting the rbd-mirror daemons, but if there are any good ideas on which debug info I could collect
> to get this analyzed and fixed, that would of course be appreciated :-).
> 
> Cheers,
> 	Oliver
> 
> -----------------------------------------------
> 2019-12-24 02:08:51.379 7f31c530e700 -1 rbd::mirror::ImageReplayer: 0x559dcb968d00 [2/aabba863-89fd-4ea5-bb8c-0f417225d394] handle_process_entry_safe: failed to commit journal event: (108) Cannot send after transport endpoint shutdown
> 2019-12-24 02:08:51.379 7f31c530e700 -1 rbd::mirror::ImageReplayer: 0x559dcb968d00 [2/aabba863-89fd-4ea5-bb8c-0f417225d394] handle_replay_complete: replay encountered an error: (108) Cannot send after transport endpoint shutdown
> ...
> 2019-12-24 02:08:54.392 7f31c530e700 -1 rbd::mirror::ImageReplayer: 0x559dcb87bb00 [2/23699357-a611-4557-9d73-6ff5279da991] handle_process_entry_safe: failed to commit journal event: (125) Operation canceled
> 2019-12-24 02:08:54.392 7f31c530e700 -1 rbd::mirror::ImageReplayer: 0x559dcb87bb00 [2/23699357-a611-4557-9d73-6ff5279da991] handle_replay_complete: replay encountered an error: (125) Operation canceled
> 2019-12-24 02:08:55.707 7f31ea358700 -1 rbd::mirror::image_replayer::GetMirrorImageIdRequest: 0x559dce2e05b0 handle_get_image_id: failed to retrieve image id: (108) Cannot send after transport endpoint shutdown
> 2019-12-24 02:08:55.707 7f31ea358700 -1 rbd::mirror::image_replayer::GetMirrorImageIdRequest: 0x559dcf47ee70 handle_get_image_id: failed to retrieve image id: (108) Cannot send after transport endpoint shutdown
> ...
> 2019-12-24 02:08:55.716 7f31f5b6f700 -1 rbd::mirror::ImageReplayer: 0x559dcb997680 [2/f8218221-6608-4a2b-8831-84ca0c2cb418] operator(): start failed: (108) Cannot send after transport endpoint shutdown
> 2019-12-24 02:09:25.707 7f31f5b6f700 -1 rbd::mirror::InstanceReplayer: 0x559dcabd5b80 start_image_replayer: global_image_id=0577bd16-acc4-4e9a-81f0-c698a24f8771: blacklisted detected during image replay
> 2019-12-24 02:09:25.707 7f31f5b6f700 -1 rbd::mirror::InstanceReplayer: 0x559dcabd5b80 start_image_replayer: global_image_id=05bd4cca-a561-4a5c-ad83-9905ad5ce34e: blacklisted detected during image replay
> 2019-12-24 02:09:25.707 7f31f5b6f700 -1 rbd::mirror::InstanceReplayer: 0x559dcabd5b80 start_image_replayer: global_image_id=0e614ece-65b1-4b4a-99bd-44dd6235eb70: blacklisted detected during image replay
> -----------------------------------------------
> 


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
