RBD-mirror instabilities

Dear Cephalopodians,

I am running 13.2.6 on the source cluster and 14.2.5 on the rbd-mirror nodes and the target cluster,
and I regularly observe failures of the rbd-mirror processes.

By failures, I mean that traffic stops, while the daemons keep running and are still listed as active rbd-mirror daemons in
"ceph -s". This coincides with a large number of messages like the ones below in the mirror logs.

This happens "sometimes" when some OSDs go down and come back up in the target cluster. That occurs every night, since the disks in that cluster
briefly go offline during "online" SMART self-tests (a problem in itself, but the cluster is built from hardware that would otherwise have been scrapped).

The rbd-mirror daemons keep running in any case, but synchronization stops. If not all rbd-mirror daemons have failed (we have three running, and it usually does not hit all of them),
the surviving one(s) do not seem to take over the images the other daemons had locked.

Right now, I am considering the "quick solution" of regularly restarting the rbd-mirror daemons, but if there are any good ideas on which debug info I could collect
to get this analyzed and fixed, that would of course be appreciated :-).

Cheers,
	Oliver

-----------------------------------------------
2019-12-24 02:08:51.379 7f31c530e700 -1 rbd::mirror::ImageReplayer: 0x559dcb968d00 [2/aabba863-89fd-4ea5-bb8c-0f417225d394] handle_process_entry_safe: failed to commit journal event: (108) Cannot send after transport endpoint shutdown
2019-12-24 02:08:51.379 7f31c530e700 -1 rbd::mirror::ImageReplayer: 0x559dcb968d00 [2/aabba863-89fd-4ea5-bb8c-0f417225d394] handle_replay_complete: replay encountered an error: (108) Cannot send after transport endpoint shutdown
...
2019-12-24 02:08:54.392 7f31c530e700 -1 rbd::mirror::ImageReplayer: 0x559dcb87bb00 [2/23699357-a611-4557-9d73-6ff5279da991] handle_process_entry_safe: failed to commit journal event: (125) Operation canceled
2019-12-24 02:08:54.392 7f31c530e700 -1 rbd::mirror::ImageReplayer: 0x559dcb87bb00 [2/23699357-a611-4557-9d73-6ff5279da991] handle_replay_complete: replay encountered an error: (125) Operation canceled
2019-12-24 02:08:55.707 7f31ea358700 -1 rbd::mirror::image_replayer::GetMirrorImageIdRequest: 0x559dce2e05b0 handle_get_image_id: failed to retrieve image id: (108) Cannot send after transport endpoint shutdown
2019-12-24 02:08:55.707 7f31ea358700 -1 rbd::mirror::image_replayer::GetMirrorImageIdRequest: 0x559dcf47ee70 handle_get_image_id: failed to retrieve image id: (108) Cannot send after transport endpoint shutdown
...
2019-12-24 02:08:55.716 7f31f5b6f700 -1 rbd::mirror::ImageReplayer: 0x559dcb997680 [2/f8218221-6608-4a2b-8831-84ca0c2cb418] operator(): start failed: (108) Cannot send after transport endpoint shutdown
2019-12-24 02:09:25.707 7f31f5b6f700 -1 rbd::mirror::InstanceReplayer: 0x559dcabd5b80 start_image_replayer: global_image_id=0577bd16-acc4-4e9a-81f0-c698a24f8771: blacklisted detected during image replay
2019-12-24 02:09:25.707 7f31f5b6f700 -1 rbd::mirror::InstanceReplayer: 0x559dcabd5b80 start_image_replayer: global_image_id=05bd4cca-a561-4a5c-ad83-9905ad5ce34e: blacklisted detected during image replay
2019-12-24 02:09:25.707 7f31f5b6f700 -1 rbd::mirror::InstanceReplayer: 0x559dcabd5b80 start_image_replayer: global_image_id=0e614ece-65b1-4b4a-99bd-44dd6235eb70: blacklisted detected during image replay
-----------------------------------------------
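
For the restart workaround, a watchdog could key off the messages above instead of restarting blindly. Below is a minimal sketch in Python, assuming the mirror log is available as plain text lines. The function names, the threshold, and the choice of which messages count as stall indicators are assumptions made for illustration; none of this is provided by rbd-mirror itself.

```python
import re

# Stall signatures copied from the log excerpt above. Treating these two
# messages as the indicators of a stalled daemon is an assumption.
STALL_PATTERNS = (
    re.compile(r"blacklisted detected during image replay"),
    re.compile(r"\(108\) Cannot send after transport endpoint shutdown"),
)

# Extracts the global image id from InstanceReplayer lines.
GLOBAL_ID = re.compile(r"global_image_id=([0-9a-f-]{36})")


def scan_mirror_log(lines):
    """Count lines with a stall signature and collect affected image ids."""
    hits = 0
    image_ids = set()
    for line in lines:
        if any(p.search(line) for p in STALL_PATTERNS):
            hits += 1
            match = GLOBAL_ID.search(line)
            if match:
                image_ids.add(match.group(1))
    return hits, image_ids


def looks_stalled(lines, threshold=3):
    """Hypothetical heuristic: flag the daemon once enough signatures appear."""
    hits, _ = scan_mirror_log(lines)
    return hits >= threshold
```

The sketch only reports. Whether to restart a daemon on a positive result, and how to read only the recent part of the log, is left to the caller.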

