Snapshot getting stuck

Ceph version 18.2.1.

We have a nightly backup job that snapshots and exports all RBD images used by libvirt VMs. Over the past couple of weeks, one or more snapshots have intermittently gotten stuck, on 3 occasions so far, like this:

"
zoperator@yggdrasil:~/backup$ cat /mnt/scratch/personal/zoperator/slurm-2888600.out
Creating snap: 10% complete...2024-08-07T10:35:24.687+0200 7f5a2a44b640 0 --2- 172.21.14.135:0/3311921296 >> [v2:172.21.15.135:3300/0,v1:172.21.15.135:6789/0] conn(0x7f59fc002060 0x7f59fc0094b0 unknown :-1 s=AUTH_CONNECTING pgs=0 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).send_auth_request get_initial_auth_request returned -2

2024-08-07T10:36:02.064+0200 7f5a2a44b640 0 --2- 172.21.14.135:0/3311921296 >> [v2:172.21.15.150:3300/0,v1:172.21.15.150:6789/0] conn(0x7f59fc00a2b0 0x7f59fc015010 unknown :-1 s=AUTH_CONNECTING pgs=0 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).send_auth_request get_initial_auth_request returned -2

2024-08-07T10:37:38.191+0200 7f5a23fff640 0 --2- 172.21.14.135:0/3311921296 >> [v2:172.21.15.149:3300/0,v1:172.21.15.149:6789/0] conn(0x7f5a0c063ea0 0x7f5a0c066320 unknown :-1 s=AUTH_CONNECTING pgs=0 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).send_auth_request get_initial_auth_request returned -2

2024-08-07T10:38:16.677+0200 7f5a28c48640 -1 librbd::ImageWatcher: 0x7f5a100076d0 image watch failed: 140024590898336, (107) Transport endpoint is not connected

2024-08-07T10:38:16.677+0200 7f5a28c48640 -1 librbd::Watcher: 0x7f5a100076d0 handle_error: handle=140024590898336: (107) Transport endpoint is not connected
"
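
For anyone comparing notes, here is a minimal Python sketch for pulling the monitor addresses and error codes out of client output like the above, so stuck runs can be compared across nights. The regular expressions are written only against the lines quoted in this mail and are an assumption about the log format, not a documented Ceph format; the names `summarize`, `AUTH_RE` and `WATCH_RE` are made up for illustration.

```python
import re
import sys

# Assumed shape of the msgr2 client lines quoted above:
#   <ts> <thread> <level> --2- <local> >> [v2:<ip>:<port>/0,...] conn(... s=<STATE> ...)
#   ... get_initial_auth_request returned <rc>
AUTH_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+[+-]\d{4})\s+\S+\s+-?\d+\s+"
    r"--2-\s+\S+\s+>>\s+\[v2:(?P<peer>[\d.]+):(?P<port>\d+)/\d+"
    r".*?\bs=(?P<state>\w+)"
    r".*?get_initial_auth_request returned (?P<rc>-?\d+)"
)

# Assumed shape of the librbd watcher lines quoted above:
#   ... librbd::<Component>: <ptr> <text> (<errno>) <message>
WATCH_RE = re.compile(
    r"librbd::(?P<component>\w+): \S+ .*?\((?P<errno>\d+)\) (?P<msg>.+)$"
)


def summarize(log_text):
    """Return (auth_failures, watch_errors) found in the given client output."""
    auth_failures = []
    watch_errors = []
    for line in log_text.splitlines():
        m = AUTH_RE.search(line)
        if m:
            # (timestamp, monitor IP, connection state, return code)
            auth_failures.append((m["ts"], m["peer"], m["state"], int(m["rc"])))
            continue
        m = WATCH_RE.search(line)
        if m:
            # (librbd component, errno, error text)
            watch_errors.append((m["component"], int(m["errno"]), m["msg"].strip()))
    return auth_failures, watch_errors


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 summarize_snap_log.py slurm-2888600.out
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        auth, watch = summarize(fh.read())
    for ts, peer, state, rc in auth:
        print(f"{ts} mon {peer} state={state} rc={rc}")
    for component, err, msg in watch:
        print(f"{component}: errno {err} ({msg})")
```

Run against the job output file, it prints one line per failed monitor connection attempt and one per watcher error, which makes it easier to see whether the same monitors are involved each time.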

The VM is also throwing stack traces from stuck I/O.

Every VM affected so far maps its RBD through multiple firewalls, so that is likely a factor:

Hypervisor <-> Palo Alto firewall <-> OpenBSD firewall <-> Ceph

Any ideas? I haven't found anything in the ceph logs yet.

Best regards,

Torkil

--
Torkil Svensgaard
Sysadmin
MR-Forskningssektionen, afs. 714
DRCMR, Danish Research Centre for Magnetic Resonance
Hvidovre Hospital
Kettegård Allé 30
DK-2650 Hvidovre
Denmark
Tel: +45 386 22828
E-mail: torkil@xxxxxxxx



