Hi guys
No changes to the network. The Palo Alto firewall is outside our control
and we do not have access to logs.
I got an off-list suggestion to do the following, so I guess we'll try that:
"
Enable fstrim in your VMs if you are on 18.2.1, or look for the config setting
and set it to false:
rbd_skip_partial_discard=false
Otherwise, upgrade to 18.2.4, which has a bug fix so that when data is deleted
by the VM that information is passed on to Ceph.
"
Thanks,
Torkil
On 13/08/2024 18:41, Joachim Kraftmayer wrote:
Hi Torkil
I would check the logs of the firewalls, starting with the Palo Alto
firewall logs.
Joachim
On Tue, 13 Aug 2024 at 14:36, Eugen Block <eblock@xxxxxx> wrote:
Hi Torkil,
did anything change in the network setup? If those errors haven't
popped up before, what changed? I'm not sure if I have seen this one
yet...
Quoting Torkil Svensgaard <torkil@xxxxxxxx>:
Ceph version 18.2.1.
We have a nightly backup job snapshotting and exporting all RBDs
used for libvirt VMs (the job is sketched after the log below). Starting a
couple of weeks ago we have seen one or more of them get stuck like this,
intermittently, on three occasions so far:
"
zoperator@yggdrasil:~/backup$ cat
/mnt/scratch/personal/zoperator/slurm-2888600.out
Creating snap: 10% complete...2024-08-07T10:35:24.687+0200
7f5a2a44b640 0 --2- 172.21.14.135:0/3311921296 >>
[v2:172.21.15.135:3300/0,v1:172.21.15.135:6789/0]
conn(0x7f59fc002060 0x7f59fc0094b0 unknown :-1 s=AUTH_CONNECTING
pgs=0 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0
tx=0).send_auth_request get_initial_auth_request returned -2
2024-08-07T10:36:02.064+0200 7f5a2a44b640 0 --2-
172.21.14.135:0/3311921296 >>
[v2:172.21.15.150:3300/0,v1:172.21.15.150:6789/0]
conn(0x7f59fc00a2b0 0x7f59fc015010 unknown :-1 s=AUTH_CONNECTING
pgs=0 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0
tx=0).send_auth_request get_initial_auth_request returned -2
2024-08-07T10:37:38.191+0200 7f5a23fff640 0 --2-
172.21.14.135:0/3311921296 >>
[v2:172.21.15.149:3300/0,v1:172.21.15.149:6789/0]
conn(0x7f5a0c063ea0 0x7f5a0c066320 unknown :-1 s=AUTH_CONNECTING
pgs=0 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0
tx=0).send_auth_request get_initial_auth_request returned -2
2024-08-07T10:38:16.677+0200 7f5a28c48640 -1 librbd::ImageWatcher:
0x7f5a100076d0 image watch failed: 140024590898336, (107) Transport
endpoint is not connected
2024-08-07T10:38:16.677+0200 7f5a28c48640 -1 librbd::Watcher:
0x7f5a100076d0 handle_error: handle=140024590898336: (107) Transport
endpoint is not connected
"
The VM is also throwing stack traces from stuck I/O.
Every VM affected by this so far maps its RBD through multiple
firewalls, so that is likely a factor:
Hypervisor <-> Palo Alto firewall <-> OpenBSD firewall <-> Ceph
Any ideas? I haven't found anything in the ceph logs yet.
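In the meantime I'll try to rule out basic reachability from the hypervisor
to the mons through both firewalls, something like this (mon addresses as in
the log above):
"
# v2 and v1 messenger ports on one of the mons
nc -zv 172.21.15.135 3300
nc -zv 172.21.15.135 6789

# and a quick client-level check with a bounded wait
timeout 10 ceph -s
"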
Mvh.
Torkil
--
Torkil Svensgaard
Sysadmin
MR-Forskningssektionen, afs. 714
DRCMR, Danish Research Centre for Magnetic Resonance
Hvidovre Hospital
Kettegård Allé 30
DK-2650 Hvidovre
Denmark
Tel: +45 386 22828
E-mail: torkil@xxxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx