Cloudstack agent crashes JVM with exception in librbd

Hi Wido and all community.

We have hit a very strange issue in our CloudStack installation that is related to Ceph and possibly to the rados-java library.

The agent crashes constantly, which is a very big problem for us.

When the agent crashes, it takes the whole JVM down with it, and there is nothing in the logs at all.
We enabled crash dumps, and after a crash we see the following:

#grep -A1 "Problematic frame" < /hs_err_pid30260.log
 Problematic frame:
 C  [librbd.so.1.0.0+0x5d681]

# gdb /usr/lib/librbd.so.1.0.0 /var/tmp/cores/jsvc.25526.0.core
(gdb)  bt
...
#7  0x00007f30b9a1fed2 in ceph::log::SubsystemMap::should_gather (level=<optimized out>, sub=<optimized out>, this=<optimized out>)
    at ./log/SubsystemMap.h:62
#8  0x00007f30b9a3b693 in ceph::log::SubsystemMap::should_gather (this=<optimized out>, sub=<optimized out>, level=<optimized out>)
    at ./log/SubsystemMap.h:61
#9  0x00007f30b9d879be in ObjectCacher::flusher_entry (this=0x7f2fb4017910) at osdc/ObjectCacher.cc:1527
#10 0x00007f30b9d9851d in ObjectCacher::FlusherThread::entry (this=<optimized out>) at osdc/ObjectCacher.h:374
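In case it helps anyone reproduce this, crash dumps like the ones above can be enabled roughly as follows; the paths are examples, not necessarily the ones we use:

```shell
# Ask the JVM to write its fatal-error report to a known place
# (example path; %p expands to the crashing PID)
export JAVA_OPTS="$JAVA_OPTS -XX:ErrorFile=/var/tmp/cores/hs_err_pid%p.log"

# Allow native core dumps from the jsvc process, and collect them
# in one directory (%e = executable name, %p = PID, %t = timestamp)
ulimit -c unlimited
echo '/var/tmp/cores/%e.%p.%t.core' > /proc/sys/kernel/core_pattern
```

Writing to /proc/sys/kernel/core_pattern requires root and affects the whole host.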

From the Ceph code, this frame is executed while the object cache is being flushed. We do not understand why, because the race condition we use to reproduce the crash is completely different.

As CloudStack does not yet implement the snapshot lifecycle well, it sometimes happens that a volume is already marked as EXPUNGED in the database, and CloudStack then tries to delete the base volume before it tries to unprotect its snapshot.

Sure enough, the unprotect fails and a normal exception comes back (it fails because the snapshot has children):

2015-10-29 09:02:19,401 DEBUG [kvm.resource.KVMHAMonitor] (Thread-1304:null) Executing: /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh -i 10.44.253.13 -p /var/lib/libvirt/PRIMARY -m /mnt/93655746-a9ef-394d-95e9-6e62471dd39f -h 10.44.253.11
2015-10-29 09:02:19,412 DEBUG [kvm.resource.KVMHAMonitor] (Thread-1304:null) Execution is successful.
2015-10-29 09:02:19,554 INFO  [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) Unprotecting and Removing RBD snapshots of image 6789/71b1e2e9-1985-45ca-9ab6-9e5016b86b7c prior to removing the image
2015-10-29 09:02:19,571 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) Succesfully connected to Ceph cluster at cephmon.anolim.net:6789
2015-10-29 09:02:19,608 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) Unprotecting snapshot cloudstack/71b1e2e9-1985-45ca-9ab6-9e5016b86b7c@cloudstack-base-snap
2015-10-29 09:02:19,627 DEBUG [kvm.storage.KVMStorageProcessor] (agentRequest-Handler-5:null) Failed to delete volume: com.cloud.utils.exception.CloudRuntimeException: com.ceph.rbd.RbdException: Failed to unprotect snapshot cloudstack-base-snap
2015-10-29 09:02:19,628 DEBUG [cloud.agent.Agent] (agentRequest-Handler-5:null) Seq 4-1921583831:  { Ans: , MgmtId: 161344838950, via: 4, Ver: v1, Flags: 10, [{"com.cloud.agent.api.Answer":{"result":false,"details":"com.cloud.utils.exception.CloudRuntimeException: com.ceph.rbd.RbdException: Failed to unprotect snapshot cloudstack-base-snap","wait":0}}] }
2015-10-29 09:02:25,722 DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:null) Processing command: com.cloud.agent.api.GetHostStatsCommand
2015-10-29 09:02:25,722 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing: /bin/bash -c idle=$(top -b -n 1| awk -F, '/^[%]*[Cc]pu/{$0=$4; gsub(/[^0-9.,]+/,""); print }'); echo $idle
2015-10-29 09:02:26,249 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Execution is successful.
2015-10-29 09:02:26,250 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing: /bin/bash -c freeMem=$(free|grep cache:|awk '{print $4}');echo $freeMem
2015-10-29 09:02:26,254 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Execution is successful.
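From the Ceph side this failure is expected: a protected snapshot with children cannot be unprotected. It can be checked by hand with the rbd CLI, using the image name from the log above:

```shell
# List the child images that keep the snapshot protected
rbd children cloudstack/71b1e2e9-1985-45ca-9ab6-9e5016b86b7c@cloudstack-base-snap

# This fails (EBUSY) as long as any child exists
rbd snap unprotect cloudstack/71b1e2e9-1985-45ca-9ab6-9e5016b86b7c@cloudstack-base-snap
```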

BUT, about 20 minutes later the agent crashes. If we first remove all the children and let CloudStack delete the volume cleanly, everything is fine: no agent crash within 20 minutes.

We cannot see how this volume-delete action is connected to the agent crash. We also do not understand why it takes roughly 20 minutes before the agent crashes.

In the logs, the last entry before the crash is a GetVmStatsCommand, and then the agent starts up again:

2015-10-29 09:21:55,143 DEBUG [cloud.agent.Agent] (UgentTask-5:null) Sending ping: Seq 4-1343:  { Cmd , MgmtId: -1, via: 4, Ver: v1, Flags: 11, [{"com.cloud.agent.api.PingRoutingCommand":{"newStates":{},"_hostVmStateReport":{"i-881-1117-VM":{"state":"PowerOn","host":"cs2.anolim.net"},"i-7-106-VM":{"state":"PowerOn","host":"cs2.anolim.net"},"i-1683-1984-VM":{"state":"PowerOn","host":"cs2.anolim.net"},"i-11-504-VM":{"state":"PowerOn","host":"cs2.anolim.net"},"i-325-616-VM":{"state":"PowerOn","host":"cs2.anolim.net"},"i-10-52-VM":{"state":"PowerOn","host":"cs2.anolim.net"},"i-941-1237-VM":{"state":"PowerOn","host":"cs2.anolim.net"}},"_gatewayAccessible":true,"_vnetAccessible":true,"hostType":"Routing","hostId":4,"wait":0}}] }
2015-10-29 09:21:55,149 DEBUG [cloud.agent.Agent] (Agent-Handler-3:null) Received response: Seq 4-1343:  { Ans: , MgmtId: 161344838950, via: 4, Ver: v1, Flags: 100010, [{"com.cloud.agent.api.PingAnswer":{"_command":{"hostType":"Routing","hostId":4,"wait":0},"result":true,"wait":0}}] }
2015-10-29 09:22:20,755 DEBUG [cloud.agent.Agent] (agentRequest-Handler-5:null) Processing command: com.cloud.agent.api.GetVmStatsCommand
2015-10-29 09:23:02,630 INFO  [cloud.agent.AgentShell] (main:null) Agent started
2015-10-29 09:23:02,632 INFO  [cloud.agent.AgentShell] (main:null) Implementation Version is 4.3.2


Can you please point us in a direction to dig deeper? We do not understand what is happening.

The issue is very easy to reproduce:

(To speed up the delete process we set storage.cleanup.interval=10.)

1. Create a volume
2. Attach it to a VM
3. Create a snapshot of this volume by hand on Ceph
4. Protect that snapshot by hand on Ceph (snap protect)
5. Create a child image by cloning the snapshot
6. Detach the volume from the VM and delete it
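The manual Ceph steps above (3–5) look roughly like this; the pool name matches our logs, but the volume UUID and child name are placeholders:

```shell
# Step 3: snapshot the CloudStack volume by hand
rbd snap create cloudstack/<volume-uuid>@manual-snap

# Step 4: protect the snapshot so it can be cloned
rbd snap protect cloudstack/<volume-uuid>@manual-snap

# Step 5: create a child image from the protected snapshot
rbd clone cloudstack/<volume-uuid>@manual-snap cloudstack/manual-child
```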

Our problem is that this happens not only in this lab scenario, but in production.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
