Re: Cloudstack agent crashed JVM with exception in librbd

Voloshanenko Igor <igor.voloshanenko@xxxxxxxxx> · Fri, 30 Oct 2015 10:37:51 +0200

It's pain, but not... :(We already used your updated lib in dev env... :(

2015-10-30 10:06 GMT+02:00 Wido den Hollander <wido@xxxxxxxx>:

On 29-10-15 16:38, Voloshanenko Igor wrote:

> Hi Wido and all community.

>

> We catched very idiotic issue on our Cloudstack installation, which

> related to ceph and possible to java-rados lib.

>

I think you ran into this one:

https://issues.apache.org/jira/browse/CLOUDSTACK-8879

Cleaning up RBD snapshots for volumes didn't go well and caused the JVM

to crash.

Wido

> So, we have constantly agent crashed (which cause very big problem for

> us... ).

>

> When agent crashed - it's crash JVM. And no event in logs at all.

> We enabled crush dump, and after crash we see next picture:

>

> #grep -A1 "Problematic frame" < /hs_err_pid30260.log

>  Problematic frame:

>  C  [librbd.so.1.0.0+0x5d681]

>

> # gdb /usr/lib/librbd.so.1.0.0 /var/tmp/cores/jsvc.25526.0.core

> (gdb)  bt

> ...

> #7  0x00007f30b9a1fed2 in ceph::log::SubsystemMap::should_gather

> (level=<optimized out>, sub=<optimized out>, this=<optimized out>)

>     at ./log/SubsystemMap.h:62

> #8  0x00007f30b9a3b693 in ceph::log::SubsystemMap::should_gather

> (this=<optimized out>, sub=<optimized out>, level=<optimized out>)

>     at ./log/SubsystemMap.h:61

> #9  0x00007f30b9d879be in ObjectCacher::flusher_entry

> (this=0x7f2fb4017910) at osdc/ObjectCacher.cc:1527

> #10 0x00007f30b9d9851d in ObjectCacher::FlusherThread::entry

> (this=<optimized out>) at osdc/ObjectCacher.h:374

>

> From ceph code, this part executed when flushing cache object... And we

> don;t understand why. Becasue we have absolutely different race

> condition to reproduce it.

>

> As cloudstack have not good implementation yet of snapshot lifecycle,

> sometime, it's happen, that some volumes already marked as EXPUNGED in

> DB and then cloudstack try to delete bas Volume, before it's try to

> unprotect it.

>

> Sure, unprotecting fail, normal exception returned back (fail because

> snap has childs... )

>

> 2015-10-29 09:02:19,401 DEBUG [kvm.resource.KVMHAMonitor]

> (Thread-1304:null) Executing:

> /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh

> -i 10.44.253.13 -p /var/lib/libvirt/PRIMARY -m

> /mnt/93655746-a9ef-394d-95e9-6e62471dd39f -h 10.44.253.11

> 2015-10-29 09:02:19,412 DEBUG [kvm.resource.KVMHAMonitor]

> (Thread-1304:null) Execution is successful.

> 2015-10-29 09:02:19,554 INFO  [kvm.storage.LibvirtStorageAdaptor]

> (agentRequest-Handler-5:null) Unprotecting and Removing RBD snapshots of

> image 6789/71b1e2e9-1985-45ca-9ab6-9e5016b86b7c prior to removing the image

> 2015-10-29 09:02:19,571 DEBUG [kvm.storage.LibvirtStorageAdaptor]

> (agentRequest-Handler-5:null) Succesfully connected to Ceph cluster at

> cephmon.anolim.net:6789 <http://cephmon.anolim.net:6789>

> 2015-10-29 09:02:19,608 DEBUG [kvm.storage.LibvirtStorageAdaptor]

> (agentRequest-Handler-5:null) Unprotecting snapshot

> cloudstack/71b1e2e9-1985-45ca-9ab6-9e5016b86b7c@cloudstack-base-snap

> 2015-10-29 09:02:19,627 DEBUG [kvm.storage.KVMStorageProcessor]

> (agentRequest-Handler-5:null) Failed to delete volume:

> com.cloud.utils.exception.CloudRuntimeException:

> com.ceph.rbd.RbdException: Failed to unprotect snapshot cloudstack-base-snap

> 2015-10-29 09:02:19,628 DEBUG [cloud.agent.Agent]

> (agentRequest-Handler-5:null) Seq 4-1921583831:  { Ans: , MgmtId:

> 161344838950, via: 4, Ver: v1, Flags: 10,

> [{"com.cloud.agent.api.Answer":{"result":false,"details":"com.cloud.utils.exception.CloudRuntimeException:

> com.ceph.rbd.RbdException: Failed to unprotect snapshot

> cloudstack-base-snap","wait":0}}] }

> 2015-10-29 09:02:25,722 DEBUG [cloud.agent.Agent]

> (agentRequest-Handler-2:null) Processing command:

> com.cloud.agent.api.GetHostStatsCommand

> 2015-10-29 09:02:25,722 DEBUG [kvm.resource.LibvirtComputingResource]

> (agentRequest-Handler-2:null) Executing: /bin/bash -c idle=$(top -b -n

> 1| awk -F, '/^[%]*[Cc]pu/{$0=$4; gsub(/[^0-9.,]+/,""); print }'); echo $idle

> 2015-10-29 09:02:26,249 DEBUG [kvm.resource.LibvirtComputingResource]

> (agentRequest-Handler-2:null) Execution is successful.

> 2015-10-29 09:02:26,250 DEBUG [kvm.resource.LibvirtComputingResource]

> (agentRequest-Handler-2:null) Executing: /bin/bash -c

> freeMem=$(free|grep cache:|awk '{print $4}');echo $freeMem

> 2015-10-29 09:02:26,254 DEBUG [kvm.resource.LibvirtComputingResource]

> (agentRequest-Handler-2:null) Execution is successful.

>

> BUT, after 20 minutes - agent crashed... If we remove all childs and

> create conditions for cloudstack to delete volume - all OK - no agent

> crash in 20 minutes...

>

> We can't connect this action - Volume delete with agent crashe... Also

> we don't understand why +- 20 minutes need to last, and only then agent

> crashed...

>

> From logs, before crash - only GetVMStats... And then - agent started...

>

> 2015-10-29 09:21:55,143 DEBUG [cloud.agent.Agent] (UgentTask-5:null)

> Sending ping: Seq 4-1343:  { Cmd , MgmtId: -1, via: 4, Ver: v1, Flags:

> 11,

> [{"com.cloud.agent.api.PingRoutingCommand":{"newStates":{},"_hostVmStateReport":{"i-881-1117-VM":{"state":"PowerOn","host":"cs2.anolim.net

> <http://cs2.anolim.net>"},"i-7-106-VM":{"state":"PowerOn","host":"cs2.anolim.net

> <http://cs2.anolim.net>"},"i-1683-1984-VM":{"state":"PowerOn","host":"cs2.anolim.net

> <http://cs2.anolim.net>"},"i-11-504-VM":{"state":"PowerOn","host":"cs2.anolim.net

> <http://cs2.anolim.net>"},"i-325-616-VM":{"state":"PowerOn","host":"cs2.anolim.net

> <http://cs2.anolim.net>"},"i-10-52-VM":{"state":"PowerOn","host":"cs2.anolim.net

> <http://cs2.anolim.net>"},"i-941-1237-VM":{"state":"PowerOn","host":"cs2.anolim.net

> <http://cs2.anolim.net>"}},"_gatewayAccessible":true,"_vnetAccessible":true,"hostType":"Routing","hostId":4,"wait":0}}]

> }

> 2015-10-29 09:21:55,149 DEBUG [cloud.agent.Agent] (Agent-Handler-3:null)

> Received response: Seq 4-1343:  { Ans: , MgmtId: 161344838950, via: 4,

> Ver: v1, Flags: 100010,

> [{"com.cloud.agent.api.PingAnswer":{"_command":{"hostType":"Routing","hostId":4,"wait":0},"result":true,"wait":0}}]

> }

> 2015-10-29 09:22:20,755 DEBUG [cloud.agent.Agent]

> (agentRequest-Handler-5:null) Processing command:

> com.cloud.agent.api.GetVmStatsCommand

> 2015-10-29 09:23:02,630 INFO  [cloud.agent.AgentShell] (main:null) Agent

> started

> 2015-10-29 09:23:02,632 INFO  [cloud.agent.AgentShell] (main:null)

> Implementation Version is 4.3.2

>

>

> Can you please give us direction to go deeper... As we dont udnerstand

> what happened...

>

> Issue reproducing very easy:

>

> (to speed up delete process we set storage.cleanup.interval=10)

>

> 1. Create Volume

> 2. Attach it to VM

> 3. Create snapshot from this volume by hand on ceph

> 4. Do snap_protect by hand on ceph for snap from previous step

> 5. Create child via snap clone

> 6. Deattach volume from VM and delete it

>

> Our problem - that this happen not in such lab situation, but in real

> life...

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com