Re: Cloudstack agent crashed JVM with exception in librbd

Thank you, Jason!

Any advice for troubleshooting?

I'm looking at the code, and so far I don't see anything wrong :(

2015-11-03 1:32 GMT+02:00 Jason Dillaman <dillaman@xxxxxxxxxx>:
Most likely not going to be related to 13045 since you aren't actively exporting an image diff.  The most likely problem is that the RADOS IO context is being closed prior to closing the RBD image.
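
In other words, close in the reverse order of creation: RBD image first, IO context second. A minimal sketch of the safe ordering with the rados-java bindings (illustrative only, not the agent's actual code):

    import com.ceph.rados.IoCTX;
    import com.ceph.rados.Rados;
    import com.ceph.rbd.Rbd;
    import com.ceph.rbd.RbdImage;

    // Hypothetical helper: tear down in the reverse order of creation.
    void withImage(Rados rados, String pool, String name) throws Exception {
        IoCTX io = rados.ioCtxCreate(pool);
        try {
            Rbd rbd = new Rbd(io);
            RbdImage image = rbd.open(name);
            try {
                // ... work on the image (snapshots, unprotect, remove, ...) ...
            } finally {
                rbd.close(image);     // 1) close the RBD image first
            }
        } finally {
            rados.ioCtxDestroy(io);   // 2) only then destroy the IO context
        }
    }

Reversing those two teardown steps leaves librbd worker threads (such as the ObjectCacher flusher) running against a destroyed context, which matches the assert in the backtraces below.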

--

Jason Dillaman


----- Original Message -----

> From: "Voloshanenko Igor" <igor.voloshanenko@xxxxxxxxx>
> To: "Ceph Users" <ceph-users@xxxxxxxxxxxxxx>
> Sent: Thursday, October 29, 2015 5:27:17 PM
> Subject: Re: Cloudstack agent crashed JVM with exception in
> librbd

> From everything we analyzed, it looks like this issue:
> http://tracker.ceph.com/issues/13045

> PR: https://github.com/ceph/ceph/pull/6097

> Can anyone help us confirm this? :)

> 2015-10-29 23:13 GMT+02:00 Voloshanenko Igor <igor.voloshanenko@xxxxxxxxx>:

> > Additional trace:
>

> > #0  0x00007f30f9891cc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
> > #1  0x00007f30f98950d8 in __GI_abort () at abort.c:89
> > #2  0x00007f30f87b36b5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> > #3  0x00007f30f87b1836 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> > #4  0x00007f30f87b1863 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> > #5  0x00007f30f87b1aa2 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
> > #6  0x00007f2fddb50778 in ceph::__ceph_assert_fail (assertion=assertion@entry=0x7f2fdddeca05 "sub < m_subsys.size()", file=file@entry=0x7f2fdddec9f0 "./log/SubsystemMap.h", line=line@entry=62, func=func@entry=0x7f2fdddedba0 <_ZZN4ceph3log12SubsystemMap13should_gatherEjiE19__PRETTY_FUNCTION__> "bool ceph::log::SubsystemMap::should_gather(unsigned int, int)") at common/assert.cc:77
> > #7  0x00007f2fdda1fed2 in ceph::log::SubsystemMap::should_gather (level=<optimized out>, sub=<optimized out>, this=<optimized out>) at ./log/SubsystemMap.h:62
> > #8  0x00007f2fdda3b693 in ceph::log::SubsystemMap::should_gather (this=<optimized out>, sub=<optimized out>, level=<optimized out>) at ./log/SubsystemMap.h:61
> > #9  0x00007f2fddd879be in ObjectCacher::flusher_entry (this=0x7f2ff80b27a0) at osdc/ObjectCacher.cc:1527
> > #10 0x00007f2fddd9851d in ObjectCacher::FlusherThread::entry (this=<optimized out>) at osdc/ObjectCacher.h:374
> > #11 0x00007f30f9c28182 in start_thread (arg=0x7f2e1a7fc700) at pthread_create.c:312
> > #12 0x00007f30f995547d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
>

> > 2015-10-29 17:38 GMT+02:00 Voloshanenko Igor <igor.voloshanenko@xxxxxxxxx>:
>

> > > Hi Wido and the whole community.
> >
>

> > > We caught a very silly issue on our CloudStack installation, which is
> > > related to Ceph and possibly to the rados-java lib.
> >
>

> > > So, the agent constantly crashes (which causes a very big problem for
> > > us...).
> >
>

> > > When the agent crashes, it takes the whole JVM down with it. And there are no events in the logs at all.
> >
>
> > > We enabled crash dumps, and after a crash we see the following picture:
> >
>

> > > # grep -A1 "Problematic frame" < /hs_err_pid30260.log
> >
>
> > > Problematic frame:
> >
>
> > > C [librbd.so.1.0.0+0x5d681]
> >
>

> > > # gdb /usr/lib/librbd.so.1.0.0 /var/tmp/cores/jsvc.25526.0.core
> >
>
> > > (gdb) bt
> >
>
> > > ...
> >
>
> > > #7  0x00007f30b9a1fed2 in ceph::log::SubsystemMap::should_gather (level=<optimized out>, sub=<optimized out>, this=<optimized out>) at ./log/SubsystemMap.h:62
> > > #8  0x00007f30b9a3b693 in ceph::log::SubsystemMap::should_gather (this=<optimized out>, sub=<optimized out>, level=<optimized out>) at ./log/SubsystemMap.h:61
> > > #9  0x00007f30b9d879be in ObjectCacher::flusher_entry (this=0x7f2fb4017910) at osdc/ObjectCacher.cc:1527
> > > #10 0x00007f30b9d9851d in ObjectCacher::FlusherThread::entry (this=<optimized out>) at osdc/ObjectCacher.h:374
> >
>

> > > From the Ceph code, this part is executed when flushing a cached object...
> > > And we don't understand why, because the scenario we use to reproduce it
> > > is completely different.
> >
>

> > > As CloudStack does not yet have a good implementation of the snapshot
> > > lifecycle, it sometimes happens that some volumes are already marked as
> > > EXPUNGED in the DB, and CloudStack then tries to delete the base volume
> > > before it has tried to unprotect it.
> >
>

> > > Sure, the unprotect fails and a normal exception is returned (it fails
> > > because the snap has children...):
> >
>

> > > 2015-10-29 09:02:19,401 DEBUG [kvm.resource.KVMHAMonitor]
> > > (Thread-1304:null)
> > > Executing:
> > > /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh -i
> > > 10.44.253.13 -p /var/lib/libvirt/PRIMARY -m
> > > /mnt/93655746-a9ef-394d-95e9-6e62471dd39f -h 10.44.253.11
> >
>
> > > 2015-10-29 09:02:19,412 DEBUG [kvm.resource.KVMHAMonitor]
> > > (Thread-1304:null)
> > > Execution is successful.
> >
>
> > > 2015-10-29 09:02:19,554 INFO [kvm.storage.LibvirtStorageAdaptor]
> > > (agentRequest-Handler-5:null) Unprotecting and Removing RBD snapshots of
> > > image 6789/71b1e2e9-1985-45ca-9ab6-9e5016b86b7c prior to removing the
> > > image
> >
>
> > > 2015-10-29 09:02:19,571 DEBUG [kvm.storage.LibvirtStorageAdaptor]
> > > (agentRequest-Handler-5:null) Succesfully connected to Ceph cluster at
> > > cephmon.anolim.net:6789
> >
>
> > > 2015-10-29 09:02:19,608 DEBUG [kvm.storage.LibvirtStorageAdaptor]
> > > (agentRequest-Handler-5:null) Unprotecting snapshot
> > > cloudstack/71b1e2e9-1985-45ca-9ab6-9e5016b86b7c@cloudstack-base-snap
> >
>
> > > 2015-10-29 09:02:19,627 DEBUG [kvm.storage.KVMStorageProcessor]
> > > (agentRequest-Handler-5:null) Failed to delete volume:
> > > com.cloud.utils.exception.CloudRuntimeException:
> > > com.ceph.rbd.RbdException:
> > > Failed to unprotect snapshot cloudstack-base-snap
> >
>
> > > 2015-10-29 09:02:19,628 DEBUG [cloud.agent.Agent]
> > > (agentRequest-Handler-5:null) Seq 4-1921583831: { Ans: , MgmtId:
> > > 161344838950, via: 4, Ver: v1, Flags: 10,
> > > [{"com.cloud.agent.api.Answer":{"result":false,"details":"com.cloud.utils.exception.CloudRuntimeException:
> > > com.ceph.rbd.RbdException: Failed to unprotect snapshot
> > > cloudstack-base-snap","wait":0}}] }
> >
>
> > > 2015-10-29 09:02:25,722 DEBUG [cloud.agent.Agent]
> > > (agentRequest-Handler-2:null) Processing command:
> > > com.cloud.agent.api.GetHostStatsCommand
> >
>
> > > 2015-10-29 09:02:25,722 DEBUG [kvm.resource.LibvirtComputingResource]
> > > (agentRequest-Handler-2:null) Executing: /bin/bash -c idle=$(top -b -n 1|
> > > awk -F, '/^[%]*[Cc]pu/{$0=$4; gsub(/[^0-9.,]+/,""); print }'); echo $idle
> >
>
> > > 2015-10-29 09:02:26,249 DEBUG [kvm.resource.LibvirtComputingResource]
> > > (agentRequest-Handler-2:null) Execution is successful.
> >
>
> > > 2015-10-29 09:02:26,250 DEBUG [kvm.resource.LibvirtComputingResource]
> > > (agentRequest-Handler-2:null) Executing: /bin/bash -c freeMem=$(free|grep
> > > cache:|awk '{print $4}');echo $freeMem
> >
>
> > > 2015-10-29 09:02:26,254 DEBUG [kvm.resource.LibvirtComputingResource]
> > > (agentRequest-Handler-2:null) Execution is successful.
> >
>

> > > BUT, after 20 minutes, the agent crashed... If we remove all the children
> > > and create the conditions for CloudStack to delete the volume, everything
> > > is OK: no agent crash within 20 minutes...
> >
>

> > > We can't connect this action (the volume delete) with the agent crash...
> > > We also don't understand why roughly 20 minutes have to pass before the
> > > agent crashes...
> >
>
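> > > One shape that would fit both symptoms (purely our assumption; the names
> > > below are illustrative fragments, not the real agent code): the
> > > RbdException aborts the delete before the image is closed, the IO context
> > > is destroyed anyway, and librbd's cache flusher thread keeps running
> > > against the dead context until it trips the assert some time later:
> >
> > >     RbdImage image = rbd.open(volumeUuid);
> > >     try {
> > >         image.snapUnprotect("cloudstack-base-snap"); // throws: snap has children
> > >     } finally {
> > >         rados.ioCtxDestroy(io); // hypothetical bug shape: the image is never
> > >                                 // closed, so its ObjectCacher flusher thread
> > >                                 // outlives the destroyed IO context
> > >     }
> >
>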

> > > From the logs, before the crash there is only GetVmStats... And then the agent restarts...
> >
>

> > > 2015-10-29 09:21:55,143 DEBUG [cloud.agent.Agent] (UgentTask-5:null) Sending ping: Seq 4-1343: { Cmd , MgmtId: -1, via: 4, Ver: v1, Flags: 11, [{"com.cloud.agent.api.PingRoutingCommand":{"newStates":{},"_hostVmStateReport":{"i-881-1117-VM":{"state":"PowerOn","host":"cs2.anolim.net"},"i-7-106-VM":{"state":"PowerOn","host":"cs2.anolim.net"},"i-1683-1984-VM":{"state":"PowerOn","host":"cs2.anolim.net"},"i-11-504-VM":{"state":"PowerOn","host":"cs2.anolim.net"},"i-325-616-VM":{"state":"PowerOn","host":"cs2.anolim.net"},"i-10-52-VM":{"state":"PowerOn","host":"cs2.anolim.net"},"i-941-1237-VM":{"state":"PowerOn","host":"cs2.anolim.net"}},"_gatewayAccessible":true,"_vnetAccessible":true,"hostType":"Routing","hostId":4,"wait":0}}] }
> >
>
> > > 2015-10-29 09:21:55,149 DEBUG [cloud.agent.Agent] (Agent-Handler-3:null)
> > > Received response: Seq 4-1343: { Ans: , MgmtId: 161344838950, via: 4,
> > > Ver:
> > > v1, Flags: 100010,
> > > [{"com.cloud.agent.api.PingAnswer":{"_command":{"hostType":"Routing","hostId":4,"wait":0},"result":true,"wait":0}}]
> > > }
> >
>
> > > 2015-10-29 09:22:20,755 DEBUG [cloud.agent.Agent]
> > > (agentRequest-Handler-5:null) Processing command:
> > > com.cloud.agent.api.GetVmStatsCommand
> >
>
> > > 2015-10-29 09:23:02,630 INFO [cloud.agent.AgentShell] (main:null) Agent
> > > started
> >
>
> > > 2015-10-29 09:23:02,632 INFO [cloud.agent.AgentShell] (main:null)
> > > Implementation Version is 4.3.2
> >
>

> > > Can you please give us a direction to dig deeper... as we don't understand
> > > what happened...
> >
>

> > > The issue is very easy to reproduce (see the sketch after the steps):
> >
>

> > > (to speed up the delete process we set storage.cleanup.interval=10)
> >
>

> > > 1. Create a volume
> >
>
> > > 2. Attach it to a VM
> >
>
> > > 3. Create a snapshot of this volume by hand on Ceph
> >
>
> > > 4. Protect that snapshot by hand on Ceph (snap protect)
> >
>
> > > 5. Create a child image via snap clone
> >
>
> > > 6. Detach the volume from the VM and delete it
> >
>
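> > > For steps 3-5, a rough rados-java equivalent of what we do by hand with
> > > the rbd CLI (a sketch only; the feature/order values and names are
> > > illustrative, and the clone() signature should be checked against the
> > > rados-java version in use):
> >
> > >     RbdImage vol = rbd.open(volumeName);
> > >     vol.snapCreate("manual-snap");   // step 3: snapshot the volume
> > >     vol.snapProtect("manual-snap");  // step 4: protect the snapshot
> > >     rbd.close(vol);
> > >     // step 5: clone a child from the protected snapshot
> > >     int features = 1;                // layering only
> > >     int order = 0;                   // default object size
> > >     rbd.clone(volumeName, "manual-snap", io, volumeName + "-child", features, order);
> > >     // step 6 then makes the agent's unprotect fail: the snapshot now has a child
> >
>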

> > > Our problem is that this happens not only in such a lab situation, but in
> > > real life...
> >
>


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
