From everything we analyzed, it looks like this is the issue.
Can anyone help us confirm this? :)
2015-10-29 23:13 GMT+02:00 Voloshanenko Igor <igor.voloshanenko@xxxxxxxxx>:
Additional trace:

#0  0x00007f30f9891cc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007f30f98950d8 in __GI_abort () at abort.c:89
#2  0x00007f30f87b36b5 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007f30f87b1836 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f30f87b1863 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x00007f30f87b1aa2 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6  0x00007f2fddb50778 in ceph::__ceph_assert_fail (assertion=assertion@entry=0x7f2fdddeca05 "sub < m_subsys.size()", file=file@entry=0x7f2fdddec9f0 "./log/SubsystemMap.h", line=line@entry=62, func=func@entry=0x7f2fdddedba0 <_ZZN4ceph3log12SubsystemMap13should_gatherEjiE19__PRETTY_FUNCTION__> "bool ceph::log::SubsystemMap::should_gather(unsigned int, int)") at common/assert.cc:77
#7  0x00007f2fdda1fed2 in ceph::log::SubsystemMap::should_gather (level=<optimized out>, sub=<optimized out>, this=<optimized out>) at ./log/SubsystemMap.h:62
#8  0x00007f2fdda3b693 in ceph::log::SubsystemMap::should_gather (this=<optimized out>, sub=<optimized out>, level=<optimized out>) at ./log/SubsystemMap.h:61
#9  0x00007f2fddd879be in ObjectCacher::flusher_entry (this=0x7f2ff80b27a0) at osdc/ObjectCacher.cc:1527
#10 0x00007f2fddd9851d in ObjectCacher::FlusherThread::entry (this=<optimized out>) at osdc/ObjectCacher.h:374
#11 0x00007f30f9c28182 in start_thread (arg=0x7f2e1a7fc700) at pthread_create.c:312
#12 0x00007f30f995547d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

2015-10-29 17:38 GMT+02:00 Voloshanenko Igor <igor.voloshanenko@xxxxxxxxx>:

Hi Wido and all community.

We hit a very strange issue on our CloudStack installation, which is related to Ceph and possibly to the java-rados lib. The agent crashes constantly (which is a very big problem for us...). When the agent crashes, it takes the JVM down with it.
And there are no events in the logs at all. We enabled crash dumps, and after a crash we see the following picture:

# grep -A1 "Problematic frame" < /hs_err_pid30260.log
# Problematic frame:
# C  [librbd.so.1.0.0+0x5d681]

# gdb /usr/lib/librbd.so.1.0.0 /var/tmp/cores/jsvc.25526.0.core
(gdb) bt
...
#7  0x00007f30b9a1fed2 in ceph::log::SubsystemMap::should_gather (level=<optimized out>, sub=<optimized out>, this=<optimized out>) at ./log/SubsystemMap.h:62
#8  0x00007f30b9a3b693 in ceph::log::SubsystemMap::should_gather (this=<optimized out>, sub=<optimized out>, level=<optimized out>) at ./log/SubsystemMap.h:61
#9  0x00007f30b9d879be in ObjectCacher::flusher_entry (this=0x7f2fb4017910) at osdc/ObjectCacher.cc:1527
#10 0x00007f30b9d9851d in ObjectCacher::FlusherThread::entry (this=<optimized out>) at osdc/ObjectCacher.h:374

From the Ceph code, this part is executed while flushing a cache object... And we don't understand why, because the conditions under which we can reproduce it are completely different.

As CloudStack does not yet have a good implementation of the snapshot lifecycle, it sometimes happens that some volumes are already marked as EXPUNGED in the DB, and CloudStack then tries to delete the base volume, first trying to unprotect its snapshot. Sure enough, unprotecting fails and a normal exception is returned (it fails because the snapshot has children...):
)2015-10-29 09:02:19,401 DEBUG [kvm.resource.KVMHAMonitor] (Thread-1304:null) Executing: /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/kvmheartbeat.sh -i 10.44.253.13 -p /var/lib/libvirt/PRIMARY -m /mnt/93655746-a9ef-394d-95e9-6e62471dd39f -h 10.44.253.112015-10-29 09:02:19,412 DEBUG [kvm.resource.KVMHAMonitor] (Thread-1304:null) Execution is successful.2015-10-29 09:02:19,554 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) Unprotecting and Removing RBD snapshots of image 6789/71b1e2e9-1985-45ca-9ab6-9e5016b86b7c prior to removing the image2015-10-29 09:02:19,571 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) Succesfully connected to Ceph cluster at cephmon.anolim.net:67892015-10-29 09:02:19,608 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) Unprotecting snapshot cloudstack/71b1e2e9-1985-45ca-9ab6-9e5016b86b7c@cloudstack-base-snap2015-10-29 09:02:19,627 DEBUG [kvm.storage.KVMStorageProcessor] (agentRequest-Handler-5:null) Failed to delete volume: com.cloud.utils.exception.CloudRuntimeException: com.ceph.rbd.RbdException: Failed to unprotect snapshot cloudstack-base-snap2015-10-29 09:02:19,628 DEBUG [cloud.agent.Agent] (agentRequest-Handler-5:null) Seq 4-1921583831: { Ans: , MgmtId: 161344838950, via: 4, Ver: v1, Flags: 10, [{"com.cloud.agent.api.Answer":{"result":false,"details":"com.cloud.utils.exception.CloudRuntimeException: com.ceph.rbd.RbdException: Failed to unprotect snapshot cloudstack-base-snap","wait":0}}] }2015-10-29 09:02:25,722 DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:null) Processing command: com.cloud.agent.api.GetHostStatsCommand2015-10-29 09:02:25,722 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing: /bin/bash -c idle=$(top -b -n 1| awk -F, '/^[%]*[Cc]pu/{$0=$4; gsub(/[^0-9.,]+/,""); print }'); echo $idle2015-10-29 09:02:26,249 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Execution is 
successful.2015-10-29 09:02:26,250 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Executing: /bin/bash -c freeMem=$(free|grep cache:|awk '{print $4}');echo $freeMem2015-10-29 09:02:26,254 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-2:null) Execution is successful.BUT, after 20 minutes - agent crashed... If we remove all childs and create conditions for cloudstack to delete volume - all OK - no agent crash in 20 minutes...We can't connect this action - Volume delete with agent crashe... Also we don't understand why +- 20 minutes need to last, and only then agent crashed...From logs, before crash - only GetVMStats... And then - agent started...2015-10-29 09:21:55,143 DEBUG [cloud.agent.Agent] (UgentTask-5:null) Sending ping: Seq 4-1343: { Cmd , MgmtId: -1, via: 4, Ver: v1, Flags: 11, [{"com.cloud.agent.api.PingRoutingCommand":{"newStates":{},"_hostVmStateReport":{"i-881-1117-VM":{"state":"PowerOn","host":"cs2.anolim.net"},"i-7-106-VM":{"state":"PowerOn","host":"cs2.anolim.net"},"i-1683-1984-VM":{"state":"PowerOn","host":"cs2.anolim.net"},"i-11-504-VM":{"state":"PowerOn","host":"cs2.anolim.net"},"i-325-616-VM":{"state":"PowerOn","host":"cs2.anolim.net"},"i-10-52-VM":{"state":"PowerOn","host":"cs2.anolim.net"},"i-941-1237-VM":{"state":"PowerOn","host":"cs2.anolim.net"}},"_gatewayAccessible":true,"_vnetAccessible":true,"hostType":"Routing","hostId":4,"wait":0}}] }2015-10-29 09:21:55,149 DEBUG [cloud.agent.Agent] (Agent-Handler-3:null) Received response: Seq 4-1343: { Ans: , MgmtId: 161344838950, via: 4, Ver: v1, Flags: 100010, [{"com.cloud.agent.api.PingAnswer":{"_command":{"hostType":"Routing","hostId":4,"wait":0},"result":true,"wait":0}}] }2015-10-29 09:22:20,755 DEBUG [cloud.agent.Agent] (agentRequest-Handler-5:null) Processing command: com.cloud.agent.api.GetVmStatsCommand2015-10-29 09:23:02,630 INFO [cloud.agent.AgentShell] (main:null) Agent started2015-10-29 09:23:02,632 INFO [cloud.agent.AgentShell] (main:null) 
Implementation Version is 4.3.2Can you please give us direction to go deeper... As we dont udnerstand what happened...Issue reproducing very easy:(to speed up delete process we set storage.cleanup.interval=10)1. Create Volume2. Attach it to VM3. Create snapshot from this volume by hand on ceph4. Do snap_protect by hand on ceph for snap from previous step5. Create child via snap clone6. Deattach volume from VM and delete itOur problem - that this happen not in such lab situation, but in real life...