Hi Kenneth,

This problem is much like the one you reported before. The fix has not been
backported to 0.85, so only the master branch is free of this bug.

On Tue, Sep 16, 2014 at 9:58 PM, Gregory Farnum <greg at inktank.com> wrote:
> Heh, you'll have to talk to Haomai about issues with the
> KeyValueStore, but I know he's found a number of issues in the version
> of it that went to 0.85.
>
> In future please flag when you're running with experimental stuff; it
> helps direct attention to the right places! ;)
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> On Tue, Sep 16, 2014 at 5:28 AM, Kenneth Waegeman
> <Kenneth.Waegeman at ugent.be> wrote:
>>
>> ----- Message from Gregory Farnum <greg at inktank.com> ---------
>>    Date: Mon, 15 Sep 2014 10:37:07 -0700
>>    From: Gregory Farnum <greg at inktank.com>
>> Subject: Re: OSD troubles on FS+Tiering
>>      To: Kenneth Waegeman <Kenneth.Waegeman at ugent.be>
>>      Cc: ceph-users <ceph-users at lists.ceph.com>
>>
>>
>>> The pidfile bug is already fixed in master/giant branches.
>>>
>>> As for the crashing, I'd try killing all the osd processes and turning
>>> them back on again. It might just be some daemon restart failed, or
>>> your cluster could be sufficiently overloaded that the node disks are
>>> going unresponsive and they're suiciding, or...
>>
>>
>> I restarted them that way, and they eventually got clean again.
>> 'ceph status' warned that the 'ecdata' pool had too few PGs, so I raised
>> its pg_num from 128 to 256 (with EC k+m=11).
>> After a few minutes I checked the cluster state again:
>>
>> [root at ceph001 ~]# ceph status
>>     cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
>>      health HEALTH_WARN 100 pgs down; 155 pgs peering; 81 pgs stale; 240 pgs stuck inactive; 81 pgs stuck stale; 240 pgs stuck unclean; 746 requests are blocked > 32 sec; 'cache' at/near target max; pool ecdata pg_num 256 > pgp_num 128
>>      monmap e1: 3 mons at {ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0}, election epoch 8, quorum 0,1,2 ceph001,ceph002,ceph003
>>      mdsmap e6993: 1/1/1 up {0=ceph003=up:active}, 2 up:standby
>>      osdmap e11023: 48 osds: 14 up, 14 in
>>       pgmap v160466: 1472 pgs, 4 pools, 3899 GB data, 2374 kobjects
>>             624 GB used, 7615 GB / 8240 GB avail
>>                   75 creating
>>                 1215 active+clean
>>                  100 down+peering
>>                    1 active+clean+scrubbing
>>                   10 stale
>>                   16 stale+active+clean
>>
>> Again, 34 OSDs are down.
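The HEALTH_WARN item "pool ecdata pg_num 256 > pgp_num 128" above says the
pool's PG count was raised without also raising its placement count, so CRUSH
is not yet placing data independently for the newly split PGs. A minimal
sketch of the usual two-step adjustment with the stock ceph CLI (pool name
taken from this thread):

    # step 1: split the PGs (what was already done here)
    ceph osd pool set ecdata pg_num 256
    # step 2: raise pgp_num to match, so the new PGs get their own placements
    ceph osd pool set ecdata pgp_num 256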
>> This time I have the error log; I checked a few OSD logs.
>>
>> This is from the first host that was marked down:
>>
>>   -17> 2014-09-16 13:27:49.962938 7f5dfe6a3700  5 osd.7 pg_epoch: 8912 pg[2.b0s3(unlocked)] enter Initial
>>   -16> 2014-09-16 13:27:50.008842 7f5e02eac700  1 -- 10.143.8.180:6833/53810 <== osd.30 10.141.8.181:0/37396 2524 ==== osd_ping(ping e8912 stamp 2014-09-16 13:27:50.008514) v2 ==== 47+0+0 (3868888299 0 0) 0x18ef7080 con 0x6961600
>>   -15> 2014-09-16 13:27:50.008892 7f5e02eac700  1 -- 10.143.8.180:6833/53810 --> 10.141.8.181:0/37396 -- osd_ping(ping_reply e8912 stamp 2014-09-16 13:27:50.008514) v2 -- ?+0 0x7326900 con 0x6961600
>>   -14> 2014-09-16 13:27:50.009159 7f5e046af700  1 -- 10.141.8.180:6847/53810 <== osd.30 10.141.8.181:0/37396 2524 ==== osd_ping(ping e8912 stamp 2014-09-16 13:27:50.008514) v2 ==== 47+0+0 (3868888299 0 0) 0x2210a760 con 0xadd0420
>>   -13> 2014-09-16 13:27:50.009202 7f5e046af700  1 -- 10.141.8.180:6847/53810 --> 10.141.8.181:0/37396 -- osd_ping(ping_reply e8912 stamp 2014-09-16 13:27:50.008514) v2 -- ?+0 0x14e35a00 con 0xadd0420
>>   -12> 2014-09-16 13:27:50.034378 7f5dfeea4700  5 osd.7 pg_epoch: 8912 pg[2.71s3( v 8864'33363 (374'30362,8864'33363] local-les=813 n=16075 ec=104 les/c 813/815 805/8912/791) [24,10,8,7,45,27,30,46,38,4,23] r=3 lpr=8912 pi=104-8911/54 crt=8864'33359 inactive NOTIFY] exit Reset 0.127612 1 0.000123
>>   -11> 2014-09-16 13:27:50.034432 7f5dfeea4700  5 osd.7 pg_epoch: 8912 pg[2.71s3( v 8864'33363 (374'30362,8864'33363] local-les=813 n=16075 ec=104 les/c 813/815 805/8912/791) [24,10,8,7,45,27,30,46,38,4,23] r=3 lpr=8912 pi=104-8911/54 crt=8864'33359 inactive NOTIFY] enter Started
>>   -10> 2014-09-16 13:27:50.034452 7f5dfeea4700  5 osd.7 pg_epoch: 8912 pg[2.71s3( v 8864'33363 (374'30362,8864'33363] local-les=813 n=16075 ec=104 les/c 813/815 805/8912/791) [24,10,8,7,45,27,30,46,38,4,23] r=3 lpr=8912 pi=104-8911/54 crt=8864'33359 inactive NOTIFY] enter Start
>>    -9> 2014-09-16 13:27:50.034469 7f5dfeea4700  1 osd.7 pg_epoch: 8912 pg[2.71s3( v 8864'33363 (374'30362,8864'33363] local-les=813 n=16075 ec=104 les/c 813/815 805/8912/791) [24,10,8,7,45,27,30,46,38,4,23] r=3 lpr=8912 pi=104-8911/54 crt=8864'33359 inactive NOTIFY] state<Start>: transitioning to Stray
>>    -8> 2014-09-16 13:27:50.034491 7f5dfeea4700  5 osd.7 pg_epoch: 8912 pg[2.71s3( v 8864'33363 (374'30362,8864'33363] local-les=813 n=16075 ec=104 les/c 813/815 805/8912/791) [24,10,8,7,45,27,30,46,38,4,23] r=3 lpr=8912 pi=104-8911/54 crt=8864'33359 inactive NOTIFY] exit Start 0.000038 0 0.000000
>>    -7> 2014-09-16 13:27:50.034521 7f5dfeea4700  5 osd.7 pg_epoch: 8912 pg[2.71s3( v 8864'33363 (374'30362,8864'33363] local-les=813 n=16075 ec=104 les/c 813/815 805/8912/791) [24,10,8,7,45,27,30,46,38,4,23] r=3 lpr=8912 pi=104-8911/54 crt=8864'33359 inactive NOTIFY] enter Started/Stray
>>    -6> 2014-09-16 13:27:50.034664 7f5dfeea4700  5 osd.7 pg_epoch: 8912 pg[2.7s10( v 8890'35265 (374'32264,8890'35265] local-les=816 n=32002 ec=104 les/c 816/818 805/814/730) [6,30,22,13,39,15,12,5,11,42,7] r=10 lpr=814 pi=104-813/36 luod=0'0 crt=8885'35261 active] exit Started/ReplicaActive/RepNotRecovering 7944.878905 22472 0.038180
>>    -5> 2014-09-16 13:27:50.034689 7f5dfeea4700  5 osd.7 pg_epoch: 8912 pg[2.7s10( v 8890'35265 (374'32264,8890'35265] local-les=816 n=32002 ec=104 les/c 816/818 805/814/730) [6,30,22,13,39,15,12,5,11,42,7] r=10 lpr=814 pi=104-813/36 luod=0'0 crt=8885'35261 active] exit Started/ReplicaActive 7944.878946 0 0.000000
>>    -4> 2014-09-16 13:27:50.034711 7f5dfeea4700  5 osd.7 pg_epoch: 8912 pg[2.7s10( v 8890'35265 (374'32264,8890'35265] local-les=816 n=32002 ec=104 les/c 816/818 805/814/730) [6,30,22,13,39,15,12,5,11,42,7] r=10 lpr=814 pi=104-813/36 luod=0'0 crt=8885'35261 active] exit Started 7945.924923 0 0.000000
>>    -3> 2014-09-16 13:27:50.034732 7f5dfeea4700  5 osd.7 pg_epoch: 8912 pg[2.7s10( v 8890'35265 (374'32264,8890'35265] local-les=816 n=32002 ec=104 les/c 816/818 805/814/730) [6,30,22,13,39,15,12,5,11,42,7] r=10 lpr=814 pi=104-813/36 luod=0'0 crt=8885'35261 active] enter Reset
>>    -2> 2014-09-16 13:27:50.034869 7f5dfeea4700  5 osd.7 pg_epoch: 8912 pg[2.87s10(unlocked)] enter Initial
>>    -1> 2014-09-16 13:27:50.042055 7f5e11981700  5 osd.7 8912 tick
>>     0> 2014-09-16 13:27:50.045856 7f5e1015f700 -1 *** Caught signal (Aborted) **
>>  in thread 7f5e1015f700
>>
>>  ceph version 0.85 (a0c22842db9eaee9840136784e94e50fabe77187)
>>  1: /usr/bin/ceph-osd() [0xa72096]
>>  2: (()+0xf130) [0x7f5e193d7130]
>>  3: (gsignal()+0x39) [0x7f5e17dd5989]
>>  4: (abort()+0x148) [0x7f5e17dd7098]
>>  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f5e186e89d5]
>>  6: (()+0x5e946) [0x7f5e186e6946]
>>  7: (()+0x5e973) [0x7f5e186e6973]
>>  8: (()+0x5eb9f) [0x7f5e186e6b9f]
>>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1ef) [0xb5e58f]
>>  10: (GenericObjectMap::list_objects(coll_t const&, ghobject_t, int, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x45e) [0xa3b1ee]
>>  11: (KeyValueStore::collection_list_partial(coll_t, ghobject_t, int, int, snapid_t, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x274) [0x9042d4]
>>  12: (KeyValueStore::_split_collection(coll_t, unsigned int, unsigned int, coll_t, KeyValueStore::BufferTransaction&)+0x421) [0x91f091]
>>  13: (KeyValueStore::_do_transaction(ObjectStore::Transaction&, KeyValueStore::BufferTransaction&, ThreadPool::TPHandle*)+0xa4c) [0x920f2c]
>>  14: (KeyValueStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x13f) [0x92385f]
>>  15: (KeyValueStore::_do_op(KeyValueStore::OpSequencer*, ThreadPool::TPHandle&)+0xac) [0x923a7c]
>>  16: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb10) [0xb4ef50]
>>  17: (ThreadPool::WorkThread::entry()+0x10) [0xb50040]
>>  18: (()+0x7df3) [0x7f5e193cfdf3]
>>  19: (clone()+0x6d) [0x7f5e17e963dd]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
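The abort above is a failed assertion reached from GenericObjectMap::list_objects()
(frames 9-10) while KeyValueStore::_split_collection() enumerates a collection
(frame 12), which lines up with the PG splitting triggered by the pg_num
increase earlier in this message. To map the raw frame addresses to source
lines one can follow the trace's own NOTE; a hedged sketch, assuming the
packaged binary path shown in frame 1 (the output file name is arbitrary):

    # disassemble with interleaved source, as the NOTE suggests,
    # then look up a frame address, e.g. 0xa3b1ee from frame 10
    objdump -rdS /usr/bin/ceph-osd > ceph-osd.dis
    grep -n 'a3b1ee' ceph-osd.dis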
>>
>>
>> On some hosts that crashed afterwards:
>>
>> =0 cs=0 l=1 c=0x139ed280).connect error 10.141.8.181:6822/36408, (111) Connection refused
>>    -7> 2014-09-16 13:28:36.028858 7f74eb007700  2 -- 10.141.8.180:0/52318 >> 10.141.8.181:6822/36408 pipe(0x18147bc0 sd=41 :0 s=1 pgs=0 cs=0 l=1 c=0x139ed280).fault (111) Connection refused
>>    -6> 2014-09-16 13:28:36.029423 7f74e4c96700  2 -- 10.141.8.180:0/52318 >> 10.143.8.181:6815/36408 pipe(0x18147640 sd=81 :0 s=1 pgs=0 cs=0 l=1 c=0x139e91e0).connect error 10.143.8.181:6815/36408, (111) Connection refused
>>    -5> 2014-09-16 13:28:36.029443 7f74e4c96700  2 -- 10.141.8.180:0/52318 >> 10.143.8.181:6815/36408 pipe(0x18147640 sd=81 :0 s=1 pgs=0 cs=0 l=1 c=0x139e91e0).fault (111) Connection refused
>>    -4> 2014-09-16 13:28:36.101914 7f7509534700  1 -- 10.143.8.180:6801/52318 <== osd.32 10.141.8.182:0/54784 2520 ==== osd_ping(ping e8964 stamp 2014-09-16 13:28:36.101604) v2 ==== 47+0+0 (411091961 0 0) 0x189b50a0 con 0x14a0f7a0
>>    -3> 2014-09-16 13:28:36.101952 7f7509534700  1 -- 10.143.8.180:6801/52318 --> 10.141.8.182:0/54784 -- osd_ping(ping_reply e8941 stamp 2014-09-16 13:28:36.101604) v2 -- ?+0 0x1a0feea0 con 0x14a0f7a0
>>    -2> 2014-09-16 13:28:36.101950 7f750ad37700  1 -- 10.141.8.180:6801/52318 <== osd.32 10.141.8.182:0/54784 2520 ==== osd_ping(ping e8964 stamp 2014-09-16 13:28:36.101604) v2 ==== 47+0+0 (411091961 0 0) 0x1178cce0 con 0x143944c0
>>    -1> 2014-09-16 13:28:36.102005 7f750ad37700  1 -- 10.141.8.180:6801/52318 --> 10.141.8.182:0/54784 -- osd_ping(ping_reply e8941 stamp 2014-09-16 13:28:36.101604) v2 -- ?+0 0x14b0f440 con 0x143944c0
>>     0> 2014-09-16 13:28:36.183818 7f751681f700 -1 os/GenericObjectMap.cc: In function 'int GenericObjectMap::list_objects(const coll_t&, ghobject_t, int, std::vector<ghobject_t>*, ghobject_t*)' thread 7f751681f700 time 2014-09-16 13:28:36.181333
>> os/GenericObjectMap.cc: 1094: FAILED assert(start <= header.oid)
>>
>> ceph version 0.85 (a0c22842db9eaee9840136784e94e50fabe77187)
>> 1: (GenericObjectMap::list_objects(coll_t const&, ghobject_t, int, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x45e) [0xa3b1ee]
>> 2: (KeyValueStore::collection_list_partial(coll_t, ghobject_t, int, int, snapid_t, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x274) [0x9042d4]
>> 3: (KeyValueStore::_split_collection(coll_t, unsigned int, unsigned int, coll_t, KeyValueStore::BufferTransaction&)+0x421) [0x91f091]
>> 4: (KeyValueStore::_do_transaction(ObjectStore::Transaction&, KeyValueStore::BufferTransaction&, ThreadPool::TPHandle*)+0xa4c) [0x920f2c]
>> 5: (KeyValueStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x13f) [0x92385f]
>> 6: (KeyValueStore::_do_op(KeyValueStore::OpSequencer*, ThreadPool::TPHandle&)+0xac) [0x923a7c]
>> 7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb10) [0xb4ef50]
>> 8: (ThreadPool::WorkThread::entry()+0x10) [0xb50040]
>> 9: (()+0x7df3) [0x7f7520317df3]
>> 10: (clone()+0x6d) [0x7f751edde3dd]
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>
>>
>> I tried to restart the crashed OSDs again, but they now always fail instantly with the stack trace above.
>>
>> Any ideas about this?
>> Thanks a lot!
>> Kenneth
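os/GenericObjectMap.cc:1094 makes the broken invariant explicit: list_objects()
scans the header index from a start object and expects every header it reads
back to sort at or after that start. A mis-sorted or stale index entry
persisted during an earlier split would therefore trip the assert on every
scan, which would explain why these OSDs now abort instantly on each restart.
Below is a deliberately simplified C++ sketch of that invariant, not the
actual Ceph source; ghobject_t is reduced to a single integer key:

    // Simplified sketch of the invariant behind
    // "FAILED assert(start <= header.oid)" -- not the real Ceph code.
    #include <cassert>
    #include <map>
    #include <vector>

    struct ghobject_t {
        long key;  // stand-in for the real composite object id
        bool operator<=(const ghobject_t &o) const { return key <= o.key; }
    };

    struct Header { ghobject_t oid; };

    // The object map's index: map keys must sort consistently with the
    // header contents. If a collection split rewrites keys and headers
    // inconsistently, the two can disagree -- which this assert catches.
    static std::map<long, Header> index;

    std::vector<ghobject_t> list_objects(ghobject_t start, int max) {
        std::vector<ghobject_t> out;
        for (auto it = index.lower_bound(start.key);
             it != index.end() && (int)out.size() < max; ++it) {
            // the assertion that keeps failing on these OSDs
            assert(start <= it->second.oid);
            out.push_back(it->second.oid);
        }
        return out;
    }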
>>
>>
>>> -Greg
>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>
>>>
>>> On Mon, Sep 15, 2014 at 5:43 AM, Kenneth Waegeman
>>> <Kenneth.Waegeman at ugent.be> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I have some strange OSD problems. Before the weekend I started some rsync
>>>> tests over CephFS, on a cache pool with an underlying EC KV pool. Today the
>>>> cluster is completely degraded:
>>>>
>>>> [root at ceph003 ~]# ceph status
>>>>     cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
>>>>      health HEALTH_WARN 19 pgs backfill_toofull; 403 pgs degraded; 168 pgs down; 8 pgs incomplete; 168 pgs peering; 61 pgs stale; 403 pgs stuck degraded; 176 pgs stuck inactive; 61 pgs stuck stale; 589 pgs stuck unclean; 403 pgs stuck undersized; 403 pgs undersized; 300 requests are blocked > 32 sec; recovery 15170/27902361 objects degraded (0.054%); 1922/27902361 objects misplaced (0.007%); 1 near full osd(s)
>>>>      monmap e1: 3 mons at {ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0}, election epoch 8, quorum 0,1,2 ceph001,ceph002,ceph003
>>>>      mdsmap e5: 1/1/1 up {0=ceph003=up:active}, 2 up:standby
>>>>      osdmap e719: 48 osds: 18 up, 18 in
>>>>       pgmap v144887: 1344 pgs, 4 pools, 4139 GB data, 2624 kobjects
>>>>             2282 GB used, 31397 GB / 33680 GB avail
>>>>             15170/27902361 objects degraded (0.054%); 1922/27902361 objects misplaced (0.007%)
>>>>                   68 down+remapped+peering
>>>>                    1 active
>>>>                  754 active+clean
>>>>                    1 stale+incomplete
>>>>                    1 stale+active+clean+scrubbing
>>>>                   14 active+undersized+degraded+remapped
>>>>                    7 incomplete
>>>>                  100 down+peering
>>>>                    9 active+remapped
>>>>                   59 stale+active+undersized+degraded
>>>>                   19 active+undersized+degraded+remapped+backfill_toofull
>>>>                  311 active+undersized+degraded
>>>>
>>>> I tried to figure out what happened from the global logs:
>>>>
>>>> 2014-09-13 08:01:19.433313 mon.0 10.141.8.180:6789/0 66076 : [INF] pgmap v65892: 1344 pgs: 1344 active+clean; 2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 4159 kB/s wr, 45 op/s
>>>> 2014-09-13 08:01:20.443019 mon.0 10.141.8.180:6789/0 66078 : [INF] pgmap v65893: 1344 pgs: 1344 active+clean; 2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 561 kB/s wr, 11 op/s
>>>> 2014-09-13 08:01:20.777988 mon.0 10.141.8.180:6789/0 66081 : [INF] osd.19 10.141.8.181:6809/29664 failed (3 reports from 3 peers after 20.000079 >= grace 20.000000)
>>>> 2014-09-13 08:01:21.455887 mon.0 10.141.8.180:6789/0 66083 : [INF] osdmap e117: 48 osds: 47 up, 48 in
>>>> 2014-09-13 08:01:21.462084 mon.0 10.141.8.180:6789/0 66084 : [INF] pgmap v65894: 1344 pgs: 1344 active+clean; 2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 1353 kB/s wr, 13 op/s
>>>> 2014-09-13 08:01:21.477007 mon.0 10.141.8.180:6789/0 66085 : [INF] pgmap v65895: 1344 pgs: 187 stale+active+clean, 1157 active+clean; 2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 2300 kB/s wr, 21 op/s
>>>> 2014-09-13 08:01:22.456055 mon.0 10.141.8.180:6789/0 66086 : [INF] osdmap e118: 48 osds: 47 up, 48 in
>>>> 2014-09-13 08:01:22.462590 mon.0 10.141.8.180:6789/0 66087 : [INF] pgmap v65896: 1344 pgs: 187 stale+active+clean, 1157 active+clean; 2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 13686 kB/s wr, 5 op/s
>>>> 2014-09-13 08:01:23.464302 mon.0 10.141.8.180:6789/0 66088 : [INF] pgmap v65897: 1344 pgs: 187 stale+active+clean, 1157 active+clean; 2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 11075 kB/s wr, 4 op/s
>>>> 2014-09-13 08:01:24.477467 mon.0 10.141.8.180:6789/0 66089 : [INF] pgmap v65898: 1344 pgs: 187 stale+active+clean, 1157 active+clean; 2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 4932 kB/s wr, 38 op/s
>>>> 2014-09-13 08:01:25.481027 mon.0 10.141.8.180:6789/0 66090 : [INF] pgmap v65899: 1344 pgs: 187 stale+active+clean, 1157 active+clean; 2606 GB data, 3116 GB used, 126 TB / 129 TB avail; 5726 kB/s wr, 64 op/s
>>>> 2014-09-13 08:01:19.336173 osd.1 10.141.8.180:6803/26712 54442 : [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.000137 secs
>>>> 2014-09-13 08:01:19.336341 osd.1 10.141.8.180:6803/26712 54443 : [WRN] slow request 30.000137 seconds old, received at 2014-09-13 08:00:49.335339: osd_op(client.7448.1:17751783 10000203eac.0000000e [write 0~319488 [1 at -1],startsync 0~0] 1.b6c3a3a9 snapc 1=[] ondisk+write e116) currently reached pg
>>>> 2014-09-13 08:01:20.337602 osd.1 10.141.8.180:6803/26712 54444 : [WRN] 7 slow requests, 6 included below; oldest blocked for > 31.001947 secs
>>>> 2014-09-13 08:01:20.337688 osd.1 10.141.8.180:6803/26712 54445 : [WRN] slow request 30.998110 seconds old, received at 2014-09-13 08:00:49.339176: osd_op(client.7448.1:17751787 10000203eac.0000000e [write 319488~65536 [1 at -1],startsync 0~0]
>>>>
>>>> This is happening OSD after OSD.
>>>>
>>>> I checked the individual OSD logs too, but they all stop abruptly (even those of the OSDs that are still running):
>>>>
>>>> 2014-09-12 14:25:51.205276 7f3517209700 0 log [WRN] : 41 slow requests, 1 included below; oldest blocked for > 38.118088 secs
>>>> 2014-09-12 14:25:51.205337 7f3517209700 0 log [WRN] : slow request 36.558286 seconds old, received at 2014-09-12 14:25:14.646836: osd_op(client.7448.1:2458392 1000006328f.0000000b [write 3989504~204800 [1 at -1],startsync 0~0] 1.9337bf4b snapc 1=[] ondisk+write e116) currently reached pg
>>>> 2014-09-12 14:25:53.205586 7f3517209700 0 log [WRN] : 30 slow requests, 1 included below; oldest blocked for > 40.118530 secs
>>>> 2014-09-12 14:25:53.205679 7f3517209700 0 log [WRN] : slow request 30.541026 seconds old, received at 2014-09-12 14:25:22.664538: osd_op(client.7448.1:2460291 100000632b7.00000000 [write 0~691 [1 at -1],startsync 0~0] 1.994248a8 snapc 1=[] ondisk+write e116) currently reached pg
>>>> 2014-09-12 17:52:40.503917 7f34e8ed2700 0 -- 10.141.8.181:6809/29664 >> 10.141.8.181:6847/62389 pipe(0x247ce040 sd=327 :6809 s=0 pgs=0 cs=0 l=1 c=0x1bc8b9c0).accept replacing existing (lossy) channel (new one lossy=1)
>>>>
>>>> I *think* the truncated logs are related to another issue I just found
>>>> (http://tracker.ceph.com/issues/9470).
>>>>
>>>> So I can't find the original problem in the log files.
>>>>
>>>> Is there any other way I can find out what started the crashing of 30 OSDs?
>>>>
>>>> Thanks!!
>>>>
>>>> Kenneth
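Since the individual logs are being cut short (the tracker issue above), one
way to capture more detail around the next failure is to raise debug levels
at runtime, so the in-memory log each OSD dumps when it crashes is richer. A
hedged sketch using standard admin commands; osd.19 is the first OSD the
monitors marked failed above, and the subsystems/levels are just reasonable
choices, not prescriptive:

    # raise logging on a suspect OSD without restarting it
    ceph tell osd.19 injectargs '--debug-osd 20 --debug-ms 1'

    # or run a crashing OSD in the foreground and keep its stderr
    ceph-osd -i 19 -f 2>&1 | tee osd.19.foreground.log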
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users at lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>> ----- End message from Gregory Farnum <greg at inktank.com> -----
>>
>> --
>>
>> Kind regards,
>> Kenneth Waegeman
>>
>>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Best Regards,
Wheat