OSD troubles on FS+Tiering

Heh, you'll have to talk to Haomai about issues with the KeyValueStore,
but I know he's found a number of problems in the version of it that
went into 0.85.

In future please flag when you're running with experimental stuff; it
helps direct attention to the right places! ;)
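
For anyone hitting this thread later: the experimental piece here is the
KeyValueStore OSD backend, which (if memory serves) gets selected in ceph.conf
with something along these lines -- the exact option value has moved around
between releases, so treat it as a sketch:

    [osd]
    osd objectstore = keyvaluestore-dev
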
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Tue, Sep 16, 2014 at 5:28 AM, Kenneth Waegeman
<Kenneth.Waegeman at ugent.be> wrote:
>
> ----- Message from Gregory Farnum <greg at inktank.com> ---------
>    Date: Mon, 15 Sep 2014 10:37:07 -0700
>    From: Gregory Farnum <greg at inktank.com>
> Subject: Re: OSD troubles on FS+Tiering
>      To: Kenneth Waegeman <Kenneth.Waegeman at ugent.be>
>      Cc: ceph-users <ceph-users at lists.ceph.com>
>
>
>> The pidfile bug is already fixed in master/giant branches.
>>
>> As for the crashing, I'd try killing all the osd processes and turning
>> them back on again. It might just be some daemon restart failed, or
>> your cluster could be sufficiently overloaded that the node disks are
>> going unresponsive and they're suiciding, or...
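>>
>> On a sysvinit-style install that's roughly the following on each OSD node
>> (adjust to whatever init system you're using):
>>
>>   /etc/init.d/ceph stop osd
>>   /etc/init.d/ceph start osd
>>   ceph -w          # watch the PGs peer and go active+clean again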
>
>
> I restarted them that way, and they eventually got clean again.
> 'ceph status' reported that the 'ecdata' pool had too few PGs, so I increased
> pg_num from 128 to 256 (the pool uses EC with k+m=11).
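>
> For reference, the change itself was roughly:
>
>   ceph osd pool set ecdata pg_num 256
>
> (pgp_num was still 128 at that point, as the status below also warns about;
> matching it would be something like 'ceph osd pool set ecdata pgp_num 256'.)
>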
> After a few minutes I checked the cluster state again:
>
> [root at ceph001 ~]# ceph status
>     cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
>      health HEALTH_WARN 100 pgs down; 155 pgs peering; 81 pgs stale; 240 pgs
> stuck inactive; 81 pgs stuck stale; 240 pgs stuck unclean; 746 requests are
> blocked > 32 sec; 'cache' at/near target max; pool ecdata pg_num 256 >
> pgp_num 128
>      monmap e1: 3 mons at
> {ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0},
> election epoch 8, quorum 0,1,2 ceph001,ceph002,ceph003
>      mdsmap e6993: 1/1/1 up {0=ceph003=up:active}, 2 up:standby
>      osdmap e11023: 48 osds: 14 up, 14 in
>       pgmap v160466: 1472 pgs, 4 pools, 3899 GB data, 2374 kobjects
>             624 GB used, 7615 GB / 8240 GB avail
>                   75 creating
>                 1215 active+clean
>                  100 down+peering
>                    1 active+clean+scrubbing
>                   10 stale
>                   16 stale+active+clean
>
> Again, 34 OSDs are down. This time I have the error logs; I checked a few of
> the OSD logs:
>
> I checked the first OSD that was marked down:
>
>    -17> 2014-09-16 13:27:49.962938 7f5dfe6a3700  5 osd.7 pg_epoch: 8912
> pg[2.b0s3(unlocked)] enter Initial
>    -16> 2014-09-16 13:27:50.008842 7f5e02eac700  1 --
> 10.143.8.180:6833/53810 <== osd.30 10.141.8.181:0/37396 2524 ====
> osd_ping(ping e8912 stamp 2014-09-16 13:27:50.008514) v2 ==== 47+0+0
> (3868888299 0 0) 0x18ef7080 con 0x6961600
>    -15> 2014-09-16 13:27:50.008892 7f5e02eac700  1 --
> 10.143.8.180:6833/53810 --> 10.141.8.181:0/37396 -- osd_ping(ping_reply
> e8912 stamp 2014-09-16 13:27:50.008514) v2 -- ?+0 0x7326900 con 0x6961600
>    -14> 2014-09-16 13:27:50.009159 7f5e046af700  1 --
> 10.141.8.180:6847/53810 <== osd.30 10.141.8.181:0/37396 2524 ====
> osd_ping(ping e8912 stamp 2014-09-16 13:27:50.008514) v2 ==== 47+0+0
> (3868888299 0 0) 0x2210a760 con 0xadd0420
>    -13> 2014-09-16 13:27:50.009202 7f5e046af700  1 --
> 10.141.8.180:6847/53810 --> 10.141.8.181:0/37396 -- osd_ping(ping_reply
> e8912 stamp 2014-09-16 13:27:50.008514) v2 -- ?+0 0x14e35a00 con 0xadd0420
>    -12> 2014-09-16 13:27:50.034378 7f5dfeea4700  5 osd.7 pg_epoch: 8912
> pg[2.71s3( v 8864'33363 (374'30362,8864'33363] local-les=813 n=16075 ec=104
> les/c 813/815 805/8912/791) [24,10,8,7,45,27,30,46,38,4,23] r=3 lpr=8912
> pi=104-8911/54 crt=8864'33359 inactive NOTIFY] exit Reset 0.127612 1
> 0.000123
>    -11> 2014-09-16 13:27:50.034432 7f5dfeea4700  5 osd.7 pg_epoch: 8912
> pg[2.71s3( v 8864'33363 (374'30362,8864'33363] local-les=813 n=16075 ec=104
> les/c 813/815 805/8912/791) [24,10,8,7,45,27,30,46,38,4,23] r=3 lpr=8912
> pi=104-8911/54 crt=8864'33359 inactive NOTIFY] enter Started
>    -10> 2014-09-16 13:27:50.034452 7f5dfeea4700  5 osd.7 pg_epoch: 8912
> pg[2.71s3( v 8864'33363 (374'30362,8864'33363] local-les=813 n=16075 ec=104
> les/c 813/815 805/8912/791) [24,10,8,7,45,27,30,46,38,4,23] r=3 lpr=8912
> pi=104-8911/54 crt=8864'33359 inactive NOTIFY] enter Start
>     -9> 2014-09-16 13:27:50.034469 7f5dfeea4700  1 osd.7 pg_epoch: 8912
> pg[2.71s3( v 8864'33363 (374'30362,8864'33363] local-les=813 n=16075 ec=104
> les/c 813/815 805/8912/791) [24,10,8,7,45,27,30,46,38,4,23] r=3 lpr=8912
> pi=104-8911/54 crt=8864'33359 inactive NOTIFY] state<Start>: transitioning
> to Stray
>     -8> 2014-09-16 13:27:50.034491 7f5dfeea4700  5 osd.7 pg_epoch: 8912
> pg[2.71s3( v 8864'33363 (374'30362,8864'33363] local-les=813 n=16075 ec=104
> les/c 813/815 805/8912/791) [24,10,8,7,45,27,30,46,38,4,23] r=3 lpr=8912
> pi=104-8911/54 crt=8864'33359 inactive NOTIFY] exit Start 0.000038 0
> 0.000000
>     -7> 2014-09-16 13:27:50.034521 7f5dfeea4700  5 osd.7 pg_epoch: 8912
> pg[2.71s3( v 8864'33363 (374'30362,8864'33363] local-les=813 n=16075 ec=104
> les/c 813/815 805/8912/791) [24,10,8,7,45,27,30,46,38,4,23] r=3 lpr=8912
> pi=104-8911/54 crt=8864'33359 inactive NOTIFY] enter Started/Stray
>     -6> 2014-09-16 13:27:50.034664 7f5dfeea4700  5 osd.7 pg_epoch: 8912
> pg[2.7s10( v 8890'35265 (374'32264,8890'35265] local-les=816 n=32002 ec=104
> les/c 816/818 805/814/730) [6,30,22,13,39,15,12,5,11,42,7] r=10 lpr=814
> pi=104-813/36 luod=0'0 crt=8885'35261 active] exit
> Started/ReplicaActive/RepNotRecovering 7944.878905 22472 0.038180
>     -5> 2014-09-16 13:27:50.034689 7f5dfeea4700  5 osd.7 pg_epoch: 8912
> pg[2.7s10( v 8890'35265 (374'32264,8890'35265] local-les=816 n=32002 ec=104
> les/c 816/818 805/814/730) [6,30,22,13,39,15,12,5,11,42,7] r=10 lpr=814
> pi=104-813/36 luod=0'0 crt=8885'35261 active] exit Started/ReplicaActive
> 7944.878946 0 0.000000
>     -4> 2014-09-16 13:27:50.034711 7f5dfeea4700  5 osd.7 pg_epoch: 8912
> pg[2.7s10( v 8890'35265 (374'32264,8890'35265] local-les=816 n=32002 ec=104
> les/c 816/818 805/814/730) [6,30,22,13,39,15,12,5,11,42,7] r=10 lpr=814
> pi=104-813/36 luod=0'0 crt=8885'35261 active] exit Started 7945.924923 0
> 0.000000
>     -3> 2014-09-16 13:27:50.034732 7f5dfeea4700  5 osd.7 pg_epoch: 8912
> pg[2.7s10( v 8890'35265 (374'32264,8890'35265] local-les=816 n=32002 ec=104
> les/c 816/818 805/814/730) [6,30,22,13,39,15,12,5,11,42,7] r=10 lpr=814
> pi=104-813/36 luod=0'0 crt=8885'35261 active] enter Reset
>     -2> 2014-09-16 13:27:50.034869 7f5dfeea4700  5 osd.7 pg_epoch: 8912
> pg[2.87s10(unlocked)] enter Initial
>     -1> 2014-09-16 13:27:50.042055 7f5e11981700  5 osd.7 8912 tick
>      0> 2014-09-16 13:27:50.045856 7f5e1015f700 -1 *** Caught signal
> (Aborted) **
>  in thread 7f5e1015f700
>
>  ceph version 0.85 (a0c22842db9eaee9840136784e94e50fabe77187)
>  1: /usr/bin/ceph-osd() [0xa72096]
>  2: (()+0xf130) [0x7f5e193d7130]
>  3: (gsignal()+0x39) [0x7f5e17dd5989]
>  4: (abort()+0x148) [0x7f5e17dd7098]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f5e186e89d5]
>  6: (()+0x5e946) [0x7f5e186e6946]
>  7: (()+0x5e973) [0x7f5e186e6973]
>  8: (()+0x5eb9f) [0x7f5e186e6b9f]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x1ef) [0xb5e58f]
>  10: (GenericObjectMap::list_objects(coll_t const&, ghobject_t, int,
> std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x45e)
> [0xa3b1ee]
>  11: (KeyValueStore::collection_list_partial(coll_t, ghobject_t, int, int,
> snapid_t, std::vector<ghobject_t, std::allocator<ghobject_t> >*,
> ghobject_t*)+0x274) [0x9042d4]
>  12: (KeyValueStore::_split_collection(coll_t, unsigned int, unsigned int,
> coll_t, KeyValueStore::BufferTransaction&)+0x421) [0x91f091]
>  13: (KeyValueStore::_do_transaction(ObjectStore::Transaction&,
> KeyValueStore::BufferTransaction&, ThreadPool::TPHandle*)+0xa4c) [0x920f2c]
>  14: (KeyValueStore::_do_transactions(std::list<ObjectStore::Transaction*,
> std::allocator<ObjectStore::Transaction*> >&, unsigned long,
> ThreadPool::TPHandle*)+0x13f) [0x92385f]
>  15: (KeyValueStore::_do_op(KeyValueStore::OpSequencer*,
> ThreadPool::TPHandle&)+0xac) [0x923a7c]
>  16: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb10) [0xb4ef50]
>  17: (ThreadPool::WorkThread::entry()+0x10) [0xb50040]
>  18: (()+0x7df3) [0x7f5e193cfdf3]
>  19: (clone()+0x6d) [0x7f5e17e963dd]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
> interpret this.
>
>
> On some of the OSDs that crashed afterwards, the logs show:
>
> =0 cs=0 l=1 c=0x139ed280).connect error 10.141.8.181:6822/36408, (111) Connection refused
>     -7> 2014-09-16 13:28:36.028858 7f74eb007700  2 -- 10.141.8.180:0/52318 >> 10.141.8.181:6822/36408 pipe(0x18147bc0 sd=41 :0 s=1 pgs=0 cs=0 l=1 c=0x139ed280).fault (111) Connection refused
>     -6> 2014-09-16 13:28:36.029423 7f74e4c96700  2 -- 10.141.8.180:0/52318 >> 10.143.8.181:6815/36408 pipe(0x18147640 sd=81 :0 s=1 pgs=0 cs=0 l=1 c=0x139e91e0).connect error 10.143.8.181:6815/36408, (111) Connection refused
>     -5> 2014-09-16 13:28:36.029443 7f74e4c96700  2 -- 10.141.8.180:0/52318 >> 10.143.8.181:6815/36408 pipe(0x18147640 sd=81 :0 s=1 pgs=0 cs=0 l=1 c=0x139e91e0).fault (111) Connection refused
>     -4> 2014-09-16 13:28:36.101914 7f7509534700  1 -- 10.143.8.180:6801/52318 <== osd.32 10.141.8.182:0/54784 2520 ==== osd_ping(ping e8964 stamp 2014-09-16 13:28:36.101604) v2 ==== 47+0+0 (411091961 0 0) 0x189b50a0 con 0x14a0f7a0
>     -3> 2014-09-16 13:28:36.101952 7f7509534700  1 -- 10.143.8.180:6801/52318 --> 10.141.8.182:0/54784 -- osd_ping(ping_reply e8941 stamp 2014-09-16 13:28:36.101604) v2 -- ?+0 0x1a0feea0 con 0x14a0f7a0
>     -2> 2014-09-16 13:28:36.101950 7f750ad37700  1 -- 10.141.8.180:6801/52318 <== osd.32 10.141.8.182:0/54784 2520 ==== osd_ping(ping e8964 stamp 2014-09-16 13:28:36.101604) v2 ==== 47+0+0 (411091961 0 0) 0x1178cce0 con 0x143944c0
>     -1> 2014-09-16 13:28:36.102005 7f750ad37700  1 -- 10.141.8.180:6801/52318 --> 10.141.8.182:0/54784 -- osd_ping(ping_reply e8941 stamp 2014-09-16 13:28:36.101604) v2 -- ?+0 0x14b0f440 con 0x143944c0
>      0> 2014-09-16 13:28:36.183818 7f751681f700 -1 os/GenericObjectMap.cc: In function 'int GenericObjectMap::list_objects(const coll_t&, ghobject_t, int, std::vector<ghobject_t>*, ghobject_t*)' thread 7f751681f700 time 2014-09-16 13:28:36.181333
> os/GenericObjectMap.cc: 1094: FAILED assert(start <= header.oid)
>
>  ceph version 0.85 (a0c22842db9eaee9840136784e94e50fabe77187)
>  1: (GenericObjectMap::list_objects(coll_t const&, ghobject_t, int, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x45e) [0xa3b1ee]
>  2: (KeyValueStore::collection_list_partial(coll_t, ghobject_t, int, int, snapid_t, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x274) [0x9042d4]
>  3: (KeyValueStore::_split_collection(coll_t, unsigned int, unsigned int, coll_t, KeyValueStore::BufferTransaction&)+0x421) [0x91f091]
>  4: (KeyValueStore::_do_transaction(ObjectStore::Transaction&, KeyValueStore::BufferTransaction&, ThreadPool::TPHandle*)+0xa4c) [0x920f2c]
>  5: (KeyValueStore::_do_transactions(std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, unsigned long, ThreadPool::TPHandle*)+0x13f) [0x92385f]
>  6: (KeyValueStore::_do_op(KeyValueStore::OpSequencer*, ThreadPool::TPHandle&)+0xac) [0x923a7c]
>  7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb10) [0xb4ef50]
>  8: (ThreadPool::WorkThread::entry()+0x10) [0xb50040]
>  9: (()+0x7df3) [0x7f7520317df3]
>  10: (clone()+0x6d) [0x7f751edde3dd]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
> interpret this.
>
>
> I tried to restart the crashed OSDs again, but they now fail instantly every
> time with the above stack trace.
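>
> Since it crashes the same way every time, I can run one of the failed OSDs in
> the foreground with more logging if that would help, e.g. something like
> (osd.7 is just the first one that crashed):
>
>   ceph-osd -i 7 -f --debug-osd 20 --debug-ms 1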
>
> Any ideas with this?
> Thanks a lot!
> Kenneth
>
>
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>>
>> On Mon, Sep 15, 2014 at 5:43 AM, Kenneth Waegeman
>> <Kenneth.Waegeman at ugent.be> wrote:
>>>
>>> Hi,
>>>
>>> I have some strange OSD problems. Before the weekend I started some rsync
>>> tests over CephFS, on a cache pool with an underlying EC KeyValueStore pool.
>>> Today the cluster is completely degraded:
>>>
>>> [root at ceph003 ~]# ceph status
>>>     cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
>>>      health HEALTH_WARN 19 pgs backfill_toofull; 403 pgs degraded; 168
>>> pgs
>>> down; 8 pgs incomplete; 168 pgs peering; 61 pgs stale; 403 pgs stuck
>>> degraded; 176 pgs stuck inactive; 61 pgs stuck stale; 589 pgs stuck
>>> unclean;
>>> 403 pgs stuck undersized; 403 pgs undersized; 300 requests are blocked >
>>> 32
>>> sec; recovery 15170/27902361 objects degraded (0.054%); 1922/27902361
>>> objects misplaced (0.007%); 1 near full osd(s)
>>>      monmap e1: 3 mons at
>>>
>>> {ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0},
>>> election epoch 8, quorum 0,1,2 ceph001,ceph002,ceph003
>>>      mdsmap e5: 1/1/1 up {0=ceph003=up:active}, 2 up:standby
>>>      osdmap e719: 48 osds: 18 up, 18 in
>>>       pgmap v144887: 1344 pgs, 4 pools, 4139 GB data, 2624 kobjects
>>>             2282 GB used, 31397 GB / 33680 GB avail
>>>             15170/27902361 objects degraded (0.054%); 1922/27902361
>>> objects
>>> misplaced (0.007%)
>>>                   68 down+remapped+peering
>>>                    1 active
>>>                  754 active+clean
>>>                    1 stale+incomplete
>>>                    1 stale+active+clean+scrubbing
>>>                   14 active+undersized+degraded+remapped
>>>                    7 incomplete
>>>                  100 down+peering
>>>                    9 active+remapped
>>>                   59 stale+active+undersized+degraded
>>>                   19 active+undersized+degraded+remapped+backfill_toofull
>>>                  311 active+undersized+degraded
>>>
>>> I tried to figure out what happened in the global logs:
>>>
>>> 2014-09-13 08:01:19.433313 mon.0 10.141.8.180:6789/0 66076 : [INF] pgmap
>>> v65892: 1344 pgs: 1344 active+clean; 2606 GB data, 3116 GB used, 126 TB /
>>> 129 TB avail; 4159 kB/s wr, 45 op/s
>>> 2014-09-13 08:01:20.443019 mon.0 10.141.8.180:6789/0 66078 : [INF] pgmap
>>> v65893: 1344 pgs: 1344 active+clean; 2606 GB data, 3116 GB used, 126 TB /
>>> 129 TB avail; 561 kB/s wr, 11 op/s
>>> 2014-09-13 08:01:20.777988 mon.0 10.141.8.180:6789/0 66081 : [INF] osd.19
>>> 10.141.8.181:6809/29664 failed (3 reports from 3 peers after 20.000079 >=
>>> grace 20.000000)
>>> 2014-09-13 08:01:21.455887 mon.0 10.141.8.180:6789/0 66083 : [INF] osdmap
>>> e117: 48 osds: 47 up, 48 in
>>> 2014-09-13 08:01:21.462084 mon.0 10.141.8.180:6789/0 66084 : [INF] pgmap
>>> v65894: 1344 pgs: 1344 active+clean; 2606 GB data, 3116 GB used, 126 TB /
>>> 129 TB avail; 1353 kB/s wr, 13 op/s
>>> 2014-09-13 08:01:21.477007 mon.0 10.141.8.180:6789/0 66085 : [INF] pgmap
>>> v65895: 1344 pgs: 187 stale+active+clean, 1157 active+clean; 2606 GB
>>> data,
>>> 3116 GB used, 126 TB / 129 TB avail; 2300 kB/s wr, 21 op/s
>>> 2014-09-13 08:01:22.456055 mon.0 10.141.8.180:6789/0 66086 : [INF] osdmap
>>> e118: 48 osds: 47 up, 48 in
>>> 2014-09-13 08:01:22.462590 mon.0 10.141.8.180:6789/0 66087 : [INF] pgmap
>>> v65896: 1344 pgs: 187 stale+active+clean, 1157 active+clean; 2606 GB
>>> data,
>>> 3116 GB used, 126 TB / 129 TB avail; 13686 kB/s wr, 5 op/s
>>> 2014-09-13 08:01:23.464302 mon.0 10.141.8.180:6789/0 66088 : [INF] pgmap
>>> v65897: 1344 pgs: 187 stale+active+clean, 1157 active+clean; 2606 GB
>>> data,
>>> 3116 GB used, 126 TB / 129 TB avail; 11075 kB/s wr, 4 op/s
>>> 2014-09-13 08:01:24.477467 mon.0 10.141.8.180:6789/0 66089 : [INF] pgmap
>>> v65898: 1344 pgs: 187 stale+active+clean, 1157 active+clean; 2606 GB
>>> data,
>>> 3116 GB used, 126 TB / 129 TB avail; 4932 kB/s wr, 38 op/s
>>> 2014-09-13 08:01:25.481027 mon.0 10.141.8.180:6789/0 66090 : [INF] pgmap
>>> v65899: 1344 pgs: 187 stale+active+clean, 1157 active+clean; 2606 GB
>>> data,
>>> 3116 GB used, 126 TB / 129 TB avail; 5726 kB/s wr, 64 op/s
>>> 2014-09-13 08:01:19.336173 osd.1 10.141.8.180:6803/26712 54442 : [WRN] 1
>>> slow requests, 1 included below; oldest blocked for > 30.000137 secs
>>> 2014-09-13 08:01:19.336341 osd.1 10.141.8.180:6803/26712 54443 : [WRN]
>>> slow
>>> request 30.000137 seconds old, received at 2014-09-13 08:00:49.335339:
>>> osd_op(client.7448.1:17751783 10000203eac.0000000e [write 0~319488
>>> [1 at -1],startsync 0~0] 1.b
>>> 6c3a3a9 snapc 1=[] ondisk+write e116) currently reached pg
>>> 2014-09-13 08:01:20.337602 osd.1 10.141.8.180:6803/26712 54444 : [WRN] 7
>>> slow requests, 6 included below; oldest blocked for > 31.001947 secs
>>> 2014-09-13 08:01:20.337688 osd.1 10.141.8.180:6803/26712 54445 : [WRN]
>>> slow
>>> request 30.998110 seconds old, received at 2014-09-13 08:00:49.339176:
>>> osd_op(client.7448.1:17751787 10000203eac.0000000e [write 319488~65536
>>> [1 at -1],startsync 0~0]
>>>
>>>
>>> This is happening OSD after OSD..
>>>
>>> I tried to check the individual OSD logs, but they all stop abruptly (also
>>> for the OSDs that are still running):
>>>
>>> 2014-09-12 14:25:51.205276 7f3517209700  0 log [WRN] : 41 slow requests,
>>> 1
>>> included below; oldest blocked for > 38.118088 secs
>>> 2014-09-12 14:25:51.205337 7f3517209700  0 log [WRN] : slow request
>>> 36.558286 seconds old, received at 2014-09-12 14:25:14.646836:
>>> osd_op(client.7448.1:2458392 1000006328f.0000000b [write 3989504~204800
>>> [1 at -1],startsync 0~0] 1.9337bf4b snapc 1=[] ondisk+write e116) currently
>>> reached pg
>>> 2014-09-12 14:25:53.205586 7f3517209700  0 log [WRN] : 30 slow requests,
>>> 1
>>> included below; oldest blocked for > 40.118530 secs
>>> 2014-09-12 14:25:53.205679 7f3517209700  0 log [WRN] : slow request
>>> 30.541026 seconds old, received at 2014-09-12 14:25:22.664538:
>>> osd_op(client.7448.1:2460291 100000632b7.00000000 [write 0~691
>>> [1 at -1],startsync 0~0] 1.994248a8 snapc 1=[] ondisk+write e116) currently
>>> reached pg
>>> 2014-09-12 17:52:40.503917 7f34e8ed2700  0 -- 10.141.8.181:6809/29664 >>
>>> 10.141.8.181:6847/62389 pipe(0x247ce040 sd=327 :6809 s=0 pgs=0 cs=0 l=1
>>> c=0x1bc8b9c0).accept replacing existing (lossy) channel (new one lossy=1)
>>>
>>> I *think* the missing log output is related to another issue I just found
>>> (http://tracker.ceph.com/issues/9470).
>>>
>>> So I can't find the original cause of the problem in the log files.
>>>
>>> Is there any other way I can find out what started the crashing of 30
>>> OSDs?
>>>
>>> Thanks!!
>>>
>>> Kenneth
>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users at lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ----- End message from Gregory Farnum <greg at inktank.com> -----
>
> --
>
> Met vriendelijke groeten,
> Kenneth Waegeman
>
>

