OSD troubles on FS+Tiering

haomaiwang@xxxxxxxxx (Haomai Wang) · Wed, 17 Sep 2014 11:27:22 +0800

Hi Kenneth,

This looks very much like the problem you reported last time. The fix
has not been backported to 0.85, so only the master branch is free of
this bug.
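
If you want to confirm that every crashed OSD hit the same assertion, a small script along these lines might help. It is only a sketch: the function name is made up for illustration, and it assumes each assert appears in the OSD log on a single unwrapped line in the "file: line: FAILED assert(...)" form shown in your trace.

```python
import re
from collections import Counter

# Matches lines such as:
#   os/GenericObjectMap.cc: 1094: FAILED assert(start <= header.oid)
# Assumption: the whole assert message sits on one log line.
ASSERT_RE = re.compile(r"(\S+): (\d+): FAILED assert\((.*)\)\s*$")


def find_failed_asserts(lines):
    """Count (source file, line number, condition) for each FAILED assert."""
    hits = Counter()
    for line in lines:
        m = ASSERT_RE.search(line)
        if m:
            hits[(m.group(1), int(m.group(2)), m.group(3))] += 1
    return hits


# Small demonstration on two lines taken from the logs in this thread.
sample = [
    "os/GenericObjectMap.cc: 1094: FAILED assert(start <= header.oid)\n",
    "2014-09-16 13:27:50.042055 7f5e11981700  5 osd.7 8912 tick\n",
]
for (src, lineno, cond), count in find_failed_asserts(sample).items():
    print(f"{src}:{lineno} assert({cond}) seen {count} time(s)")
```

To run it against real logs, pass an open file object for each OSD log to find_failed_asserts() and compare the results across OSDs. If they all report the same file and line, it is very likely the same bug.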

On Tue, Sep 16, 2014 at 9:58 PM, Gregory Farnum <greg at inktank.com> wrote:
> Heh, you'll have to talk to Haomai about issues with the
> KeyValueStore, but I know he's found a number of issues in the version
> of it that went to 0.85.
>
> In future please flag when you're running with experimental stuff; it
> helps direct attention to the right places! ;)
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> On Tue, Sep 16, 2014 at 5:28 AM, Kenneth Waegeman
> <Kenneth.Waegeman at ugent.be> wrote:
>>
>> ----- Message from Gregory Farnum <greg at inktank.com> ---------
>>    Date: Mon, 15 Sep 2014 10:37:07 -0700
>>    From: Gregory Farnum <greg at inktank.com>
>> Subject: Re: OSD troubles on FS+Tiering
>>      To: Kenneth Waegeman <Kenneth.Waegeman at ugent.be>
>>      Cc: ceph-users <ceph-users at lists.ceph.com>
>>
>>
>>> The pidfile bug is already fixed in master/giant branches.
>>>
>>> As for the crashing, I'd try killing all the osd processes and turning
>>> them back on again. It might just be that some daemon restart failed, or
>>> your cluster could be sufficiently overloaded that the node disks are
>>> going unresponsive and they're suiciding, or...
>>
>>
>> I restarted them that way, and they eventually got clean again.
>> 'ceph status' reported that the 'ecdata' pool had too few PGs, so I
>> increased the number of PGs from 128 to 256 (with EC k+m=11).
>> After a few minutes I checked the cluster state again:
>>
>> [root@ceph001 ~]# ceph status
>>     cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
>>      health HEALTH_WARN 100 pgs down; 155 pgs peering; 81 pgs stale; 240 pgs
>> stuck inactive; 81 pgs stuck stale; 240 pgs stuck unclean; 746 requests are
>> blocked > 32 sec; 'cache' at/near target max; pool ecdata pg_num 256 >
>> pgp_num 128
>>      monmap e1: 3 mons at
>> {ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0},
>> election epoch 8, quorum 0,1,2 ceph001,ceph002,ceph003
>>      mdsmap e6993: 1/1/1 up {0=ceph003=up:active}, 2 up:standby
>>      osdmap e11023: 48 osds: 14 up, 14 in
>>       pgmap v160466: 1472 pgs, 4 pools, 3899 GB data, 2374 kobjects
>>             624 GB used, 7615 GB / 8240 GB avail
>>                   75 creating
>>                 1215 active+clean
>>                  100 down+peering
>>                    1 active+clean+scrubbing
>>                   10 stale
>>                   16 stale+active+clean
>>
>> Again 34 OSDs are down. This time I have the error logs, so I checked a few
>> OSD logs.
>>
>> On the first host that was marked down:
>>
>>    -17> 2014-09-16 13:27:49.962938 7f5dfe6a3700  5 osd.7 pg_epoch: 8912
>> pg[2.b0s3(unlocked)] enter Initial
>>    -16> 2014-09-16 13:27:50.008842 7f5e02eac700  1 --
>> 10.143.8.180:6833/53810 <== osd.30 10.141.8.181:0/37396 2524 ====
>> osd_ping(ping e8912 stamp 2014-09-16 13:27:50.008514) v2 ==== 47+0+0
>> (3868888299 0 0) 0x18ef7080 con 0x6961600
>>    -15> 2014-09-16 13:27:50.008892 7f5e02eac700  1 --
>> 10.143.8.180:6833/53810 --> 10.141.8.181:0/37396 -- osd_ping(ping_reply
>> e8912 stamp 2014-09-16 13:27:50.008514) v2 -- ?+0 0x7326900 con 0x6961600
>>    -14> 2014-09-16 13:27:50.009159 7f5e046af700  1 --
>> 10.141.8.180:6847/53810 <== osd.30 10.141.8.181:0/37396 2524 ====
>> osd_ping(ping e8912 stamp 2014-09-16 13:27:50.008514) v2 ==== 47+0+0
>> (3868888299 0 0) 0x2210a760 con 0xadd0420
>>    -13> 2014-09-16 13:27:50.009202 7f5e046af700  1 --
>> 10.141.8.180:6847/53810 --> 10.141.8.181:0/37396 -- osd_ping(ping_reply
>> e8912 stamp 2014-09-16 13:27:50.008514) v2 -- ?+0 0x14e35a00 con 0xadd0420
>>    -12> 2014-09-16 13:27:50.034378 7f5dfeea4700  5 osd.7 pg_epoch: 8912
>> pg[2.71s3( v 8864'33363 (374'30362,8864'33363] local-les=813 n=16075 ec=104
>> les/c 813/815 805/8912/791) [24,10,8,7,45,27,30,46,38,4,23] r=3 lpr=8912
>> pi=104-8911/54 crt=8864'33359 inactive NOTIFY] exit Reset 0.127612 1
>> 0.000123
>>    -11> 2014-09-16 13:27:50.034432 7f5dfeea4700  5 osd.7 pg_epoch: 8912
>> pg[2.71s3( v 8864'33363 (374'30362,8864'33363] local-les=813 n=16075 ec=104
>> les/c 813/815 805/8912/791) [24,10,8,7,45,27,30,46,38,4,23] r=3 lpr=8912
>> pi=104-8911/54 crt=8864'33359 inactive NOTIFY] enter Started
>>    -10> 2014-09-16 13:27:50.034452 7f5dfeea4700  5 osd.7 pg_epoch: 8912
>> pg[2.71s3( v 8864'33363 (374'30362,8864'33363] local-les=813 n=16075 ec=104
>> les/c 813/815 805/8912/791) [24,10,8,7,45,27,30,46,38,4,23] r=3 lpr=8912
>> pi=104-8911/54 crt=8864'33359 inactive NOTIFY] enter Start
>>     -9> 2014-09-16 13:27:50.034469 7f5dfeea4700  1 osd.7 pg_epoch: 8912
>> pg[2.71s3( v 8864'33363 (374'30362,8864'33363] local-les=813 n=16075 ec=104
>> les/c 813/815 805/8912/791) [24,10,8,7,45,27,30,46,38,4,23] r=3 lpr=8912
>> pi=104-8911/54 crt=8864'33359 inactive NOTIFY] state<Start>: transitioning
>> to Stray
>>     -8> 2014-09-16 13:27:50.034491 7f5dfeea4700  5 osd.7 pg_epoch: 8912
>> pg[2.71s3( v 8864'33363 (374'30362,8864'33363] local-les=813 n=16075 ec=104
>> les/c 813/815 805/8912/791) [24,10,8,7,45,27,30,46,38,4,23] r=3 lpr=8912
>> pi=104-8911/54 crt=8864'33359 inactive NOTIFY] exit Start 0.000038 0
>> 0.000000
>>     -7> 2014-09-16 13:27:50.034521 7f5dfeea4700  5 osd.7 pg_epoch: 8912
>> pg[2.71s3( v 8864'33363 (374'30362,8864'33363] local-les=813 n=16075 ec=104
>> les/c 813/815 805/8912/791) [24,10,8,7,45,27,30,46,38,4,23] r=3 lpr=8912
>> pi=104-8911/54 crt=8864'33359 inactive NOTIFY] enter Started/Stray
>>     -6> 2014-09-16 13:27:50.034664 7f5dfeea4700  5 osd.7 pg_epoch: 8912
>> pg[2.7s10( v 8890'35265 (374'32264,8890'35265] local-les=816 n=32002 ec=104
>> les/c 816/818 805/814/730) [6,30,22,13,39,15,12,5,11,42,7] r=10 lpr=814
>> pi=104-813/36 luod=0'0 crt=8885'35261 active] exit
>> Started/ReplicaActive/RepNotRecovering 7944.878905 22472 0.038180
>>     -5> 2014-09-16 13:27:50.034689 7f5dfeea4700  5 osd.7 pg_epoch: 8912
>> pg[2.7s10( v 8890'35265 (374'32264,8890'35265] local-les=816 n=32002 ec=104
>> les/c 816/818 805/814/730) [6,30,22,13,39,15,12,5,11,42,7] r=10 lpr=814
>> pi=104-813/36 luod=0'0 crt=8885'35261 active] exit Started/ReplicaActive
>> 7944.878946 0 0.000000
>>     -4> 2014-09-16 13:27:50.034711 7f5dfeea4700  5 osd.7 pg_epoch: 8912
>> pg[2.7s10( v 8890'35265 (374'32264,8890'35265] local-les=816 n=32002 ec=104
>> les/c 816/818 805/814/730) [6,30,22,13,39,15,12,5,11,42,7] r=10 lpr=814
>> pi=104-813/36 luod=0'0 crt=8885'35261 active] exit Started 7945.924923 0
>> 0.000000
>>     -3> 2014-09-16 13:27:50.034732 7f5dfeea4700  5 osd.7 pg_epoch: 8912
>> pg[2.7s10( v 8890'35265 (374'32264,8890'35265] local-les=816 n=32002 ec=104
>> les/c 816/818 805/814/730) [6,30,22,13,39,15,12,5,11,42,7] r=10 lpr=814
>> pi=104-813/36 luod=0'0 crt=8885'35261 active] enter Reset
>>     -2> 2014-09-16 13:27:50.034869 7f5dfeea4700  5 osd.7 pg_epoch: 8912
>> pg[2.87s10(unlocked)] enter Initial
>>     -1> 2014-09-16 13:27:50.042055 7f5e11981700  5 osd.7 8912 tick
>>      0> 2014-09-16 13:27:50.045856 7f5e1015f700 -1 *** Caught signal
>> (Aborted) **
>>  in thread 7f5e1015f700
>>
>>  ceph version 0.85 (a0c22842db9eaee9840136784e94e50fabe77187)
>>  1: /usr/bin/ceph-osd() [0xa72096]
>>  2: (()+0xf130) [0x7f5e193d7130]
>>  3: (gsignal()+0x39) [0x7f5e17dd5989]
>>  4: (abort()+0x148) [0x7f5e17dd7098]
>>  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f5e186e89d5]
>>  6: (()+0x5e946) [0x7f5e186e6946]
>>  7: (()+0x5e973) [0x7f5e186e6973]
>>  8: (()+0x5eb9f) [0x7f5e186e6b9f]
>>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x1ef) [0xb5e58f]
>>  10: (GenericObjectMap::list_objects(coll_t const&, ghobject_t, int,
>> std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x45e)
>> [0xa3b1ee]
>>  11: (KeyValueStore::collection_list_partial(coll_t, ghobject_t, int, int,
>> snapid_t, std::vector<ghobject_t, std::allocator<ghobject_t> >*,
>> ghobject_t*)+0x274) [0x9042d4]
>>  12: (KeyValueStore::_split_collection(coll_t, unsigned int, unsigned int,
>> coll_t, KeyValueStore::BufferTransaction&)+0x421) [0x91f091]
>>  13: (KeyValueStore::_do_transaction(ObjectStore::Transaction&,
>> KeyValueStore::BufferTransaction&, ThreadPool::TPHandle*)+0xa4c) [0x920f2c]
>>  14: (KeyValueStore::_do_transactions(std::list<ObjectStore::Transaction*,
>> std::allocator<ObjectStore::Transaction*> >&, unsigned long,
>> ThreadPool::TPHandle*)+0x13f) [0x92385f]
>>  15: (KeyValueStore::_do_op(KeyValueStore::OpSequencer*,
>> ThreadPool::TPHandle&)+0xac) [0x923a7c]
>>  16: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb10) [0xb4ef50]
>>  17: (ThreadPool::WorkThread::entry()+0x10) [0xb50040]
>>  18: (()+0x7df3) [0x7f5e193cfdf3]
>>  19: (clone()+0x6d) [0x7f5e17e963dd]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
>> interpret this.
>>
>>
>> On some of the hosts that crashed afterwards:
>>
>> =0 cs=0 l=1 c=0x139ed280).connect error 10.141.8.181:6822/36408, (111)
>> Connection refused
>>     -7> 2014-09-16 13:28:36.028858 7f74eb007700  2 -- 10.141.8.180:0/52318
>> >> 10.141.8.181:6822/36408 pipe(0x18147bc0 sd=41 :0 s=1 pgs=0 cs=0 l=1
>> c=0x139ed280).fault (111) Connection refused
>>     -6> 2014-09-16 13:28:36.029423 7f74e4c96700  2 -- 10.141.8.180:0/52318
>> >> 10.143.8.181:6815/36408 pipe(0x18147640 sd=81 :0 s=1 pgs=0 cs=0 l=1
>> c=0x139e91e0).connect error 10.143.8.181:6815/36408, (111) Connection
>> refused
>>     -5> 2014-09-16 13:28:36.029443 7f74e4c96700  2 -- 10.141.8.180:0/52318
>> >> 10.143.8.181:6815/36408 pipe(0x18147640 sd=81 :0 s=1 pgs=0 cs=0 l=1
>> c=0x139e91e0).fault (111) Connection refused
>>     -4> 2014-09-16 13:28:36.101914 7f7509534700  1 --
>> 10.143.8.180:6801/52318 <== osd.32 10.141.8.182:0/54784 2520 ====
>> osd_ping(ping e8964 stamp 2014-09-16 13:28:36.101604) v2 ==== 47+0+0
>> (411091961 0 0) 0x189b50a0 con 0x14a0f7a0
>>     -3> 2014-09-16 13:28:36.101952 7f7509534700  1 --
>> 10.143.8.180:6801/52318 --> 10.141.8.182:0/54784 -- osd_ping(ping_reply
>> e8941 stamp 2014-09-16 13:28:36.101604) v2 -- ?+0 0x1a0feea0 con 0x14a0f7a0
>>     -2> 2014-09-16 13:28:36.101950 7f750ad37700  1 --
>> 10.141.8.180:6801/52318 <== osd.32 10.141.8.182:0/54784 2520 ====
>> osd_ping(ping e8964 stamp 2014-09-16 13:28:36.101604) v2 ==== 47+0+0
>> (411091961 0 0) 0x1178cce0 con 0x143944c0
>>     -1> 2014-09-16 13:28:36.102005 7f750ad37700  1 --
>> 10.141.8.180:6801/52318 --> 10.141.8.182:0/54784 -- osd_ping(ping_reply
>> e8941 stamp 2014-09-16 13:28:36.101604) v2 -- ?+0 0x14b0f440 con 0x143944c0
>>      0> 2014-09-16 13:28:36.183818 7f751681f700 -1 os/GenericObjectMap.cc:
>> In function 'int GenericObjectMap::list_objects(const coll_t&, ghobject_t,
>> int, std::vector<ghobject_t>*, ghobject_t*)' thread 7f751681f700 time
>> 2014-09-16 13:28:36.181333
>> os/GenericObjectMap.cc: 1094: FAILED assert(start <= header.oid)
>>
>>  ceph version 0.85 (a0c22842db9eaee9840136784e94e50fabe77187)
>>  1: (GenericObjectMap::list_objects(coll_t const&, ghobject_t, int,
>> std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x45e)
>> [0xa3b1ee]
>>  2: (KeyValueStore::collection_list_partial(coll_t, ghobject_t, int, int,
>> snapid_t, std::vector<ghobject_t, std::allocator<ghobject_t> >*,
>> ghobject_t*)+0x274) [0x9042d4]
>>  3: (KeyValueStore::_split_collection(coll_t, unsigned int, unsigned int,
>> coll_t, KeyValueStore::BufferTransaction&)+0x421) [0x91f091]
>>  4: (KeyValueStore::_do_transaction(ObjectStore::Transaction&,
>> KeyValueStore::BufferTransaction&, ThreadPool::TPHandle*)+0xa4c) [0x920f2c]
>>  5: (KeyValueStore::_do_transactions(std::list<ObjectStore::Transaction*,
>> std::allocator<ObjectStore::Transaction*> >&, unsigned long,
>> ThreadPool::TPHandle*)+0x13f) [0x92385f]
>>  6: (KeyValueStore::_do_op(KeyValueStore::OpSequencer*,
>> ThreadPool::TPHandle&)+0xac) [0x923a7c]
>>  7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb10) [0xb4ef50]
>>  8: (ThreadPool::WorkThread::entry()+0x10) [0xb50040]
>>  9: (()+0x7df3) [0x7f7520317df3]
>>  10: (clone()+0x6d) [0x7f751edde3dd]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
>> interpret this.
>>
>>
>> I tried to restart the crashed OSDs again, but now they always fail
>> instantly with the above stack trace.
>>
>> Any ideas about this?
>> Thanks a lot!
>> Kenneth
>>
>>
>>> -Greg
>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>
>>>
>>> On Mon, Sep 15, 2014 at 5:43 AM, Kenneth Waegeman
>>> <Kenneth.Waegeman at ugent.be> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I have some strange OSD problems. Before the weekend I started some rsync
>>>> tests over CephFS, on a cache pool with underlying EC KV pool. Today the
>>>> cluster is completely degraded:
>>>>
>>>> [root@ceph003 ~]# ceph status
>>>>     cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
>>>>      health HEALTH_WARN 19 pgs backfill_toofull; 403 pgs degraded; 168 pgs
>>>> down; 8 pgs incomplete; 168 pgs peering; 61 pgs stale; 403 pgs stuck
>>>> degraded; 176 pgs stuck inactive; 61 pgs stuck stale; 589 pgs stuck
>>>> unclean; 403 pgs stuck undersized; 403 pgs undersized; 300 requests are
>>>> blocked > 32 sec; recovery 15170/27902361 objects degraded (0.054%);
>>>> 1922/27902361 objects misplaced (0.007%); 1 near full osd(s)
>>>>      monmap e1: 3 mons at
>>>> {ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0},
>>>> election epoch 8, quorum 0,1,2 ceph001,ceph002,ceph003
>>>>      mdsmap e5: 1/1/1 up {0=ceph003=up:active}, 2 up:standby
>>>>      osdmap e719: 48 osds: 18 up, 18 in
>>>>       pgmap v144887: 1344 pgs, 4 pools, 4139 GB data, 2624 kobjects
>>>>             2282 GB used, 31397 GB / 33680 GB avail
>>>>             15170/27902361 objects degraded (0.054%); 1922/27902361
>>>> objects misplaced (0.007%)
>>>>                   68 down+remapped+peering
>>>>                    1 active
>>>>                  754 active+clean
>>>>                    1 stale+incomplete
>>>>                    1 stale+active+clean+scrubbing
>>>>                   14 active+undersized+degraded+remapped
>>>>                    7 incomplete
>>>>                  100 down+peering
>>>>                    9 active+remapped
>>>>                   59 stale+active+undersized+degraded
>>>>                   19 active+undersized+degraded+remapped+backfill_toofull
>>>>                  311 active+undersized+degraded
>>>>
>>>> I tried to figure out what happened in the global logs:
>>>>
>>>> 2014-09-13 08:01:19.433313 mon.0 10.141.8.180:6789/0 66076 : [INF] pgmap
>>>> v65892: 1344 pgs: 1344 active+clean; 2606 GB data, 3116 GB used, 126 TB /
>>>> 129 TB avail; 4159 kB/s wr, 45 op/s
>>>> 2014-09-13 08:01:20.443019 mon.0 10.141.8.180:6789/0 66078 : [INF] pgmap
>>>> v65893: 1344 pgs: 1344 active+clean; 2606 GB data, 3116 GB used, 126 TB /
>>>> 129 TB avail; 561 kB/s wr, 11 op/s
>>>> 2014-09-13 08:01:20.777988 mon.0 10.141.8.180:6789/0 66081 : [INF] osd.19
>>>> 10.141.8.181:6809/29664 failed (3 reports from 3 peers after 20.000079 >=
>>>> grace 20.000000)
>>>> 2014-09-13 08:01:21.455887 mon.0 10.141.8.180:6789/0 66083 : [INF] osdmap
>>>> e117: 48 osds: 47 up, 48 in
>>>> 2014-09-13 08:01:21.462084 mon.0 10.141.8.180:6789/0 66084 : [INF] pgmap
>>>> v65894: 1344 pgs: 1344 active+clean; 2606 GB data, 3116 GB used, 126 TB /
>>>> 129 TB avail; 1353 kB/s wr, 13 op/s
>>>> 2014-09-13 08:01:21.477007 mon.0 10.141.8.180:6789/0 66085 : [INF] pgmap
>>>> v65895: 1344 pgs: 187 stale+active+clean, 1157 active+clean; 2606 GB data,
>>>> 3116 GB used, 126 TB / 129 TB avail; 2300 kB/s wr, 21 op/s
>>>> 2014-09-13 08:01:22.456055 mon.0 10.141.8.180:6789/0 66086 : [INF] osdmap
>>>> e118: 48 osds: 47 up, 48 in
>>>> 2014-09-13 08:01:22.462590 mon.0 10.141.8.180:6789/0 66087 : [INF] pgmap
>>>> v65896: 1344 pgs: 187 stale+active+clean, 1157 active+clean; 2606 GB data,
>>>> 3116 GB used, 126 TB / 129 TB avail; 13686 kB/s wr, 5 op/s
>>>> 2014-09-13 08:01:23.464302 mon.0 10.141.8.180:6789/0 66088 : [INF] pgmap
>>>> v65897: 1344 pgs: 187 stale+active+clean, 1157 active+clean; 2606 GB data,
>>>> 3116 GB used, 126 TB / 129 TB avail; 11075 kB/s wr, 4 op/s
>>>> 2014-09-13 08:01:24.477467 mon.0 10.141.8.180:6789/0 66089 : [INF] pgmap
>>>> v65898: 1344 pgs: 187 stale+active+clean, 1157 active+clean; 2606 GB data,
>>>> 3116 GB used, 126 TB / 129 TB avail; 4932 kB/s wr, 38 op/s
>>>> 2014-09-13 08:01:25.481027 mon.0 10.141.8.180:6789/0 66090 : [INF] pgmap
>>>> v65899: 1344 pgs: 187 stale+active+clean, 1157 active+clean; 2606 GB data,
>>>> 3116 GB used, 126 TB / 129 TB avail; 5726 kB/s wr, 64 op/s
>>>> 2014-09-13 08:01:19.336173 osd.1 10.141.8.180:6803/26712 54442 : [WRN] 1
>>>> slow requests, 1 included below; oldest blocked for > 30.000137 secs
>>>> 2014-09-13 08:01:19.336341 osd.1 10.141.8.180:6803/26712 54443 : [WRN] slow
>>>> request 30.000137 seconds old, received at 2014-09-13 08:00:49.335339:
>>>> osd_op(client.7448.1:17751783 10000203eac.0000000e [write 0~319488
>>>> [1@-1],startsync 0~0] 1.b6c3a3a9 snapc 1=[] ondisk+write e116) currently
>>>> reached pg
>>>> 2014-09-13 08:01:20.337602 osd.1 10.141.8.180:6803/26712 54444 : [WRN] 7
>>>> slow requests, 6 included below; oldest blocked for > 31.001947 secs
>>>> 2014-09-13 08:01:20.337688 osd.1 10.141.8.180:6803/26712 54445 : [WRN] slow
>>>> request 30.998110 seconds old, received at 2014-09-13 08:00:49.339176:
>>>> osd_op(client.7448.1:17751787 10000203eac.0000000e [write 319488~65536
>>>> [1@-1],startsync 0~0]
>>>>
>>>>
>>>> This is happening to OSD after OSD.
>>>>
>>>> I tried to check the individual OSD logs, but all of them stop abruptly
>>>> (including those of the OSDs that are still running):
>>>>
>>>> 2014-09-12 14:25:51.205276 7f3517209700  0 log [WRN] : 41 slow requests, 1
>>>> included below; oldest blocked for > 38.118088 secs
>>>> 2014-09-12 14:25:51.205337 7f3517209700  0 log [WRN] : slow request
>>>> 36.558286 seconds old, received at 2014-09-12 14:25:14.646836:
>>>> osd_op(client.7448.1:2458392 1000006328f.0000000b [write 3989504~204800
>>>> [1@-1],startsync 0~0] 1.9337bf4b snapc 1=[] ondisk+write e116) currently
>>>> reached pg
>>>> 2014-09-12 14:25:53.205586 7f3517209700  0 log [WRN] : 30 slow requests, 1
>>>> included below; oldest blocked for > 40.118530 secs
>>>> 2014-09-12 14:25:53.205679 7f3517209700  0 log [WRN] : slow request
>>>> 30.541026 seconds old, received at 2014-09-12 14:25:22.664538:
>>>> osd_op(client.7448.1:2460291 100000632b7.00000000 [write 0~691
>>>> [1@-1],startsync 0~0] 1.994248a8 snapc 1=[] ondisk+write e116) currently
>>>> reached pg
>>>> 2014-09-12 17:52:40.503917 7f34e8ed2700  0 -- 10.141.8.181:6809/29664 >>
>>>> 10.141.8.181:6847/62389 pipe(0x247ce040 sd=327 :6809 s=0 pgs=0 cs=0 l=1
>>>> c=0x1bc8b9c0).accept replacing existing (lossy) channel (new one lossy=1)
>>>>
>>>> I *think* the missing logs are related to another issue I just found
>>>> (http://tracker.ceph.com/issues/9470).
>>>>
>>>> So I can't find the original problem from the log files.
>>>>
>>>> Is there any other way I can find out what started the crashing of 30
>>>> OSDs?
>>>>
>>>> Thanks!!
>>>>
>>>> Kenneth
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users at lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>> ----- End message from Gregory Farnum <greg at inktank.com> -----
>>
>> --
>>
>> Met vriendelijke groeten,
>> Kenneth Waegeman
>>
>>

-- 
Best Regards,

Wheat