OSD troubles on FS+Tiering

Hi Kenneth,

This problem looks much like the one you reported last time. The fix was
not backported to 0.85, so only the master branch is free of this bug.
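
When comparing the 'ceph status' outputs quoted below, it can help to pull the
numbers out mechanically. The following is only a minimal sketch in Python: the
function names are my own (hypothetical), it is not part of any Ceph tool, and
it assumes the summary lines have been unwrapped and stripped of the mail quote
prefixes so they look like the 0.85 output in this thread.

```python
import re


def count_down_osds(osdmap_line):
    """Return (total, up, down) from an osdmap summary line of 'ceph status'.

    Assumes the 0.85-style format seen in this thread, e.g.
    'osdmap e11023: 48 osds: 14 up, 14 in'.
    """
    match = re.search(r"(\d+) osds: (\d+) up, (\d+) in", osdmap_line)
    if match is None:
        raise ValueError("no osdmap summary found")
    total, up, _in = (int(g) for g in match.groups())
    return total, up, total - up


def find_pg_mismatch(health_text):
    """Return (pool, pg_num, pgp_num) if the health text reports
    'pool <name> pg_num X > pgp_num Y', otherwise None."""
    match = re.search(r"pool (\S+) pg_num (\d+) > pgp_num (\d+)", health_text)
    if match is None:
        return None
    pool, pg_num, pgp_num = match.groups()
    return pool, int(pg_num), int(pgp_num)
```

Calling count_down_osds("osdmap e11023: 48 osds: 14 up, 14 in") gives
(48, 14, 34), which matches the 34 down OSDs mentioned in the quoted message,
and find_pg_mismatch() on the health line reports ('ecdata', 256, 128).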

On Tue, Sep 16, 2014 at 9:58 PM, Gregory Farnum <greg at inktank.com> wrote:
> Heh, you'll have to talk to Haomai about issues with the
> KeyValueStore, but I know he's found a number of issues in the version
> of it that went to 0.85.
>
> In future please flag when you're running with experimental stuff; it
> helps direct attention to the right places! ;)
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> On Tue, Sep 16, 2014 at 5:28 AM, Kenneth Waegeman
> <Kenneth.Waegeman at ugent.be> wrote:
>>
>> ----- Message from Gregory Farnum <greg at inktank.com> ---------
>>    Date: Mon, 15 Sep 2014 10:37:07 -0700
>>    From: Gregory Farnum <greg at inktank.com>
>> Subject: Re: OSD troubles on FS+Tiering
>>      To: Kenneth Waegeman <Kenneth.Waegeman at ugent.be>
>>      Cc: ceph-users <ceph-users at lists.ceph.com>
>>
>>
>>> The pidfile bug is already fixed in master/giant branches.
>>>
>>> As for the crashing, I'd try killing all the osd processes and turning
>>> them back on again. It might just be some daemon restart failed, or
>>> your cluster could be sufficiently overloaded that the node disks are
>>> going unresponsive and they're suiciding, or...
>>
>>
>> I restarted them that way, and they eventually became clean again.
>> 'ceph status' reported that the 'ecdata' pool had too few PGs, so I
>> increased the number of PGs from 128 to 256 (with EC k+m=11).
>> After a few minutes I checked the cluster state again:
>>
>> [root at ceph001 ~]# ceph status
>>     cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
>>      health HEALTH_WARN 100 pgs down; 155 pgs peering; 81 pgs stale; 240 pgs
>> stuck inactive; 81 pgs stuck stale; 240 pgs stuck unclean; 746 requests are
>> blocked > 32 sec; 'cache' at/near target max; pool ecdata pg_num 256 >
>> pgp_num 128
>>      monmap e1: 3 mons at
>> {ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0},
>> election epoch 8, quorum 0,1,2 ceph001,ceph002,ceph003
>>      mdsmap e6993: 1/1/1 up {0=ceph003=up:active}, 2 up:standby
>>      osdmap e11023: 48 osds: 14 up, 14 in
>>       pgmap v160466: 1472 pgs, 4 pools, 3899 GB data, 2374 kobjects
>>             624 GB used, 7615 GB / 8240 GB avail
>>                   75 creating
>>                 1215 active+clean
>>                  100 down+peering
>>                    1 active+clean+scrubbing
>>                   10 stale
>>                   16 stale+active+clean
>>
>> Again 34 OSDs are down. This time I have the error logs, so I checked a few
>> OSD logs.
>>
>> I checked the first host that was marked down:
>>
>>    -17> 2014-09-16 13:27:49.962938 7f5dfe6a3700  5 osd.7 pg_epoch: 8912
>> pg[2.b0s3(unlocked)] enter Initial
>>    -16> 2014-09-16 13:27:50.008842 7f5e02eac700  1 --
>> 10.143.8.180:6833/53810 <== osd.30 10.141.8.181:0/37396 2524 ====
>> osd_ping(ping e8912 stamp 2014-09-16 13:27:50.008514) v2 ==== 47+0+0
>> (3868888299 0 0) 0x18ef7080 con 0x6961600
>>    -15> 2014-09-16 13:27:50.008892 7f5e02eac700  1 --
>> 10.143.8.180:6833/53810 --> 10.141.8.181:0/37396 -- osd_ping(ping_reply
>> e8912 stamp 2014-09-16 13:27:50.008514) v2 -- ?+0 0x7326900 con 0x6961600
>>    -14> 2014-09-16 13:27:50.009159 7f5e046af700  1 --
>> 10.141.8.180:6847/53810 <== osd.30 10.141.8.181:0/37396 2524 ====
>> osd_ping(ping e8912 stamp 2014-09-16 13:27:50.008514) v2 ==== 47+0+0
>> (3868888299 0 0) 0x2210a760 con 0xadd0420
>>    -13> 2014-09-16 13:27:50.009202 7f5e046af700  1 --
>> 10.141.8.180:6847/53810 --> 10.141.8.181:0/37396 -- osd_ping(ping_reply
>> e8912 stamp 2014-09-16 13:27:50.008514) v2 -- ?+0 0x14e35a00 con 0xadd0420
>>    -12> 2014-09-16 13:27:50.034378 7f5dfeea4700  5 osd.7 pg_epoch: 8912
>> pg[2.71s3( v 8864'33363 (374'30362,8864'33363] local-les=813 n=16075 ec=104
>> les/c 813/815 805/8912/791) [24,10,8,7,45,27,30,46,38,4,23] r=3 lpr=8912
>> pi=104-8911/54 crt=8864'33359 inactive NOTIFY] exit Reset 0.127612 1
>> 0.000123
>>    -11> 2014-09-16 13:27:50.034432 7f5dfeea4700  5 osd.7 pg_epoch: 8912
>> pg[2.71s3( v 8864'33363 (374'30362,8864'33363] local-les=813 n=16075 ec=104
>> les/c 813/815 805/8912/791) [24,10,8,7,45,27,30,46,38,4,23] r=3 lpr=8912
>> pi=104-8911/54 crt=8864'33359 inactive NOTIFY] enter Started
>>    -10> 2014-09-16 13:27:50.034452 7f5dfeea4700  5 osd.7 pg_epoch: 8912
>> pg[2.71s3( v 8864'33363 (374'30362,8864'33363] local-les=813 n=16075 ec=104
>> les/c 813/815 805/8912/791) [24,10,8,7,45,27,30,46,38,4,23] r=3 lpr=8912
>> pi=104-8911/54 crt=8864'33359 inactive NOTIFY] enter Start
>>     -9> 2014-09-16 13:27:50.034469 7f5dfeea4700  1 osd.7 pg_epoch: 8912
>> pg[2.71s3( v 8864'33363 (374'30362,8864'33363] local-les=813 n=16075 ec=104
>> les/c 813/815 805/8912/791) [24,10,8,7,45,27,30,46,38,4,23] r=3 lpr=8912
>> pi=104-8911/54 crt=8864'33359 inactive NOTIFY] state<Start>: transitioning
>> to Stray
>>     -8> 2014-09-16 13:27:50.034491 7f5dfeea4700  5 osd.7 pg_epoch: 8912
>> pg[2.71s3( v 8864'33363 (374'30362,8864'33363] local-les=813 n=16075 ec=104
>> les/c 813/815 805/8912/791) [24,10,8,7,45,27,30,46,38,4,23] r=3 lpr=8912
>> pi=104-8911/54 crt=8864'33359 inactive NOTIFY] exit Start 0.000038 0
>> 0.000000
>>     -7> 2014-09-16 13:27:50.034521 7f5dfeea4700  5 osd.7 pg_epoch: 8912
>> pg[2.71s3( v 8864'33363 (374'30362,8864'33363] local-les=813 n=16075 ec=104
>> les/c 813/815 805/8912/791) [24,10,8,7,45,27,30,46,38,4,23] r=3 lpr=8912
>> pi=104-8911/54 crt=8864'33359 inactive NOTIFY] enter Started/Stray
>>     -6> 2014-09-16 13:27:50.034664 7f5dfeea4700  5 osd.7 pg_epoch: 8912
>> pg[2.7s10( v 8890'35265 (374'32264,8890'35265] local-les=816 n=32002 ec=104
>> les/c 816/818 805/814/730) [6,30,22,13,39,15,12,5,11,42,7] r=10 lpr=814
>> pi=104-813/36 luod=0'0 crt=8885'35261 active] exit
>> Started/ReplicaActive/RepNotRecovering 7944.878905 22472 0.038180
>>     -5> 2014-09-16 13:27:50.034689 7f5dfeea4700  5 osd.7 pg_epoch: 8912
>> pg[2.7s10( v 8890'35265 (374'32264,8890'35265] local-les=816 n=32002 ec=104
>> les/c 816/818 805/814/730) [6,30,22,13,39,15,12,5,11,42,7] r=10 lpr=814
>> pi=104-813/36 luod=0'0 crt=8885'35261 active] exit Started/ReplicaActive
>> 7944.878946 0 0.000000
>>     -4> 2014-09-16 13:27:50.034711 7f5dfeea4700  5 osd.7 pg_epoch: 8912
>> pg[2.7s10( v 8890'35265 (374'32264,8890'35265] local-les=816 n=32002 ec=104
>> les/c 816/818 805/814/730) [6,30,22,13,39,15,12,5,11,42,7] r=10 lpr=814
>> pi=104-813/36 luod=0'0 crt=8885'35261 active] exit Started 7945.924923 0
>> 0.000000
>>     -3> 2014-09-16 13:27:50.034732 7f5dfeea4700  5 osd.7 pg_epoch: 8912
>> pg[2.7s10( v 8890'35265 (374'32264,8890'35265] local-les=816 n=32002 ec=104
>> les/c 816/818 805/814/730) [6,30,22,13,39,15,12,5,11,42,7] r=10 lpr=814
>> pi=104-813/36 luod=0'0 crt=8885'35261 active] enter Reset
>>     -2> 2014-09-16 13:27:50.034869 7f5dfeea4700  5 osd.7 pg_epoch: 8912
>> pg[2.87s10(unlocked)] enter Initial
>>     -1> 2014-09-16 13:27:50.042055 7f5e11981700  5 osd.7 8912 tick
>>      0> 2014-09-16 13:27:50.045856 7f5e1015f700 -1 *** Caught signal
>> (Aborted) **
>>  in thread 7f5e1015f700
>>
>>  ceph version 0.85 (a0c22842db9eaee9840136784e94e50fabe77187)
>>  1: /usr/bin/ceph-osd() [0xa72096]
>>  2: (()+0xf130) [0x7f5e193d7130]
>>  3: (gsignal()+0x39) [0x7f5e17dd5989]
>>  4: (abort()+0x148) [0x7f5e17dd7098]
>>  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f5e186e89d5]
>>  6: (()+0x5e946) [0x7f5e186e6946]
>>  7: (()+0x5e973) [0x7f5e186e6973]
>>  8: (()+0x5eb9f) [0x7f5e186e6b9f]
>>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x1ef) [0xb5e58f]
>>  10: (GenericObjectMap::list_objects(coll_t const&, ghobject_t, int,
>> std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x45e)
>> [0xa3b1ee]
>>  11: (KeyValueStore::collection_list_partial(coll_t, ghobject_t, int, int,
>> snapid_t, std::vector<ghobject_t, std::allocator<ghobject_t> >*,
>> ghobject_t*)+0x274) [0x9042d4]
>>  12: (KeyValueStore::_split_collection(coll_t, unsigned int, unsigned int,
>> coll_t, KeyValueStore::BufferTransaction&)+0x421) [0x91f091]
>>  13: (KeyValueStore::_do_transaction(ObjectStore::Transaction&,
>> KeyValueStore::BufferTransaction&, ThreadPool::TPHandle*)+0xa4c) [0x920f2c]
>>  14: (KeyValueStore::_do_transactions(std::list<ObjectStore::Transaction*,
>> std::allocator<ObjectStore::Transaction*> >&, unsigned long,
>> ThreadPool::TPHandle*)+0x13f) [0x92385f]
>>  15: (KeyValueStore::_do_op(KeyValueStore::OpSequencer*,
>> ThreadPool::TPHandle&)+0xac) [0x923a7c]
>>  16: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb10) [0xb4ef50]
>>  17: (ThreadPool::WorkThread::entry()+0x10) [0xb50040]
>>  18: (()+0x7df3) [0x7f5e193cfdf3]
>>  19: (clone()+0x6d) [0x7f5e17e963dd]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
>> interpret this.
>>
>>
>> On some hosts whose OSDs crashed afterwards:
>>
>> =0 cs=0 l=1 c=0x139ed280).connect error 10.141.8.181:6822/36408, (111)
>> Connection refused
>>     -7> 2014-09-16 13:28:36.028858 7f74eb007700  2 -- 10.141.8.180:0/52318
>>>> 10.141.8.181:6822/36408 pipe(0x18147bc0 sd=41 :0 s=1 pgs
>> =0 cs=0 l=1 c=0x139ed280).fault (111) Connection refused
>>     -6> 2014-09-16 13:28:36.029423 7f74e4c96700  2 -- 10.141.8.180:0/52318
>>>> 10.143.8.181:6815/36408 pipe(0x18147640 sd=81 :0 s=1 pgs
>> =0 cs=0 l=1 c=0x139e91e0).connect error 10.143.8.181:6815/36408, (111)
>> Connection refused
>>     -5> 2014-09-16 13:28:36.029443 7f74e4c96700  2 -- 10.141.8.180:0/52318
>>>> 10.143.8.181:6815/36408 pipe(0x18147640 sd=81 :0 s=1 pgs
>> =0 cs=0 l=1 c=0x139e91e0).fault (111) Connection refused
>>     -4> 2014-09-16 13:28:36.101914 7f7509534700  1 --
>> 10.143.8.180:6801/52318 <== osd.32 10.141.8.182:0/54784 2520 ====
>> osd_ping(ping
>> e8964 stamp 2014-09-16 13:28:36.101604) v2 ==== 47+0+0 (411091961 0 0)
>> 0x189b50a0 con 0x14a0f7a0
>>     -3> 2014-09-16 13:28:36.101952 7f7509534700  1 --
>> 10.143.8.180:6801/52318 --> 10.141.8.182:0/54784 -- osd_ping(ping_reply
>> e8941 st
>> amp 2014-09-16 13:28:36.101604) v2 -- ?+0 0x1a0feea0 con 0x14a0f7a0
>>     -2> 2014-09-16 13:28:36.101950 7f750ad37700  1 --
>> 10.141.8.180:6801/52318 <== osd.32 10.141.8.182:0/54784 2520 ====
>> osd_ping(ping
>> e8964 stamp 2014-09-16 13:28:36.101604) v2 ==== 47+0+0 (411091961 0 0)
>> 0x1178cce0 con 0x143944c0
>>     -1> 2014-09-16 13:28:36.102005 7f750ad37700  1 --
>> 10.141.8.180:6801/52318 --> 10.141.8.182:0/54784 -- osd_ping(ping_reply
>> e8941 st
>> amp 2014-09-16 13:28:36.101604) v2 -- ?+0 0x14b0f440 con 0x143944c0
>>      0> 2014-09-16 13:28:36.183818 7f751681f700 -1 os/GenericObjectMap.cc:
>> In function 'int GenericObjectMap::list_objects(const coll_
>> t&, ghobject_t, int, std::vector<ghobject_t>*, ghobject_t*)' thread
>> 7f751681f700 time 2014-09-16 13:28:36.181333
>> os/GenericObjectMap.cc: 1094: FAILED assert(start <= header.oid)
>>
>>  ceph version 0.85 (a0c22842db9eaee9840136784e94e50fabe77187)
>>  1: (GenericObjectMap::list_objects(coll_t const&, ghobject_t, int,
>> std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*
>> )+0x45e) [0xa3b1ee]
>>  2: (KeyValueStore::collection_list_partial(coll_t, ghobject_t, int, int,
>> snapid_t, std::vector<ghobject_t, std::allocator<ghobject_t>
>>  >*, ghobject_t*)+0x274) [0x9042d4]
>>  3: (KeyValueStore::_split_collection(coll_t, unsigned int, unsigned int,
>> coll_t, KeyValueStore::BufferTransaction&)+0x421) [0x91f091]
>>  4: (KeyValueStore::_do_transaction(ObjectStore::Transaction&,
>> KeyValueStore::BufferTransaction&, ThreadPool::TPHandle*)+0xa4c) [0x920
>> f2c]
>>  5: (KeyValueStore::_do_transactions(std::list<ObjectStore::Transaction*,
>> std::allocator<ObjectStore::Transaction*> >&, unsigned long,
>>  ThreadPool::TPHandle*)+0x13f) [0x92385f]
>>  6: (KeyValueStore::_do_op(KeyValueStore::OpSequencer*,
>> ThreadPool::TPHandle&)+0xac) [0x923a7c]
>>  7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xb10) [0xb4ef50]
>>  8: (ThreadPool::WorkThread::entry()+0x10) [0xb50040]
>>  9: (()+0x7df3) [0x7f7520317df3]
>>  10: (clone()+0x6d) [0x7f751edde3dd]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
>> interpret this.
>>
>>
>> I tried to restart the crashed OSDs again, but now they always fail
>> instantly with the above stack trace.
>>
>> Any ideas about this?
>> Thanks a lot!
>> Kenneth
>>
>>
>>> -Greg
>>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>>
>>>
>>> On Mon, Sep 15, 2014 at 5:43 AM, Kenneth Waegeman
>>> <Kenneth.Waegeman at ugent.be> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I have some strange OSD problems. Before the weekend I started some rsync
>>>> tests over CephFS, on a cache pool with underlying EC KV pool. Today the
>>>> cluster is completely degraded:
>>>>
>>>> [root at ceph003 ~]# ceph status
>>>>     cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
>>>>      health HEALTH_WARN 19 pgs backfill_toofull; 403 pgs degraded; 168
>>>> pgs
>>>> down; 8 pgs incomplete; 168 pgs peering; 61 pgs stale; 403 pgs stuck
>>>> degraded; 176 pgs stuck inactive; 61 pgs stuck stale; 589 pgs stuck
>>>> unclean;
>>>> 403 pgs stuck undersized; 403 pgs undersized; 300 requests are blocked >
>>>> 32
>>>> sec; recovery 15170/27902361 objects degraded (0.054%); 1922/27902361
>>>> objects misplaced (0.007%); 1 near full osd(s)
>>>>      monmap e1: 3 mons at
>>>>
>>>> {ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0},
>>>> election epoch 8, quorum 0,1,2 ceph001,ceph002,ceph003
>>>>      mdsmap e5: 1/1/1 up {0=ceph003=up:active}, 2 up:standby
>>>>      osdmap e719: 48 osds: 18 up, 18 in
>>>>       pgmap v144887: 1344 pgs, 4 pools, 4139 GB data, 2624 kobjects
>>>>             2282 GB used, 31397 GB / 33680 GB avail
>>>>             15170/27902361 objects degraded (0.054%); 1922/27902361
>>>> objects
>>>> misplaced (0.007%)
>>>>                   68 down+remapped+peering
>>>>                    1 active
>>>>                  754 active+clean
>>>>                    1 stale+incomplete
>>>>                    1 stale+active+clean+scrubbing
>>>>                   14 active+undersized+degraded+remapped
>>>>                    7 incomplete
>>>>                  100 down+peering
>>>>                    9 active+remapped
>>>>                   59 stale+active+undersized+degraded
>>>>                   19 active+undersized+degraded+remapped+backfill_toofull
>>>>                  311 active+undersized+degraded
>>>>
>>>> I tried to figure out what happened in the global logs:
>>>>
>>>> 2014-09-13 08:01:19.433313 mon.0 10.141.8.180:6789/0 66076 : [INF] pgmap
>>>> v65892: 1344 pgs: 1344 active+clean; 2606 GB data, 3116 GB used, 126 TB /
>>>> 129 TB avail; 4159 kB/s wr, 45 op/s
>>>> 2014-09-13 08:01:20.443019 mon.0 10.141.8.180:6789/0 66078 : [INF] pgmap
>>>> v65893: 1344 pgs: 1344 active+clean; 2606 GB data, 3116 GB used, 126 TB /
>>>> 129 TB avail; 561 kB/s wr, 11 op/s
>>>> 2014-09-13 08:01:20.777988 mon.0 10.141.8.180:6789/0 66081 : [INF] osd.19
>>>> 10.141.8.181:6809/29664 failed (3 reports from 3 peers after 20.000079 >=
>>>> grace 20.000000)
>>>> 2014-09-13 08:01:21.455887 mon.0 10.141.8.180:6789/0 66083 : [INF] osdmap
>>>> e117: 48 osds: 47 up, 48 in
>>>> 2014-09-13 08:01:21.462084 mon.0 10.141.8.180:6789/0 66084 : [INF] pgmap
>>>> v65894: 1344 pgs: 1344 active+clean; 2606 GB data, 3116 GB used, 126 TB /
>>>> 129 TB avail; 1353 kB/s wr, 13 op/s
>>>> 2014-09-13 08:01:21.477007 mon.0 10.141.8.180:6789/0 66085 : [INF] pgmap
>>>> v65895: 1344 pgs: 187 stale+active+clean, 1157 active+clean; 2606 GB
>>>> data,
>>>> 3116 GB used, 126 TB / 129 TB avail; 2300 kB/s wr, 21 op/s
>>>> 2014-09-13 08:01:22.456055 mon.0 10.141.8.180:6789/0 66086 : [INF] osdmap
>>>> e118: 48 osds: 47 up, 48 in
>>>> 2014-09-13 08:01:22.462590 mon.0 10.141.8.180:6789/0 66087 : [INF] pgmap
>>>> v65896: 1344 pgs: 187 stale+active+clean, 1157 active+clean; 2606 GB
>>>> data,
>>>> 3116 GB used, 126 TB / 129 TB avail; 13686 kB/s wr, 5 op/s
>>>> 2014-09-13 08:01:23.464302 mon.0 10.141.8.180:6789/0 66088 : [INF] pgmap
>>>> v65897: 1344 pgs: 187 stale+active+clean, 1157 active+clean; 2606 GB
>>>> data,
>>>> 3116 GB used, 126 TB / 129 TB avail; 11075 kB/s wr, 4 op/s
>>>> 2014-09-13 08:01:24.477467 mon.0 10.141.8.180:6789/0 66089 : [INF] pgmap
>>>> v65898: 1344 pgs: 187 stale+active+clean, 1157 active+clean; 2606 GB
>>>> data,
>>>> 3116 GB used, 126 TB / 129 TB avail; 4932 kB/s wr, 38 op/s
>>>> 2014-09-13 08:01:25.481027 mon.0 10.141.8.180:6789/0 66090 : [INF] pgmap
>>>> v65899: 1344 pgs: 187 stale+active+clean, 1157 active+clean; 2606 GB
>>>> data,
>>>> 3116 GB used, 126 TB / 129 TB avail; 5726 kB/s wr, 64 op/s
>>>> 2014-09-13 08:01:19.336173 osd.1 10.141.8.180:6803/26712 54442 : [WRN] 1
>>>> slow requests, 1 included below; oldest blocked for > 30.000137 secs
>>>> 2014-09-13 08:01:19.336341 osd.1 10.141.8.180:6803/26712 54443 : [WRN]
>>>> slow
>>>> request 30.000137 seconds old, received at 2014-09-13 08:00:49.335339:
>>>> osd_op(client.7448.1:17751783 10000203eac.0000000e [write 0~319488
>>>> [1 at -1],startsync 0~0] 1.b
>>>> 6c3a3a9 snapc 1=[] ondisk+write e116) currently reached pg
>>>> 2014-09-13 08:01:20.337602 osd.1 10.141.8.180:6803/26712 54444 : [WRN] 7
>>>> slow requests, 6 included below; oldest blocked for > 31.001947 secs
>>>> 2014-09-13 08:01:20.337688 osd.1 10.141.8.180:6803/26712 54445 : [WRN]
>>>> slow
>>>> request 30.998110 seconds old, received at 2014-09-13 08:00:49.339176:
>>>> osd_op(client.7448.1:17751787 10000203eac.0000000e [write 319488~65536
>>>> [1 at -1],startsync 0~0]
>>>>
>>>>
>>>> This is happening OSD after OSD..
>>>>
>>>> I tried to check the individual log of the osds, but all the individual
>>>> logs
>>>> stop abruptly (also from the osds that are still running):
>>>>
>>>> 2014-09-12 14:25:51.205276 7f3517209700  0 log [WRN] : 41 slow requests,
>>>> 1
>>>> included below; oldest blocked for > 38.118088 secs
>>>> 2014-09-12 14:25:51.205337 7f3517209700  0 log [WRN] : slow request
>>>> 36.558286 seconds old, received at 2014-09-12 14:25:14.646836:
>>>> osd_op(client.7448.1:2458392 1000006328f.0000000b [write 3989504~204800
>>>> [1 at -1],startsync 0~0] 1.9337bf4b snapc 1=[] ondisk+write e116) currently
>>>> reached pg
>>>> 2014-09-12 14:25:53.205586 7f3517209700  0 log [WRN] : 30 slow requests,
>>>> 1
>>>> included below; oldest blocked for > 40.118530 secs
>>>> 2014-09-12 14:25:53.205679 7f3517209700  0 log [WRN] : slow request
>>>> 30.541026 seconds old, received at 2014-09-12 14:25:22.664538:
>>>> osd_op(client.7448.1:2460291 100000632b7.00000000 [write 0~691
>>>> [1 at -1],startsync 0~0] 1.994248a8 snapc 1=[] ondisk+write e116) currently
>>>> reached pg
>>>> 2014-09-12 17:52:40.503917 7f34e8ed2700  0 -- 10.141.8.181:6809/29664 >>
>>>> 10.141.8.181:6847/62389 pipe(0x247ce040 sd=327 :6809 s=0 pgs=0 cs=0 l=1
>>>> c=0x1bc8b9c0).accept replacing existing (lossy) channel (new one lossy=1)
>>>>
>>>> I *think* the missing log entries are related to another issue I just
>>>> found (http://tracker.ceph.com/issues/9470).
>>>>
>>>> So I can't find the original problem from the log files.
>>>>
>>>> Is there any other way I can find out what started the crashing of 30
>>>> osds ?
>>>>
>>>> Thanks!!
>>>>
>>>> Kenneth
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users at lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>> ----- End message from Gregory Farnum <greg at inktank.com> -----
>>
>> --
>>
>> Met vriendelijke groeten,
>> Kenneth Waegeman
>>
>>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Best Regards,

Wheat

