Re: production cluster down :(

Hi Nick,

Thank you for your reply!

Indeed, jumbo frames were not enabled.

Ping and everything else worked, so I thought the network was up. But
it was not passing a large enough MTU...

The f... Supermicro switch just deleted its config, so I had to
recreate everything and forgot to set the MTU on the uplink ports.
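
For anyone who hits the same symptom (default-size ping works, but PGs stay in peering), here is a minimal sketch of a check that would have caught it. It assumes Linux iputils `ping`, where `-M do` forbids fragmentation and `-s` sets the ICMP payload size, and it assumes IPv4 with a 9000-byte MTU. The function names are made up for illustration; the address is simply the peer from the OSD log below.

```python
import subprocess

IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo request header


def icmp_payload_for_mtu(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    payload = mtu - IPV4_HEADER - ICMP_HEADER
    if payload < 0:
        raise ValueError("MTU too small for an ICMP echo: %d" % mtu)
    return payload


def df_ping_command(host, mtu=9000, count=3):
    """Build a ping command that sets "don't fragment" (assumes iputils ping)."""
    return ["ping", "-M", "do", "-c", str(count),
            "-s", str(icmp_payload_for_mtu(mtu)), host]


def path_carries_mtu(host, mtu=9000):
    """Return True if full-size, unfragmented packets reach the host."""
    result = subprocess.run(df_ping_command(host, mtu),
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0
```

Calling `path_carries_mtu("10.0.1.16")` from each OSD node towards every other one should return False wherever a switch port in the path is still at the default MTU, even though a plain `ping` succeeds.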

Thank you!

-- 
Best regards

Oliver Dzombic
IP-Interactive

mailto:info@xxxxxxxxxxxxxxxxx

Address:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402, Amtsgericht Hanau (district court)
Managing Director: Oliver Dzombic

Tax no.: 35 236 3622 1
VAT ID: DE274086107


On 30.09.2016 at 15:46, Nick Fisk wrote:
>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Oliver Dzombic
>> Sent: 30 September 2016 14:16
>> To: ceph-users@xxxxxxxxxxxxxx
>> Subject:  production cluster down :(
>>
>> Hi,
>>
>> we have:
>>
>> ceph version 10.2.2
>>
>>      health HEALTH_ERR
>>             2240 pgs are stuck inactive for more than 300 seconds
>>             273 pgs down
>>             2240 pgs peering
>>             2240 pgs stuck inactive
>>             354 requests are blocked > 32 sec
>>             mds cluster is degraded
>>      monmap e1: 3 mons at {cephmon1=10.0.0.11:6789/0,cephmon2=10.0.0.12:6789/0,cephmon3=10.0.0.13:6789/0}
>>             election epoch 146, quorum 0,1,2 cephmon1,cephmon2,cephmon3
>>       fsmap e114: 1/1/1 up {0=cephmon1=up:replay}
>>      osdmap e2322: 24 osds: 24 up, 24 in; 2230 remapped pgs
>>             flags sortbitwise
>>       pgmap v8774321: 2240 pgs, 4 pools, 9997 GB data, 2629 kobjects
>>             34753 GB used, 19173 GB / 53926 GB avail
>>                 1957 remapped+peering
>>                  273 down+remapped+peering
>>                   10 peering
>>
>>
>> health detail:
>>
>> http://pastebin.com/GsQcG2U0
>>
>>
>> Sample log from one OSD:
>>
>>
>>
>> 2016-09-30 15:01:07.066632 7f2b65d70700  0 log_channel(cluster) log [WRN] : 2 slow requests, 1 included below; oldest blocked for > 659.155019 secs
>> 2016-09-30 15:01:07.066643 7f2b65d70700  0 log_channel(cluster) log [WRN] : slow request 480.599877 seconds old, received at 2016-09-30 14:53:06.466705: osd_op(mds.0.114:4 5.64e96f8f (undecoded) ack+read+known_if_redirected+full_force e2320) currently waiting for ack+read+peered
>> 2016-09-30 15:05:06.894995 7f2b35c8c700  0 -- 10.0.1.15:6810/8033 >> 10.0.1.16:6800/1679 pipe(0x7f2b9fc50800 sd=146 :6810 s=0 pgs=0 cs=0 l=0 c=0x7f2b9eaf1800).accept connect_seq 2 vs existing 1 state open
>> 2016-09-30 15:05:06.895558 7f2b39fcf700  0 -- 10.0.1.15:6810/8033 >> 10.0.1.16:6822/13278 pipe(0x7f2b9f199400 sd=207 :59416 s=2 pgs=47 cs=1 l=0 c=0x7f2b9f247d80).fault, initiating reconnect
>> 2016-09-30 15:05:06.895618 7f2b3a5d5700  0 -- 10.0.1.15:6810/8033 >> 10.0.1.16:6822/13278 pipe(0x7f2b9f199400 sd=207 :59416 s=1 pgs=47 cs=2 l=0 c=0x7f2b9f247d80).fault
> 
> Not sure how much help I can provide, but are you sure all your networking is working 100% between all your OSD nodes? Can you see anything in the log of the 10.0.1.16 node that it's trying to connect to?
> 
> 
>>
>> MDS:
>>
>> 2016-09-30 14:53:05.112007 7f150e599180  0 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374), process ceph-mds, pid 1092
>> 2016-09-30 14:53:05.113631 7f150e599180  0 pidfile_write: ignore empty --pid-file
>> 2016-09-30 14:53:06.455957 7f1508574700  1 mds.cephmon1 handle_mds_map standby
>> 2016-09-30 14:53:06.467568 7f1508574700  1 mds.0.114 handle_mds_map i am now mds.0.114
>> 2016-09-30 14:53:06.467575 7f1508574700  1 mds.0.114 handle_mds_map state change up:boot --> up:replay
>> 2016-09-30 14:53:06.467591 7f1508574700  1 mds.0.114 replay_start
>> 2016-09-30 14:53:06.467683 7f1508574700  1 mds.0.114  recovery set is
>>
>>
>>
>> I have already restarted Ceph.
>>
>> Nothing helps.
>>
>> I have basically no idea what to do now.
>>
>> Any help is greatly appreciated!
>>
>> Thank you!
>>
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 