Hi,

I've been fighting to get good stability on my cluster for about 3 weeks now. I am running into intermittent issues where OSDs flap, marking other OSDs down, and then the cluster returns to a stable state for hours or even days.

The cluster is 4x Cisco UCS S3260 with dual E5-2660 CPUs, 256GB RAM, and 40G networking to 40G Brocade VDX switches. The OSDs are 6TB HGST SAS drives, with 400GB HGST SAS 12G SSDs for journals. My configuration is 4 journals per host with 12 disks per journal, for a total of 56 disks per system and 52 OSDs. I am using CentOS 7 with kernel 3.10 and the Red Hat tuned-adm throughput-performance profile enabled.

I have these sysctls set:

    kernel.pid_max = 4194303
    fs.file-max = 6553600
    vm.swappiness = 0
    vm.vfs_cache_pressure = 50
    vm.min_free_kbytes = 3145728

I feel like my issue is directly related to the high number of OSDs per host, but I'm not sure what issue I'm really running into. I believe I have ruled out network issues: I get 38Gbit consistently in iperf testing, and jumbo frame pings succeed with don't-fragment set and an 8972-byte packet size.

From FIO testing I seem to be able to get 150-200k write IOPS from my RBD clients on 1Gbit networking... This is about what I expected given the write penalty and my underpowered CPUs for this number of OSDs.

I get these messages, which I believe are normal:

2018-08-22 10:33:12.754722 7f7d009f5700 0 -- 10.20.136.8:6894/718902 >> 10.20.136.10:6876/490574 pipe(0x55aed77fd400 sd=192 :40502 s=2 pgs=1084 cs=53 l=0 c=0x55aed805bc80).fault with nothing to send, going to standby

Then, randomly, every few days I'll get a storm of these for 20 minutes or so:

2018-08-22 15:48:32.631186 7f44b7514700 -1 osd.127 37333 heartbeat_check: no reply from 10.20.142.11:6861 osd.198 since back 2018-08-22 15:48:08.052762 front 2018-08-22 15:48:31.282890 (cutoff 2018-08-22 15:48:12.630773)

Please help!!! For reference, I've put the commands behind the tuning, network, and FIO numbers above at the bottom of this mail.
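The sysctls above are applied via the usual drop-in mechanism, and the tuned profile with tuned-adm; roughly like this (the drop-in file name/path here is just illustrative):

    # the five sysctls listed above live in a drop-in, e.g. /etc/sysctl.d/99-ceph-tuning.conf
    sysctl --system                            # reload all sysctl drop-ins
    tuned-adm profile throughput-performance   # the Red Hat profile mentioned above
    tuned-adm active                           # confirm which profile is active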
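The network checks were roughly along these lines (the IPs are just two of the storage nodes as they appear in the logs, and the exact iperf flags may have differed):

    # jumbo frame check: 8972-byte payload + 28 bytes of IP/ICMP headers = 9000 MTU, don't-fragment set
    ping -M do -s 8972 -c 10 10.20.136.10
    # throughput between two nodes, consistently ~38Gbit/s
    iperf3 -c 10.20.136.10 -P 4 -t 30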
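The FIO numbers come from jobs roughly of this shape, using the rbd ioengine (the pool/image/client names here are placeholders rather than my real ones):

    # 4k random writes against a test image via librbd
    fio --name=rbd-randwrite \
        --ioengine=rbd --clientname=admin --pool=rbd --rbdname=fio-test \
        --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 \
        --time_based --runtime=60 --group_reporting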