Hi,

I've been fighting to get good stability on my cluster for about 3 weeks now. I am running into intermittent issues where OSDs flap, marking other OSDs down, and then the cluster returns to a stable state for hours or even days.

The cluster is 4x Cisco UCS S3260 with dual E5-2660 CPUs, 256GB RAM, and 40G networking to 40G Brocade VDX switches. The OSDs are 6TB HGST SAS drives, with 400GB HGST SAS 12G SSDs for journals. My configuration is 4 journals per host with 12 disks per journal, for a total of 56 disks per system and 52 OSDs. I am using CentOS 7 with kernel 3.10 and the Red Hat tuned-adm throughput-performance profile enabled.

I have these sysctls set:

    kernel.pid_max = 4194303
    fs.file-max = 6553600
    vm.swappiness = 0
    vm.vfs_cache_pressure = 50
    vm.min_free_kbytes = 3145728

I feel like my issue is directly related to the high number of OSDs per host, but I'm not sure what issue I'm really running into. I believe I have ruled out network issues: I get 38Gbit consistently in iperf testing, and jumbo frame pings succeed with don't-fragment set and an 8972-byte packet size.

From FIO testing I seem to be able to get 150-200k write IOPS from my RBD clients on 1Gbit networking... This is about what I expected given the write penalty and my underpowered CPUs for this number of OSDs.

I get these messages, which I believe are normal:

2018-08-22 10:33:12.754722 7f7d009f5700 0 -- 10.20.136.8:6894/718902 >> 10.20.136.10:6876/490574 pipe(0x55aed77fd400 sd=192 :40502 s=2 pgs=1084 cs=53 l=0 c=0x55aed805bc80).fault with nothing to send, going to standby

Then, randomly, every few days I'll get a storm of these for 20 minutes or so:

2018-08-22 15:48:32.631186 7f44b7514700 -1 osd.127 37333 heartbeat_check: no reply from 10.20.142.11:6861 osd.198 since back 2018-08-22 15:48:08.052762 front 2018-08-22 15:48:31.282890 (cutoff 2018-08-22 15:48:12.630773)

Please help!!! For reference, I've put the commands behind the tuning, network, and FIO numbers above at the bottom of this mail.
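The sysctls above are applied via the usual drop-in mechanism, and the tuned profile with tuned-adm; roughly like this (the drop-in file name/path here is just illustrative):

    # the five sysctls listed above live in a drop-in, e.g. /etc/sysctl.d/99-ceph-tuning.conf
    sysctl --system                            # reload all sysctl drop-ins
    tuned-adm profile throughput-performance   # the Red Hat profile mentioned above
    tuned-adm active                           # confirm which profile is active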
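The network checks were roughly along these lines (the IPs are just two of the storage nodes as they appear in the logs, and the exact iperf flags may have differed):

    # jumbo frame check: 8972-byte payload + 28 bytes of IP/ICMP headers = 9000 MTU, don't-fragment set
    ping -M do -s 8972 -c 10 10.20.136.10
    # throughput between two nodes, consistently ~38Gbit/s
    iperf3 -c 10.20.136.10 -P 4 -t 30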
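The FIO numbers come from jobs roughly of this shape, using the rbd ioengine (the pool/image/client names here are placeholders rather than my real ones):

    # 4k random writes against a test image via librbd
    fio --name=rbd-randwrite \
        --ioengine=rbd --clientname=admin --pool=rbd --rbdname=fio-test \
        --rw=randwrite --bs=4k --iodepth=32 --numjobs=4 \
        --time_based --runtime=60 --group_reporting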