Yes, I've reviewed all the logs from the monitors and the hosts. I'm not getting useful errors (or any at all) in dmesg or the general messages log. I have two Ceph clusters; the other cluster is 300 SSDs and I never have issues like this. That's why I'm looking for help.
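
In case it's useful, this is roughly how I've been lining up the heartbeat failures against the OSD logs, the cluster log and syslog, per Alex's suggestion below (just a sketch: the paths assume the default /var/log/ceph layout on CentOS 7, and osd.127 / osd.198 are simply the pair from the log excerpt quoted further down):

  # Which OSDs went unreachable during the storm, and how often
  # ($12 is the "osd.N" after "no reply from <addr>" in the line format quoted below)
  grep 'heartbeat_check: no reply' /var/log/ceph/ceph-osd.127.log \
      | awk '{print $12}' | sort | uniq -c | sort -rn | head

  # What the accused OSD itself was doing at that moment (run on its host)
  grep '2018-08-22 15:4' /var/log/ceph/ceph-osd.198.log | less

  # Whether anything was (deep-)scrubbing cluster-wide right then
  # (on a mon; assumes scrub results still go to the cluster log at default settings)
  grep '2018-08-22 15:4' /var/log/ceph/ceph.log | grep -i scrub

  # Anything from the kernel or the SAS controller at the same time
  grep 'Aug 22 15:4' /var/log/messages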
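
I'm also going to try to catch the next storm in the act rather than after the fact, along the lines Christian suggests below. Something like this on every node (a rough sketch only: the atop interval and output path are arbitrary, and the ping just repeats my usual jumbo-frame check against a peer that got flagged):

  # Keep 10-second atop samples on disk so the storm window can be replayed later with `atop -r <file>`
  mkdir -p /var/log/atop
  nohup atop -w /var/log/atop/atop_$(hostname -s).raw 10 &

  # During a storm: re-run the no-fragment 8972-byte ping against the flagged peer's address
  ping -M do -s 8972 -c 20 10.20.142.11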
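
And if the deep scrubs do turn out to line up with the storms, my understanding is that Jewel lets me confine and throttle them at runtime, something along these lines (the hours and sleep value are only examples; the same options could go in ceph.conf instead):

  ceph tell osd.* injectargs '--osd_scrub_begin_hour 1 --osd_scrub_end_hour 7 --osd_scrub_sleep 0.1'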

On Thu, Aug 23, 2018 at 3:22 PM Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
>
> On Wed, Aug 22, 2018 at 11:39 PM Tyler Bishop
> <tyler.bishop@xxxxxxxxxxxxxxxxx> wrote:
> >
> > During high load testing I'm only seeing user and sys CPU load around
> > 60%... my load doesn't seem to be anything crazy on the host, and
> > iowait stays between 6 and 10%. I have very good `ceph osd perf`
> > numbers too.
> >
> > I am using 10.2.11 Jewel.
> >
> > On Wed, Aug 22, 2018 at 11:30 PM Christian Balzer <chibi@xxxxxxx> wrote:
> >>
> >> Hello,
> >>
> >> On Wed, 22 Aug 2018 23:00:24 -0400 Tyler Bishop wrote:
> >>
> >> > Hi, I've been fighting to get good stability on my cluster for about
> >> > 3 weeks now. I am running into intermittent issues with OSDs flapping,
> >> > marking other OSDs down, then going back to a stable state for hours
> >> > and days.
> >> >
> >> > The cluster is 4x Cisco UCS S3260 with dual E5-2660, 256GB RAM, and
> >> > 40G networking to 40G Brocade VDX switches. The OSDs are 6TB HGST SAS
> >> > drives with 400GB HGST SAS 12G SSDs as journals. My configuration is
> >> > 4 journals per host with 12 disks per journal, for a total of 56 disks
> >> > per system and 52 OSDs.
> >> >
> >> Any denser and you'd have a storage black hole.
> >>
> >> You already pointed your finger in the right direction (or at least one
> >> of them), and everybody will agree that this setup is woefully
> >> underpowered in the CPU department.
> >>
> >> > I am using CentOS 7 with kernel 3.10 and the Red Hat tuned-adm
> >> > throughput-performance profile enabled.
> >> >
> >> Ceph version would be interesting as well...
> >>
> >> > I have these sysctls set:
> >> >
> >> > kernel.pid_max = 4194303
> >> > fs.file-max = 6553600
> >> > vm.swappiness = 0
> >> > vm.vfs_cache_pressure = 50
> >> > vm.min_free_kbytes = 3145728
> >> >
> >> > I feel like my issue is directly related to the high number of OSDs
> >> > per host, but I'm not sure what issue I'm really running into. I
> >> > believe I have ruled out network issues: I am able to get 38Gbit
> >> > consistently in iperf testing, and jumbo-frame pings succeed with the
> >> > no-fragment flag set and an 8972-byte packet size.
> >> >
> >> The fact that it all works for days at a time suggests this as well, but
> >> you need to verify these things when they're happening.
> >>
> >> > From fio testing I seem to be able to get 150-200k write IOPS from my
> >> > RBD clients on 1Gbit networking... This is about what I expected,
> >> > given the write penalty and my underpowered CPUs for the number of
> >> > OSDs.
> >> >
> >> > I get these messages, which I believe are normal?
> >> > 2018-08-22 10:33:12.754722 7f7d009f5700  0 -- 10.20.136.8:6894/718902
> >> > >> 10.20.136.10:6876/490574 pipe(0x55aed77fd400 sd=192 :40502 s=2
> >> > pgs=1084 cs=53 l=0 c=0x55aed805bc80).fault with nothing to send, going
> >> > to standby
> >> >
> >> Ignore.
> >>
> >> > Then randomly I'll get a storm of these every few days, for 20 minutes
> >> > or so:
> >> > 2018-08-22 15:48:32.631186 7f44b7514700 -1 osd.127 37333
> >> > heartbeat_check: no reply from 10.20.142.11:6861 osd.198 since back
> >> > 2018-08-22 15:48:08.052762 front 2018-08-22 15:48:31.282890 (cutoff
> >> > 2018-08-22 15:48:12.630773)
> >> >
> >> Randomly is unlikely.
> >>
> >> Again, catch it in the act: atop in huge terminal windows (showing all
> >> CPUs and disks) on all nodes should be very telling; collecting and
> >> graphing this data might work, too.
> >>
> >> My suspects would be deep scrubs and/or high IOPS spikes when this is
> >> happening, starving out the OSD processes (CPU-wise; RAM should be
> >> fine, one supposes).
> >>
> >> Christian
> >>
> >> > Please help!!!
>
> Have you looked at the OSD logs on the OSD nodes, by chance? I found
> that correlating the messages in those logs with your master ceph log,
> and also correlating with any messages in syslog or kern.log, can
> elucidate the cause of the problem pretty well.
> --
> Alex Gorbachev
> Storcium
>
> >>
> >> --
> >> Christian Balzer        Network/Systems Engineer
> >> chibi@xxxxxxx           Rakuten Communications

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com