Thanks Christian for the additional information and comments.

· Upgraded the kernels, but still had poor performance.
· Removed all the pools and recreated them with just a replication of 3, with two pools, one for data and one for metadata. No cache tier pool.
· Turned the write caching back on with hdparm. We do have a large UPS and dual power supplies in the ceph unit. If we get a long power outage, everything will go down anyway.

I am no longer seeing the issue of the slow requests, blocked ops, etc.

I think I will push for the following design per ceph server:

· 8 x 4TB SATA drives
· 2 x Samsung 128GB SM863 SSDs, each holding 4 osd journals

With 4 hosts, and a replication of 3 to start with.

I did a quick test with 4 x 4TB spinners and 1 Samsung 128GB SM863 SSD holding the 4 osd journals, with 4 hosts in the cluster over InfiniBand. At the 4M read, watching iftop, the client is receiving between 4.5 Gb/sec and 5.5 Gb/sec over InfiniBand, which is around 600 MB/sec and lines up well with the FIO numbers.

fio --direct=1 --sync=1 --rw={write,randwrite,read,randread} --bs={4M,4K} --numjobs=1 --iodepth=1 --runtime=60 --size=5G --time_based --group_reporting --name=journal-test

FIO Test          Local disk             SAN/NFS                Ceph w/Repl/SSD journal
4M Writes         53 MB/sec    12 IOPS   62 MB/sec     15 IOPS  151 MB/sec    37 IOPS
4M Rand Writes    34 MB/sec     8 IOPS   63 MB/sec     15 IOPS  155 MB/sec    37 IOPS
4M Read           66 MB/sec    15 IOPS   102 MB/sec    25 IOPS  662 MB/sec   161 IOPS
4M Rand Read      73 MB/sec    17 IOPS   103 MB/sec    25 IOPS  670 MB/sec   163 IOPS
4K Writes         2.9 MB/sec  738 IOPS   3.8 MB/sec   952 IOPS  2.3 MB/sec   571 IOPS
4K Rand Writes    551 KB/sec  134 IOPS   3.6 MB/sec   911 IOPS  2.0 MB/sec   501 IOPS
4K Read           28 MB/sec  7001 IOPS   8 MB/sec    1945 IOPS  13 MB/sec   3256 IOPS
4K Rand Read      263 KB/sec             5 MB/sec    1246 IOPS  8 MB/sec    2015 IOPS

That performance is fine for our needs.

Again, thanks for the help guys.

Regards,
Jim

From: Christian Balzer <chibi@xxxxxxx>
Sent: Wednesday, October 19, 2016 7:54 PM
To: ceph-users@xxxxxxxxxxxxxx
Cc: Jim Kilborn <jim@xxxxxxxxxxxx>
Subject: Re: New cephfs cluster performance issues - Jewel - cache pressure, capability release, poor iostat await avg queue size

Hello,

On Wed, 19 Oct 2016 12:28:28 +0000 Jim Kilborn wrote:

> I have setup a new linux cluster to allow migration from our old SAN based cluster to a new cluster with ceph.
> All systems running centos 7.2 with the 3.10.0-327.36.1 kernel.

As others mentioned, not a good choice, but also not the (main) cause of your problems.

> I am basically running stock ceph settings, with just turning the write cache off via hdparm on the drives, and temporarily turning off scrubbing.
>

The former is bound to kill performance; if you care that much for your data but can't guarantee constant power (UPS, dual PSUs, etc.), consider using a BBU caching controller.

The latter I venture you did because performance was abysmal with scrubbing enabled, which is always a good indicator that your cluster needs tuning and improving.

> The 4 ceph servers are all Dell 730XD with 128GB memory, and dual xeon. So server performance should be good.

Memory is fine. The CPU I can't tell from the model number and I'm not inclined to look up or guess, but that usually only becomes a bottleneck when dealing with an all-SSD setup and things requiring the lowest latency possible.

> Since I am running cephfs, I have tiering setup.

That should read "on top of EC pools", and as John said, not a good idea at all, both EC pools and cache-tiering.
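
A plain replicated CephFS layout (no EC data pool, no cache tier) only needs something along the following lines; the pool names and PG counts below are an illustrative sketch, not values from this thread, and have to be sized for the actual number of OSDs:

  # Illustrative only: names and PG counts are examples, size them for your OSD count.
  ceph osd pool create cephfs_data 512 512 replicated
  ceph osd pool create cephfs_metadata 128 128 replicated
  ceph osd pool set cephfs_data size 3
  ceph osd pool set cephfs_metadata size 3
  ceph osd pool set cephfs_data min_size 2
  ceph osd pool set cephfs_metadata min_size 2
  ceph fs new cephfs cephfs_metadata cephfs_data

With size=3 and min_size=2, a single host failure leaves two copies available, so writes keep flowing without ever relying on a single replica.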
> Each server has 4 x 4TB drives for the erasure code pool, with K=3 and M=1. So the idea is to ensure a single host failure.
> Each server also has a 1TB Seagate 850 Pro SSD for the cache drive, in a replicated set with size=2

This isn't a Seagate, you mean Samsung. And that's a consumer model, ill suited for this task, even with the DC level SSDs below as journals.

As such, a replication of 2 is also ill advised; I've seen these SSDs die w/o ANY warning whatsoever and long before their (abysmal) endurance was exhausted.

> The cache tier also has a 128GB SM863 SSD that is being used as a journal for the cache SSD. It has power loss protection

Those are fine. If you re-do your cluster, don't put more than 4-5 journals on them.

> My crush map is setup to ensure the cache pool uses only the 4 850 Pros and the erasure code pool uses only the 16 spinning 4TB drives.
>
> The problem that I am seeing is that I start copying data from our old SAN to the ceph volume, and once the cache tier gets to my target_max_bytes of 1.4 TB, I start seeing:
>
> HEALTH_WARN 63 requests are blocked > 32 sec; 1 osds have slow requests; noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
> 26 ops are blocked > 65.536 sec on osd.0
> 37 ops are blocked > 32.768 sec on osd.0
> 1 osds have slow requests
> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
>
> osd.0 is the cache ssd
>
> If I watch iostat on the cache ssd, I see the queue lengths are high and the await values are high.
> Below is the iostat on the cache drive (osd.0) on the first host. The avgqu-sz is between 87 and 182 and the await is between 88ms and 1193ms.
>
> Device:  rrqm/s  wrqm/s    r/s     w/s   rMB/s  wMB/s  avgrq-sz  avgqu-sz   await  r_await  w_await  svctm  %util
> sdb
>            0.00    0.33   9.00   84.33    0.96  20.11    462.40     75.92  397.56   125.67   426.58  10.70  99.90
>            0.00    0.67  30.00   87.33    5.96  21.03    471.20     67.86  910.95    87.00  1193.99   8.27  97.07
>            0.00   16.67  33.00  289.33    4.21  18.80    146.20     29.83   88.99    93.91    88.43   3.10  99.83
>            0.00    7.33   7.67  261.67    1.92  19.63    163.81    117.42  331.97   182.04   336.36   3.71 100.00
>
> If I look at the iostat for all the drives, only the cache ssd drive is backed up
>

Yes, consumer SSDs on top of a design that channels everything through them.

Rebuild your cluster along more conventional and conservative lines; don't use the 850 PROs.
Feel free to run any new design by us.

Christian

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
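
For anyone following the diagnostics quoted in this thread, a minimal, illustrative set of commands for watching the same symptoms; osd.0 is the OSD named above, the cache pool name is a placeholder (the thread never gives it), and iostat comes from the sysstat package:

  ceph health detail                                # shows which OSDs currently have slow/blocked requests
  ceph daemon osd.0 dump_historic_ops               # run on the host carrying osd.0; recent slow ops with timings
  ceph osd pool get <cache-pool> target_max_bytes   # placeholder pool name; the flush/evict threshold discussed above
  iostat -xm 5                                      # extended per-device stats (avgqu-sz, await) as in the output above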