Hello,

On Wed, 19 Oct 2016 12:28:28 +0000 Jim Kilborn wrote:

> I have setup a new linux cluster to allow migration from our old SAN
> based cluster to a new cluster with ceph.
> All systems running centos 7.2 with the 3.10.0-327.36.1 kernel.

As others mentioned, not a good choice, but also not the (main) cause of
your problems.

> I am basically running stock ceph settings, with just turning the write
> cache off via hdparm on the drives, and temporarily turning off
> scrubbing.

The former is bound to kill performance; if you care that much for your
data but can't guarantee constant power (UPS, dual PSUs, etc.), consider
using a BBU caching controller.

The latter I venture you did because performance was abysmal with
scrubbing enabled, which is always a good indicator that your cluster
needs tuning and improving.

> The 4 ceph servers are all Dell 730XD with 128GB memory, and dual xeon.
> So server performance should be good.

Memory is fine. CPU I can't tell from the model number and I'm not
inclined to look up or guess, but that usually only becomes a bottleneck
when dealing with an all-SSD setup and things requiring the lowest
latency possible.

> Since I am running cephfs, I have tiering setup.

That should read "on top of EC pools", and as John said, not a good idea
at all, both EC pools and cache-tiering.

> Each server has 4 x 4TB drives for the erasure code pool, with K=3 and
> M=1. So the idea is to ensure a single host failure.
> Each server also has a 1TB Seagate 850 Pro SSD for the cache drive, in
> a replicated set with size=2

This isn't a Seagate, you mean Samsung. And that's a consumer model,
ill-suited for this task, even with the DC level SSDs below as journals.
As such, a replication of 2 is also ill-advised; I've seen these SSDs die
w/o ANY warning whatsoever and long before their (abysmal) endurance was
exhausted.

> The cache tier also has a 128GB SM863 SSD that is being used as a
> journal for the cache SSD. It has power loss protection

Those are fine. If you re-do your cluster, don't put more than 4-5
journals on them.

> My crush map is setup to ensure the cache pool uses only the 4 850 pro
> and the erasure code uses only the 16 spinning 4TB drives.
>
> The problem that I am seeing is that I start copying data from our old
> san to the ceph volume, and once the cache tier gets to my
> target_max_bytes of 1.4 TB, I start seeing:
>
> HEALTH_WARN 63 requests are blocked > 32 sec; 1 osds have slow requests;
> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
> 26 ops are blocked > 65.536 sec on osd.0
> 37 ops are blocked > 32.768 sec on osd.0
> 1 osds have slow requests
> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
>
> osd.0 is the cache ssd
>
> If I watch iostat on the cache ssd, I see the queue lengths are high
> and the await are high.
> Below is the iostat on the cache drive (osd.0) on the first host. The
> avgqu-sz is between 87 and 182 and the await is between 88ms and 1193ms
>
> Device:  rrqm/s  wrqm/s    r/s     w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz   await  r_await  w_await  svctm   %util
> sdb        0.00    0.33   9.00   84.33   0.96  20.11    462.40     75.92  397.56   125.67   426.58  10.70   99.90
> sdb        0.00    0.67  30.00   87.33   5.96  21.03    471.20     67.86  910.95    87.00  1193.99   8.27   97.07
> sdb        0.00   16.67  33.00  289.33   4.21  18.80    146.20     29.83   88.99    93.91    88.43   3.10   99.83
> sdb        0.00    7.33   7.67  261.67   1.92  19.63    163.81    117.42  331.97   182.04   336.36   3.71  100.00
>
> If I look at the iostat for all the drives, only the cache ssd drive is
> backed up

Yes, consumer SSDs on top of a design that channels everything through
them.
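In the interim you can at least spread the flushing out and see what osd.0
is actually choking on. A rough sketch, assuming your cache pool is named
"cephfs-cache" (substitute whatever yours is really called) and using
example ratios lower than the defaults so flushing starts earlier:

  # Flush/evict well before target_max_bytes is reached, so writes to the
  # EC pool trickle out instead of hitting all at once when the tier is
  # full (both values are fractions of target_max_bytes):
  ceph osd pool set cephfs-cache cache_target_dirty_ratio 0.3
  ceph osd pool set cephfs-cache cache_target_full_ratio 0.6

  # On the node hosting osd.0, look at the ops it is blocked on:
  ceph daemon osd.0 dump_ops_in_flight
  ceph daemon osd.0 dump_historic_ops

That won't fix consumer SSDs that fall over under sustained sync writes,
but it keeps the tier from slamming into target_max_bytes and backing up
every client op behind the flush.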
Rebuild your cluster along more conventional and conservative lines, and
don't use the 850 PROs. Feel free to run any new design by us.

Christian
--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/