Hello,

On Wed, 19 Oct 2016 12:28:28 +0000 Jim Kilborn wrote:

> I have setup a new linux cluster to allow migration from our old SAN
> based cluster to a new cluster with ceph.
> All systems running centos 7.2 with the 3.10.0-327.36.1 kernel.

As others mentioned, not a good choice, but also not the (main) cause of
your problems.

> I am basically running stock ceph settings, with just turning the write
> cache off via hdparm on the drives, and temporarily turning off
> scrubbing.

The former is bound to kill performance; if you care that much for your
data but can't guarantee constant power (UPS, dual PSUs, etc.), consider
using a BBU caching controller.

The latter I venture you did because performance was abysmal with
scrubbing enabled, which is always a good indicator that your cluster
needs tuning and improving.

> The 4 ceph servers are all Dell 730XD with 128GB memory, and dual xeon.
> So server performance should be good.

Memory is fine. CPU I can't tell from the model number and I'm not
inclined to look up or guess, but that usually only becomes a bottleneck
when dealing with an all-SSD setup and things requiring the lowest
latency possible.

> Since I am running cephfs, I have tiering setup.

That should read "on top of EC pools", and as John said, not a good idea
at all, both EC pools and cache-tiering.

> Each server has 4 x 4TB drives for the erasure code pool, with K=3 and
> M=1. So the idea is to ensure a single host failure.
> Each server also has a 1TB Seagate 850 Pro SSD for the cache drive, in
> a replicated set with size=2

This isn't a Seagate, you mean Samsung. And that's a consumer model,
ill-suited for this task, even with the DC level SSDs below as journals.
As such, a replication of 2 is also ill-advised; I've seen these SSDs die
w/o ANY warning whatsoever and long before their (abysmal) endurance was
exhausted.

> The cache tier also has a 128GB SM863 SSD that is being used as a
> journal for the cache SSD. It has power loss protection

Those are fine. If you re-do your cluster, don't put more than 4-5
journals on them.

> My crush map is setup to ensure the cache pool uses only the 4 850 pro
> and the erasure code uses only the 16 spinning 4TB drives.
>
> The problem that I am seeing is that I start copying data from our old
> san to the ceph volume, and once the cache tier gets to my
> target_max_bytes of 1.4 TB, I start seeing:
>
> HEALTH_WARN 63 requests are blocked > 32 sec; 1 osds have slow requests;
> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
> 26 ops are blocked > 65.536 sec on osd.0
> 37 ops are blocked > 32.768 sec on osd.0
> 1 osds have slow requests
> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
>
> osd.0 is the cache ssd
>
> If I watch iostat on the cache ssd, I see the queue lengths are high
> and the await are high.
> Below is the iostat on the cache drive (osd.0) on the first host. The
> avgqu-sz is between 87 and 182 and the await is between 88ms and 1193ms
>
> Device:  rrqm/s  wrqm/s    r/s     w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz   await  r_await  w_await  svctm   %util
> sdb        0.00    0.33   9.00   84.33   0.96  20.11    462.40     75.92  397.56   125.67   426.58  10.70   99.90
> sdb        0.00    0.67  30.00   87.33   5.96  21.03    471.20     67.86  910.95    87.00  1193.99   8.27   97.07
> sdb        0.00   16.67  33.00  289.33   4.21  18.80    146.20     29.83   88.99    93.91    88.43   3.10   99.83
> sdb        0.00    7.33   7.67  261.67   1.92  19.63    163.81    117.42  331.97   182.04   336.36   3.71  100.00
>
> If I look at the iostat for all the drives, only the cache ssd drive is
> backed up

Yes, consumer SSDs on top of a design that channels everything through
them.
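In the interim you can at least spread the flushing out and see what osd.0
is actually choking on. A rough sketch, assuming your cache pool is named
"cephfs-cache" (substitute whatever yours is really called) and using
example ratios lower than the defaults so flushing starts earlier:

  # Flush/evict well before target_max_bytes is reached, so writes to the
  # EC pool trickle out instead of hitting all at once when the tier is
  # full (both values are fractions of target_max_bytes):
  ceph osd pool set cephfs-cache cache_target_dirty_ratio 0.3
  ceph osd pool set cephfs-cache cache_target_full_ratio 0.6

  # On the node hosting osd.0, look at the ops it is blocked on:
  ceph daemon osd.0 dump_ops_in_flight
  ceph daemon osd.0 dump_historic_ops

That won't fix consumer SSDs that fall over under sustained sync writes,
but it keeps the tier from slamming into target_max_bytes and backing up
every client op behind the flush.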
Rebuild your cluster along more conventional and conservative lines, and
don't use the 850 PROs. Feel free to run any new design by us.

Christian
--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/