Re: New cephfs cluster performance issues - Jewel - cache pressure, capability release, poor iostat await avg queue size

Thanks Christian for the additional information and comments. Since then I have:



·         Upgraded the kernels, but still saw poor performance

·         Removed all the pools and recreated them as plain replicated pools (size 3), with just the two pools for data and metadata. No cache tier pool.

·         Turned the drive write caching back on with hdparm (rough commands below). We do have a large UPS and dual power supplies on the ceph units. If we get a long power outage, everything will go down anyway.
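
For reference, the rebuild was roughly along these lines (pool names, PG counts, and drive letters below are placeholders rather than the exact values we used):

   ## recreate plain replicated pools for cephfs, no cache tier
   ceph osd pool create cephfs_metadata 128 128 replicated
   ceph osd pool create cephfs_data 512 512 replicated
   ceph osd pool set cephfs_metadata size 3
   ceph osd pool set cephfs_data size 3
   ceph fs new cephfs cephfs_metadata cephfs_data

   ## re-enable the on-drive write cache on each spinner
   hdparm -W1 /dev/sdX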



I am no longer seeing the slow requests, blocked ops, etc.
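
While the copies run I have just been polling the cluster health, roughly like this:

   ceph -s
   ceph health detail
   ## or keep it on screen:
   watch -n 10 'ceph health detail'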



I think I will push for the following design per ceph server:



8 x 4TB SATA drives

2 x Samsung 128GB SM863 SSDs, each holding 4 OSD journals



With 4 hosts and a replication of 3 to start with.
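
The OSD prep on Jewel would look something like the below with ceph-disk, each SM863 ending up with 4 journal partitions (device names here are just examples):

   ## spinner /dev/sdc gets its journal carved out of SSD /dev/sda
   ceph-disk prepare /dev/sdc /dev/sda
   ceph-disk activate /dev/sdc1
   ## repeat for the other 3 spinners against /dev/sda,
   ## then the remaining 4 spinners against the second SSD /dev/sdb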



I did a quick test with 4 x 4TB spinners and 1 Samsung 128GB SM863 SSD holding the 4 OSD journals, with 4 hosts in the cluster over InfiniBand.



At the 4M reads, watching iftop, the client is receiving between 4.5 Gb/sec and 5.5 Gb/sec over InfiniBand, which works out to roughly 600 MB/sec (about 5 Gbit/sec divided by 8 bits per byte) and lines up well with the fio read numbers below.



fio --direct=1 --sync=1 --rw={write,randwrite,read,randread} --bs={4M,4K} --numjobs=1 --iodepth=1 --runtime=60 --size=5G --time_based --group_reporting --name=journal-test
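
Expanded out, the brace notation above is just the eight read/write and block size combinations, e.g. run as a loop from a directory on the filesystem under test (add --filename= to point fio at a specific file or device):

   for rw in write randwrite read randread; do
     for bs in 4M 4K; do
       fio --direct=1 --sync=1 --rw=$rw --bs=$bs --numjobs=1 --iodepth=1 \
           --runtime=60 --size=5G --time_based --group_reporting \
           --name=journal-test-$rw-$bs
     done
   done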



FIO Test         Local disk             SAN/NFS                Ceph w/Repl/SSD journal
4M Writes        53 MB/sec    12 IOPS   62 MB/sec    15 IOPS   151 MB/sec   37 IOPS
4M Rand Writes   34 MB/sec     8 IOPS   63 MB/sec    15 IOPS   155 MB/sec   37 IOPS
4M Read          66 MB/sec    15 IOPS   102 MB/sec   25 IOPS   662 MB/sec  161 IOPS
4M Rand Read     73 MB/sec    17 IOPS   103 MB/sec   25 IOPS   670 MB/sec  163 IOPS
4K Writes        2.9 MB/sec  738 IOPS   3.8 MB/sec  952 IOPS   2.3 MB/sec  571 IOPS
4K Rand Writes   551 KB/sec  134 IOPS   3.6 MB/sec  911 IOPS   2.0 MB/sec  501 IOPS
4K Read          28 MB/sec  7001 IOPS   8 MB/sec   1945 IOPS   13 MB/sec  3256 IOPS
4K Rand Read     263 KB/sec             5 MB/sec   1246 IOPS   8 MB/sec   2015 IOPS



That performance is fine for our needs.

Again, thanks for the help guys.



Regards,

Jim



From: Christian Balzer <chibi@xxxxxxx>
Sent: Wednesday, October 19, 2016 7:54 PM
To: ceph-users@xxxxxxxxxxxxxx
Cc: Jim Kilborn <jim@xxxxxxxxxxxx>
Subject: Re: New cephfs cluster performance issues - Jewel - cache pressure, capability release, poor iostat await avg queue size



Hello,

On Wed, 19 Oct 2016 12:28:28 +0000 Jim Kilborn wrote:

> I have setup a new linux cluster to allow migration from our old SAN based cluster to a new cluster with ceph.
> All systems running centos 7.2 with the 3.10.0-327.36.1 kernel.
As others mentioned, not a good choice, but also not the (main) cause of
your problems.

> I am basically running stock ceph settings, with just turning the write cache off via hdparm on the drives, and temporarily turning off scrubbing.
>
The former is bound to kill performance. If you care that much for your
data but can't guarantee constant power (UPS, dual PSUs, etc.), consider
using a BBU caching controller.

The latter I venture you did because performance was abysmal with scrubbing
enabled, which is always a good indicator that your cluster needs tuning and
improving.

> The 4 ceph servers are all Dell 730XD with 128GB memory, and dual xeon. So Server performance should be good.
Memory is fine; the CPU I can't tell from the model number and I'm not
inclined to look up or guess, but that usually only becomes a bottleneck
when dealing with an all-SSD setup and things requiring the lowest latency
possible.


> Since I am running cephfs, I have tiering setup.
That should read "on top of EC pools", and as John said, not a good idea
at all, both EC pools and cache-tiering.

> Each server has 4 – 4TB drives for the erasure code pool, with K=3 and M=1. So the idea is to ensure a single host failure.
> Each server also has a 1TB Seagate 850 Pro SSD for the cache drive, in a replicated set with size=2

This isn't a Seagate, you mean Samsung. And that's a consumer model,
ill suited for this task, even with the DC level SSDs below as journals.

As such, a replication of 2 is also ill-advised; I've seen these SSDs
die w/o ANY warning whatsoever and long before their (abysmal) endurance
was exhausted.

> The cache tier also has a 128GB SM863 SSD that is being used as a journal for the cache SSD. It has power loss protection

Those are fine. If you re-do your cluster, don't put more than 4-5 journals
on them.

> My crush map is setup to ensure the cache pool uses only the 4 850 pro and the erasure code uses only the 16 spinning 4TB drives.
>
> The problems that I am seeing is that I start copying data from our old san to the ceph volume, and once the cache tier gets to my  target_max_bytes of 1.4 TB, I start seeing:
>
> HEALTH_WARN 63 requests are blocked > 32 sec; 1 osds have slow requests; noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
> 26 ops are blocked > 65.536 sec on osd.0
> 37 ops are blocked > 32.768 sec on osd.0
> 1 osds have slow requests
> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
>
> osd.0 is the cache ssd
>
> If I watch iostat on the cache ssd, I see the queue lengths are high and the await are high
> Below is the iostat on the cache drive (osd.0) on the first host. The avgqu-sz is between 87 and 182 and the await is between 88ms and 1193ms
>
> Device:   rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sdb
>                   0.00     0.33    9.00   84.33     0.96    20.11   462.40    75.92  397.56  125.67  426.58  10.70  99.90
>                   0.00     0.67   30.00   87.33     5.96    21.03   471.20    67.86  910.95   87.00 1193.99   8.27  97.07
>                   0.00    16.67   33.00  289.33     4.21    18.80   146.20    29.83   88.99   93.91   88.43   3.10  99.83
>                   0.00     7.33    7.67  261.67     1.92    19.63   163.81   117.42  331.97  182.04  336.36   3.71 100.00
>
>
> If I look at the iostat for all the drives, only the cache ssd drive is backed up
>
Yes, consumer SSDs on top of a design that channels everything through
them.

Rebuild your cluster along more conventional and conservative lines; don't
use the 850 PROs.
Feel free to run any new design by us.

Christian
--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx    Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



