Re: New cephfs cluster performance issues- Jewel - cache pressure, capability release, poor iostat await avg queue size


 



The chart obviously didn't come through well. Here it is again:



fio --direct=1 --sync=1 --rw={write,randwrite,read,randread} --bs={4M,4K} --numjobs=1 --iodepth=1 --runtime=60 --size=5G --time_based --group_reporting --name=journal-test
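(The braces are just shorthand for the test matrix; presumably each rw/bs combination was run as its own fio invocation, along these lines:)

fio --direct=1 --sync=1 --rw=write    --bs=4M --numjobs=1 --iodepth=1 --runtime=60 --size=5G --time_based --group_reporting --name=journal-test
fio --direct=1 --sync=1 --rw=randread --bs=4K --numjobs=1 --iodepth=1 --runtime=60 --size=5G --time_based --group_reporting --name=journal-test
(...and likewise for the remaining rw/bs combinations)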



FIO Test         Local disk              SAN/NFS                 Ceph size=3/SSD journal
4M Writes        53 MB/sec    12 IOPS    62 MB/sec    15 IOPS    151 MB/sec   37 IOPS
4M Rand Writes   34 MB/sec     8 IOPS    63 MB/sec    15 IOPS    155 MB/sec   37 IOPS
4M Read          66 MB/sec    15 IOPS    102 MB/sec   25 IOPS    662 MB/sec  161 IOPS
4M Rand Read     73 MB/sec    17 IOPS    103 MB/sec   25 IOPS    670 MB/sec  163 IOPS
4K Writes        2.9 MB/sec  738 IOPS    3.8 MB/sec  952 IOPS    2.3 MB/sec  571 IOPS
4K Rand Writes   551 KB/sec  134 IOPS    3.6 MB/sec  911 IOPS    2.0 MB/sec  501 IOPS
4K Read          28 MB/sec  7001 IOPS    8 MB/sec   1945 IOPS    13 MB/sec  3256 IOPS
4K Rand Read     263 KB/sec              5 MB/sec   1246 IOPS    8 MB/sec   2015 IOPS






From: Jim Kilborn <jim@xxxxxxxxxxxx>
Sent: Thursday, October 20, 2016 10:20 AM
To: Christian Balzer <chibi@xxxxxxx>; ceph-users@xxxxxxxxxxxxxx
Subject: Re: New cephfs cluster performance issues- Jewel - cache pressure, capability release, poor iostat await avg queue size



Thanks, Christian, for the additional information and comments.



- Upgraded the kernels, but still had poor performance.

- Removed all the pools and recreated them with just a replication of 3, with the two pools for data and metadata. No cache tier pool.

- Turned write caching back on with hdparm (a rough sketch of this and the pool re-creation follows below). We do have a large UPS and dual power supplies in the ceph unit; if we get a long power outage, everything will go down anyway.
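For reference, a rough sketch of those last two steps; the pool names, PG counts, and device names below are placeholders, not the actual values used:

# Plain replicated pools for CephFS, no cache tier
ceph osd pool create cephfs_metadata 128 128 replicated
ceph osd pool create cephfs_data 1024 1024 replicated
ceph osd pool set cephfs_metadata size 3
ceph osd pool set cephfs_data size 3
ceph fs new cephfs cephfs_metadata cephfs_data

# Re-enable the on-drive write cache on each data disk
hdparm -W1 /dev/sdX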



I am no longer seeing the issue of the slow requests, ops blocked, etc.



I think I will push for the following design per ceph server



8 x 4TB SATA drives

2 x Samsung 128GB SM863 SSDs, each holding 4 OSD journals



With 4 hosts and a replication of 3 to start with.
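A rough sketch of how that layout would be prepared with Jewel's ceph-disk, assuming /dev/sd[a-h] are the spinners and /dev/sdi and /dev/sdj the SM863s (device names are placeholders):

# Four spinners per SSD; ceph-disk carves a journal partition out of the
# device given as the second argument.
ceph-disk prepare /dev/sda /dev/sdi
ceph-disk prepare /dev/sdb /dev/sdi
ceph-disk prepare /dev/sdc /dev/sdi
ceph-disk prepare /dev/sdd /dev/sdi
ceph-disk prepare /dev/sde /dev/sdj
ceph-disk prepare /dev/sdf /dev/sdj
ceph-disk prepare /dev/sdg /dev/sdj
ceph-disk prepare /dev/sdh /dev/sdj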



I did a quick test with 4 x 4TB spinners and 1 Samsung 128GB SM863 SSD holding the 4 OSD journals, with 4 hosts in the cluster over InfiniBand.



At the 4M read, watching iftop, the client is receiving between 4.5 Gb/sec and 5.5 Gb/sec over InfiniBand, which is around 600 MB/sec (5 Gbit/s / 8 bits per byte is roughly 625 MB/sec) and lines up well with the fio number.



fio --direct=1 --sync=1 --rw={write,randwrite,read,randread} --bs={4M,4K} --numjobs=1 --iodepth=1 --runtime=60 --size=5G --time_based --group_reporting --name=journal-test



FIO Test         Local disk              SAN/NFS                 Ceph w/Repl/SSD journal
4M Writes        53 MB/sec    12 IOPS    62 MB/sec    15 IOPS    151 MB/sec   37 IOPS
4M Rand Writes   34 MB/sec     8 IOPS    63 MB/sec    15 IOPS    155 MB/sec   37 IOPS
4M Read          66 MB/sec    15 IOPS    102 MB/sec   25 IOPS    662 MB/sec  161 IOPS
4M Rand Read     73 MB/sec    17 IOPS    103 MB/sec   25 IOPS    670 MB/sec  163 IOPS
4K Writes        2.9 MB/sec  738 IOPS    3.8 MB/sec  952 IOPS    2.3 MB/sec  571 IOPS
4K Rand Writes   551 KB/sec  134 IOPS    3.6 MB/sec  911 IOPS    2.0 MB/sec  501 IOPS
4K Read          28 MB/sec  7001 IOPS    8 MB/sec   1945 IOPS    13 MB/sec  3256 IOPS
4K Rand Read     263 KB/sec              5 MB/sec   1246 IOPS    8 MB/sec   2015 IOPS




That performance is fine for our needs.

Again, thanks for the help, guys.



Regards,

Jim



From: Christian Balzer <chibi@xxxxxxx>
Sent: Wednesday, October 19, 2016 7:54 PM
To: ceph-users@xxxxxxxxxxxxxx
Cc: Jim Kilborn <jim@xxxxxxxxxxxx>
Subject: Re: New cephfs cluster performance issues- Jewel - cache pressure, capability release, poor iostat await avg queue size



Hello,

On Wed, 19 Oct 2016 12:28:28 +0000 Jim Kilborn wrote:

> I have setup a new linux cluster to allow migration from our old SAN based cluster to a new cluster with ceph.
> All systems running centos 7.2 with the 3.10.0-327.36.1 kernel.
As others mentioned, not a good choice, but also not the (main) cause of
your problems.

> I am basically running stock ceph settings, with just turning the write cache off via hdparm on the drives, and temporarily turning off scrubbing.
>
The former is bound to kill performance; if you care that much for your
data but can't guarantee constant power (UPS, dual PSUs, etc.), consider
using a BBU caching controller.

The latter, I venture, you did because performance was abysmal with
scrubbing enabled.
Which is always a good indicator that your cluster needs tuning and improving.
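(For reference, those scrub flags, visible in your HEALTH_WARN output below, are just cluster-wide toggles, e.g.:

ceph osd set noscrub
ceph osd set nodeep-scrub
# ...and once the cluster is tuned, re-enable:
ceph osd unset noscrub
ceph osd unset nodeep-scrub
)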

> The 4 ceph servers are all Dell 730XD with 128GB memory, and dual xeon. So Server performance should be good.
Memory is fine. CPU I can't tell from the model number and I'm not
inclined to look up or guess, but that usually only becomes a bottleneck
when dealing with an all-SSD setup and things requiring the lowest latency
possible.


> Since I am running cephfs, I have tiering setup.
That should read "on top of EC pools", and as John said, neither EC pools
nor cache-tiering is a good idea here at all.

> Each server has 4 – 4TB drives for the erasure code pool, with K=3 and M=1. So the idea is to ensure a single host failure.
> Each server also has a 1TB Seagate 850 Pro SSD for the cache drive, in a replicated set with size=2

This isn't a Seagate; you mean Samsung. And that's a consumer model,
ill-suited for this task, even with the DC-level SSDs below as journals.

And as such, a replication of 2 is also ill-advised; I've seen these SSDs
die w/o ANY warning whatsoever, and long before their (abysmal) endurance
was exhausted.
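(If you were to keep such a pool at all, bumping the replication is a one-liner; "cephfs-cache" below is a placeholder for whatever your cache pool is called:

ceph osd pool set cephfs-cache size 3
ceph osd pool set cephfs-cache min_size 2
)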

> The cache tier also has a 128GB SM863 SSD that is being used as a journal for the cache SSD. It has power loss protection

Those are fine. If you re-do your cluster, don't put more than 4-5 journals
on them.

> My crush map is setup to ensure the cache pool uses only the 4 850 pro and the erasure code uses only the 16 spinning 4TB drives.
>
> The problems that I am seeing is that I start copying data from our old san to the ceph volume, and once the cache tier gets to my  target_max_bytes of 1.4 TB, I start seeing:
>
> HEALTH_WARN 63 requests are blocked > 32 sec; 1 osds have slow requests; noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
> 26 ops are blocked > 65.536 sec on osd.0
> 37 ops are blocked > 32.768 sec on osd.0
> 1 osds have slow requests
> noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
>
> osd.0 is the cache ssd
>
> If I watch iostat on the cache ssd, I see the queue lengths are high and the await are high
> Below is the iostat on the cache drive (osd.0) on the first host. The avgqu-sz is between 87 and 182 and the await is between 88ms and 1193ms
>
> Device:   rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sdb
>                   0.00     0.33    9.00   84.33     0.96    20.11   462.40    75.92  397.56  125.67  426.58  10.70  99.90
>                   0.00     0.67   30.00   87.33     5.96    21.03   471.20    67.86  910.95   87.00 1193.99   8.27  97.07
>                   0.00    16.67   33.00  289.33     4.21    18.80   146.20    29.83   88.99   93.91   88.43   3.10  99.83
>                   0.00     7.33    7.67  261.67     1.92    19.63   163.81   117.42  331.97  182.04  336.36   3.71 100.00
>
>
> If I look at the iostat for all the drives, only the cache ssd drive is backed up
>
Yes, consumer SSDs on top of a design that channels everything through
them.
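
(For completeness, a generic sketch of how to confirm where the blockage sits; osd.0 is taken from your HEALTH_WARN output, and the commands are standard, not specific to your setup:

ceph health detail                      # shows which OSDs have blocked requests
ceph daemon osd.0 dump_historic_ops     # on the OSD's host: recent slow ops with per-step timings
iostat -xm 5                            # per-device await and avgqu-sz, as quoted above
)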

Rebuild your cluster along more conventional and conservative lines; don't
use the 850 PROs.
Feel free to run any new design by us.

Christian
--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx    Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



