Re: New cephfs cluster performance issues- Jewel - cache pressure, capability release, poor iostat await avg queue size

Christian Balzer <chibi@xxxxxxx> · Fri, 21 Oct 2016 11:50:24 +0900

Hello,

On Thu, 20 Oct 2016 15:45:34 +0000 Jim Kilborn wrote:

Good to know.

You may be able to squeeze some more 4K write IOPS out of this by cranking
the CPUs to full speed, see the relevant recent threads about this.

As for using the 120GB SSDs as journals (according to Samsung there is no
128GB SM863 model), keep in mind that in your current cluster they limit
you to about 1TBW/day if you want them to survive 5 years.
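
To make the arithmetic behind that figure explicit, here is a rough sketch.
It assumes the 120GB SM863 is rated for about 770 TBW (from memory, please
check the data sheet), two journal SSDs in each of the 4 hosts as in the
proposed design, replication 3, and that every replica write passes through
a journal in full. The function and parameter names are made up for
illustration.

```python
def client_writes_per_day_tb(ssd_count, rated_tbw_per_ssd, replication, years):
    """Client data (TB/day) the cluster can take before the journals wear out."""
    # Total write endurance across all journal SSDs in the cluster.
    total_journal_tbw = ssd_count * rated_tbw_per_ssd
    # Each client write is journaled once per replica.
    client_tbw = total_journal_tbw / replication
    return client_tbw / (years * 365)

# Assumed values: 8 SSDs, 770 TBW each, size=3, 5 year lifetime.
budget = client_writes_per_day_tb(ssd_count=8, rated_tbw_per_ssd=770.0,
                                  replication=3, years=5)
print(f"{budget:.2f} TB/day")
```

With only one journal SSD per host (4 in total), the same assumptions give
half that budget.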

Christian
> The chart obviously didn’t go well. Here it is again
> 
> 
> 
> fio --direct=1 --sync=1 --rw={write,randwrite,read,randread} --bs={4M,4K} --numjobs=1 --iodepth=1 --runtime=60 --size=5G --time_based --group_reporting --name=journal-test
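
The braces in the command above read as shorthand for eight separate runs
(four access patterns times two block sizes), since fio takes one value per
option. A minimal sketch that spells the runs out, using only the flags
quoted above; the helper name is hypothetical and the script only prints the
command lines, it does not execute fio:

```python
from itertools import product

RW_MODES = ["write", "randwrite", "read", "randread"]
BLOCK_SIZES = ["4M", "4K"]

def fio_commands():
    """Return one fio command line per (block size, access pattern) pair."""
    common = ("--direct=1 --sync=1 --numjobs=1 --iodepth=1 --runtime=60 "
              "--size=5G --time_based --group_reporting --name=journal-test")
    return [f"fio {common} --rw={rw} --bs={bs}"
            for bs, rw in product(BLOCK_SIZES, RW_MODES)]

for cmd in fio_commands():
    print(cmd)
```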
> 
> 
> 
> FIO Test        Local disk             SAN/NFS                Ceph size=3/SSD journal
> 4M Writes       53 MB/sec    12 IOPS   62 MB/sec    15 IOPS   151 MB/sec   37 IOPS
> 4M Rand Writes  34 MB/sec     8 IOPS   63 MB/sec    15 IOPS   155 MB/sec   37 IOPS
> 4M Read         66 MB/sec    15 IOPS   102 MB/sec   25 IOPS   662 MB/sec  161 IOPS
> 4M Rand Read    73 MB/sec    17 IOPS   103 MB/sec   25 IOPS   670 MB/sec  163 IOPS
> 4K Writes       2.9 MB/sec  738 IOPS   3.8 MB/sec  952 IOPS   2.3 MB/sec  571 IOPS
> 4K Rand Writes  551 KB/sec  134 IOPS   3.6 MB/sec  911 IOPS   2.0 MB/sec  501 IOPS
> 4K Read         28 MB/sec  7001 IOPS   8 MB/sec   1945 IOPS   13 MB/sec  3256 IOPS
> 4K Rand Read    263 KB/sec             5 MB/sec   1246 IOPS   8 MB/sec   2015 IOPS
> 
> 
> 
> 
> 
> 
> From: Jim Kilborn<mailto:jim@xxxxxxxxxxxx>
> Sent: Thursday, October 20, 2016 10:20 AM
> To: Christian Balzer<mailto:chibi@xxxxxxx>; ceph-users@xxxxxxxxxxxxxx<mailto:ceph-users@xxxxxxxxxxxxxx>
> Subject: Re:  New cephfs cluster performance issues- Jewel - cache pressure, capability release, poor iostat await avg queue size
> 
> 
> 
> Thanks Christian for the additional information and comments.
> 
> 
> 
> - Upgraded the kernels, but still had poor performance.
> - Removed all the pools and recreated them with just a replication of 3, with two pools for data and metadata. No cache tier pool.
> - Turned write caching back on with hdparm. We do have a large UPS and dual power supplies in the ceph unit. If we get a long power outage, everything will go down anyway.
> 
> 
> 
> I am no longer seeing the issue of the slow requests, ops blocked, etc.
> 
> 
> 
> I think I will push for the following design per ceph server
> 
> 
> 
> 8 x 4TB SATA drives
> 
> 2 x Samsung 128GB SM863 SSDs, each holding 4 OSD journals
> 
> 
> 
> With 4 hosts, and a replication of 3 to start with
> 
> 
> 
> I did a quick test with 4 - 4TB spinners and 1 Samsung 128GB SM863 SSD holding the  4 osd journals, with 4 hosts in the cluster over infiniband.
> 
> 
> 
> At the 4M read, watching iftop, the client is receiving between 4.5 Gb/sec and 5.5 Gb/sec over infiniband,
> which is around 600MB/sec and translates well to the FIO number.
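
The conversion behind that estimate, as a small sketch. It assumes the iftop
readings are in gigabits per second (which the roughly 600MB/sec figure
implies) and uses decimal units; the function name is hypothetical.

```python
def gbit_to_mbyte_per_sec(gbit_per_sec):
    """Convert a line rate in gigabits/s to megabytes/s (decimal units)."""
    return gbit_per_sec * 1000 / 8

low = gbit_to_mbyte_per_sec(4.5)
high = gbit_to_mbyte_per_sec(5.5)
print(low, high)
```

The 662 MB/sec and 670 MB/sec that fio reports for the 4M reads fall inside
that range.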
> 
> 
> 
> fio --direct=1 --sync=1 --rw={write,randwrite,read,randread} --bs={4M,4K} --numjobs=1 --iodepth=1 --runtime=60 --size=5G --time_based --group_reporting --name=journal-test
> 
> 
> 
> FIO Test        Local disk             SAN/NFS                Ceph w/Repl/SSD journal
> 4M Writes       53 MB/sec    12 IOPS   62 MB/sec    15 IOPS   151 MB/sec   37 IOPS
> 4M Rand Writes  34 MB/sec     8 IOPS   63 MB/sec    15 IOPS   155 MB/sec   37 IOPS
> 4M Read         66 MB/sec    15 IOPS   102 MB/sec   25 IOPS   662 MB/sec  161 IOPS
> 4M Rand Read    73 MB/sec    17 IOPS   103 MB/sec   25 IOPS   670 MB/sec  163 IOPS
> 4K Writes       2.9 MB/sec  738 IOPS   3.8 MB/sec  952 IOPS   2.3 MB/sec  571 IOPS
> 4K Rand Writes  551 KB/sec  134 IOPS   3.6 MB/sec  911 IOPS   2.0 MB/sec  501 IOPS
> 4K Read         28 MB/sec  7001 IOPS   8 MB/sec   1945 IOPS   13 MB/sec  3256 IOPS
> 4K Rand Read    263 KB/sec             5 MB/sec   1246 IOPS   8 MB/sec   2015 IOPS
> 
> 
> 
> 
> That performance is fine for our needs
> 
> Again, thanks for the help guys.
> 
> 
> 
> Regards,
> 
> Jim
> 
> 
> 
> From: Christian Balzer<mailto:chibi@xxxxxxx>
> Sent: Wednesday, October 19, 2016 7:54 PM
> To: ceph-users@xxxxxxxxxxxxxx<mailto:ceph-users@xxxxxxxxxxxxxx>
> Cc: Jim Kilborn<mailto:jim@xxxxxxxxxxxx>
> Subject: Re:  New cephfs cluster performance issues- Jewel - cache pressure, capability release, poor iostat await avg queue size
> 
> 
> 
> Hello,
> 
> On Wed, 19 Oct 2016 12:28:28 +0000 Jim Kilborn wrote:
> 
> > I have setup a new linux cluster to allow migration from our old SAN based cluster to a new cluster with ceph.
> > All systems running centos 7.2 with the 3.10.0-327.36.1 kernel.
> As others mentioned, not a good choice, but also not the (main) cause of
> your problems.
> 
> > I am basically running stock ceph settings, with just turning the write cache off via hdparm on the drives, and temporarily turning off scrubbing.
> >
> The former is bound to kill performance. If you care that much for your
> data but can't guarantee constant power (UPS, dual PSUs, etc), consider
> using a BBU caching controller.
> 
> The latter I venture you did because performance was abysmal with scrubbing
> enabled, which is always a good indicator that your cluster needs tuning
> and improving.
> 
> > The 4 ceph servers are all Dell 730XD with 128GB memory, and dual xeon. So Server performance should be good.
> Memory is fine. CPU I can't tell from the model number and I'm not
> inclined to look up or guess, but that usually only becomes a bottleneck
> when dealing with all-SSD setups and things requiring the lowest latency
> possible.
> 
> 
> > Since I am running cephfs, I have tiering setup.
> That should read "on top of EC pools", and as John said, not a good idea
> at all, both EC pools and cache-tiering.
> 
> > Each server has 4 – 4TB drives for the erasure code pool, with K=3 and M=1. So the idea is to survive a single host failure.
> > Each server also has a 1TB Seagate 850 Pro SSD for the cache drive, in a replicated set with size=2
> 
> This isn't a Seagate, you mean Samsung. And that's a consumer model,
> ill-suited for this task, even with the DC-level SSDs below as journals.
> 
> As such, a replication of 2 is also ill-advised. I've seen these SSDs
> die w/o ANY warning whatsoever and long before their (abysmal) endurance
> was exhausted.
> 
> > The cache tier also has a 128GB SM863 SSD that is being used as a journal for the cache SSD. It has power loss protection
> 
> Those are fine. If you redo your cluster, don't put more than 4-5 journals
> on them.
> 
> > My crush map is setup to ensure the cache pool uses only the 4 850 pro and the erasure code uses only the 16 spinning 4TB drives.
> >
> > The problem I am seeing is that when I start copying data from our old san to the ceph volume, once the cache tier gets to my target_max_bytes of 1.4 TB, I start seeing:
> >
> > HEALTH_WARN 63 requests are blocked > 32 sec; 1 osds have slow requests; noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
> > 26 ops are blocked > 65.536 sec on osd.0
> > 37 ops are blocked > 32.768 sec on osd.0
> > 1 osds have slow requests
> > noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
> >
> > osd.0 is the cache ssd
> >
> > If I watch iostat on the cache ssd, I see that the queue lengths and the await times are high.
> > Below is the iostat on the cache drive (osd.0) on the first host. The avgqu-sz is between 87 and 182 and the await is between 88ms and 1193ms
> >
> > Device:   rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> > sdb
> >                   0.00     0.33    9.00   84.33     0.96    20.11   462.40    75.92  397.56  125.67  426.58  10.70  99.90
> >                   0.00     0.67   30.00   87.33     5.96    21.03   471.20    67.86  910.95   87.00 1193.99   8.27  97.07
> >                   0.00    16.67   33.00  289.33     4.21    18.80   146.20    29.83   88.99   93.91   88.43   3.10  99.83
> >                   0.00     7.33    7.67  261.67     1.92    19.63   163.81   117.42  331.97  182.04  336.36   3.71 100.00
> >
> >
> > If I look at the iostat for all the drives, only the cache ssd drive is backed up
> >
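
For the four sdb samples quoted above, the per-column ranges can be pulled
out with a few lines of Python. This is only a sketch: it assumes the wrapped
value rows line up with the header positionally (13 values, 13 numeric
columns), and the names are made up for illustration.

```python
IOSTAT_COLUMNS = ("rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz "
                  "await r_await w_await svctm %util").split()

# The four sdb samples from the quoted iostat output.
SAMPLES = """
0.00     0.33    9.00   84.33     0.96    20.11   462.40    75.92  397.56  125.67  426.58  10.70  99.90
0.00     0.67   30.00   87.33     5.96    21.03   471.20    67.86  910.95   87.00 1193.99   8.27  97.07
0.00    16.67   33.00  289.33     4.21    18.80   146.20    29.83   88.99   93.91   88.43   3.10  99.83
0.00     7.33    7.67  261.67     1.92    19.63   163.81   117.42  331.97  182.04  336.36   3.71 100.00
"""

def column_range(name):
    """Return (min, max) of one iostat column across the samples."""
    idx = IOSTAT_COLUMNS.index(name)
    values = [float(line.split()[idx]) for line in SAMPLES.strip().splitlines()]
    return min(values), max(values)

for col in ("avgqu-sz", "await", "r_await", "w_await"):
    print(col, column_range(col))
```

In these four samples the 87 to 182 range matches r_await rather than
avgqu-sz, and 1193 ms is the worst w_await. The figures given in the text may
of course come from a longer observation than the rows shown.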
> Yes, consumer SSDs on top of a design that channels everything through
> them.
> 
> Rebuild your cluster along more conventional and conservative lines, don't
> use the 850 PROs.
> Feel free to run any new design by us.
> 
> Christian
> --
> Christian Balzer        Network/Systems Engineer
> chibi@xxxxxxx    Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/