Re: Ceph random read IOPS

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Willem Jan Withagen
> Sent: 26 June 2017 14:35
> To: Christian Wuerdig <christian.wuerdig@xxxxxxxxx>
> Cc: Ceph Users <ceph-users@xxxxxxxxxxxxxx>
> Subject: Re:  Ceph random read IOPS
> 
> On 26-6-2017 09:01, Christian Wuerdig wrote:
> > Well, preferring faster clock CPUs for SSD scenarios has been floated
> > several times over the last few months on this list. And realistic or
> > not, Nick's and Kostas' setup are similar enough (testing single disk)
> > that it's a distinct possibility.
> > Anyway, as mentioned measuring the performance counters would
> probably
> > provide more insight.
> 
> I read the advice as:
> 	prefer GHz over cores.
> 
> And especially since there is a sort of balance between GHz and cores,
> that can be an expensive trade-off. Getting both means you have to pay
> substantially more money.
> 
> And for an average Ceph server with plenty of OSDs, I personally just
> don't buy that. There you'd have to look at the total throughput of the
> system, and latency is only one of many factors.
> 
> Let alone in a cluster with several hosts (and/or racks). There the
> latency is dictated by the network. So a bad choice of network card or
> switch will outdo any extra cycles that your CPU can burn.
> 
> I think that just testing 1 OSD is testing artifacts, and has very little
> to do with running an actual Ceph cluster.
> 
> So if one would like to test this, the test setup should be something
> like: 3 hosts with 3 disks per host, min_size=2 and a realistic workload.
> Then turn the GHz knob and see what happens with client latency and
> throughput.

Did similar tests last summer: 5 nodes with 12x 7.2k disks each, connected
via 10G, NVMe journals, 3x replica pool.

The first test was with C-states left on auto and frequency scaling leaving
the cores at their lowest frequency of 900MHz. The cluster will quite
happily do a couple of thousand IOs without generating enough CPU load to
bring the 4 cores out of their deeper C-states or up to max frequency.
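
To see whether a cluster is sitting in the same situation, something along
these lines works (assuming the cpupower utility is installed; output and
sysfs layout vary by kernel/distro):

# Current clock of each core
grep "cpu MHz" /proc/cpuinfo

# Governor and allowed frequency range
cpupower frequency-info

# Per-core C-state residency while the benchmark is running
cpupower monitor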

With a small amount of background IO running, a QD=1 sequential 4kB write
was done, with the following results:

write: io=115268KB, bw=1670.1KB/s, iops=417, runt= 68986msec
    slat (usec): min=2, max=414, avg= 4.41, stdev= 3.81
    clat (usec): min=966, max=27116, avg=2386.84, stdev=571.57
     lat (usec): min=970, max=27120, avg=2391.25, stdev=571.69
    clat percentiles (usec):
     |  1.00th=[ 1480],  5.00th=[ 1688], 10.00th=[ 1912], 20.00th=[ 2128],
     | 30.00th=[ 2192], 40.00th=[ 2288], 50.00th=[ 2352], 60.00th=[ 2448],
     | 70.00th=[ 2576], 80.00th=[ 2704], 90.00th=[ 2832], 95.00th=[ 2960],
     | 99.00th=[ 3312], 99.50th=[ 3536], 99.90th=[ 6112], 99.95th=[ 9536],
     | 99.99th=[22400]

So just under 2.5ms write latency.
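
A QD=1 sequential 4kB write like this can be reproduced with a fio
invocation along these lines (just a sketch; it assumes fio built with the
rbd engine and a pool/image named rbd/test here, so substitute your own
pool, image and client name):

fio --name=qd1-seq-write --ioengine=rbd --clientname=admin --pool=rbd \
    --rbdname=test --rw=write --bs=4k --iodepth=1 --numjobs=1 \
    --time_based --runtime=60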

I don't have the results from adjusting the C-states and frequency scaling
separately, but adjusting either got me a boost. Forcing C1 and a max
frequency of 3.6GHz got me:

write: io=105900KB, bw=5715.7KB/s, iops=1428, runt= 18528msec
    slat (usec): min=2, max=106, avg= 3.50, stdev= 1.31
    clat (usec): min=491, max=32099, avg=694.16, stdev=491.91
     lat (usec): min=494, max=32102, avg=697.66, stdev=492.04
    clat percentiles (usec):
     |  1.00th=[  540],  5.00th=[  572], 10.00th=[  588], 20.00th=[  604],
     | 30.00th=[  620], 40.00th=[  636], 50.00th=[  652], 60.00th=[  668],
     | 70.00th=[  692], 80.00th=[  716], 90.00th=[  764], 95.00th=[  820],
     | 99.00th=[ 1448], 99.50th=[ 2320], 99.90th=[ 7584], 99.95th=[11712],
     | 99.99th=[24448]

Quite a bit faster. These are best-case figures, though; when any
substantial workload is running, the average tends to hover around 1ms
latency.
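
For anyone wanting to try the same tuning, forcing C1 and the maximum
frequency can be done roughly like this (a sketch only, assuming an Intel
box exposing the usual cpufreq/cpuidle sysfs interfaces; state numbering
and paths differ per CPU and kernel, so check before copying):

# Pin the frequency governor to performance on all cores
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$g"
done

# See which index corresponds to which C-state on this CPU
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name

# Disable everything deeper than C1 (here assumed to be states 2 and up)
for s in /sys/devices/system/cpu/cpu*/cpuidle/state[2-9]/disable; do
    echo 1 > "$s"
done

# Or do it at boot time with kernel parameters such as:
#   intel_idle.max_cstate=1 processor.max_cstate=1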

Nick

> 
> --WjW
> 
> > On Sun, Jun 25, 2017 at 4:53 AM, Willem Jan Withagen <wjw@xxxxxxxxxxx> wrote:
> >
> >
> >
> >     On 24 Jun 2017, at 14:17, Maged Mokhtar <mmokhtar@xxxxxxxxxxx> wrote:
> >
> >>     My understanding was that this test is targeting latency more than
> >>     IOPS. This is probably why it was run using QD=1. It also makes
> >>     sense that CPU frequency will be more important than cores.
> >>
> >
> >     But then it is not generic enough to be used as advice!
> >     It is just a line in 3D-space.
> >     As there are so many
> >
> >     --WjW
> >
> >>     On 2017-06-24 12:52, Willem Jan Withagen wrote:
> >>
> >>>     On 24-6-2017 05:30, Christian Wuerdig wrote:
> >>>>     The general advice floating around is that you want CPUs with
> >>>>     high clock speeds rather than more cores to reduce latency and
> >>>>     increase IOPS for SSD setups (see also
> >>>>     http://www.sys-pro.co.uk/ceph-storage-fast-cpus-ssd-performance/).
> >>>>     So something like an E5-2667V4 might bring better results in that
> >>>>     situation.
> >>>>     Also there was some talk about disabling the processor C-states
> >>>>     in order to bring latency down (something like this should be
> >>>>     easy to test: https://stackoverflow.com/a/22482722/220986)
> >>>
> >>>     I would be very careful to call this general advice...
> >>>
> >>>     Although the article is interesting, it is rather single sided.
> >>>
> >>>     The only thing it shows is that there is a linear relation between
> >>>     clock speed and write or read speeds???
> >>>     The article is rather vague on how and what is actually tested.
> >>>
> >>>     By just running a single OSD with no replication, a lot of the
> >>>     functionality is left out of the equation.
> >>>     Nobody is running just 1 OSD on a normal cluster host.
> >>>
> >>>     Not using a serious SSD is another source of noise in the
> >>>     conclusion.
> >>>     More queue depth can/will certainly have an impact on concurrency.
> >>>
> >>>     I would call this an observation, and nothing more.
> >>>
> >>>     --WjW
> >>>>
> >>>>     On Sat, Jun 24, 2017 at 1:28 AM, Kostas Paraskevopoulos
> >>>>     <reverend.x3@xxxxxxxxx> wrote:
> >>>>
> >>>>         Hello,
> >>>>
> >>>>         We are in the process of evaluating the performance of a
> >>>>         testing cluster (3 nodes) with Ceph Jewel. Our setup
> >>>>         consists of:
> >>>>         3 monitors (VMs)
> >>>>         2 physical servers, each connected to 1 JBOD, running
> >>>>         Ubuntu Server 16.04
> >>>>
> >>>>         Each server has 32 threads @2.1GHz and 128GB RAM.
> >>>>         The disk distribution per server is:
> >>>>         38 * HUS726020ALS210 (SAS rotational)
> >>>>         2 * HUSMH8010BSS200 (SAS SSD for journals)
> >>>>         2 * ST1920FM0043 (SAS SSD for data)
> >>>>         1 * INTEL SSDPEDME012T4 (NVME measured with fio ~300K iops)
> >>>>
> >>>>         Since we don't currently have a 10Gbit switch, we test the
> >>>>         performance with the cluster in a degraded state, the noout
> >>>>         flag set, and we mount RBD images on the powered-on OSD
> >>>>         node. We confirmed that the network is not saturated during
> >>>>         the tests.
> >>>>
> >>>>         We ran tests on the NVMe disk and the pool created on this
> >>>>         disk, where we hoped to get the most performance without
> >>>>         being limited by the hardware specs, since we have more
> >>>>         disks than CPU threads.
> >>>>
> >>>>         The NVMe disk was at first partitioned with one partition
> >>>>         and the journal on the same disk. The performance on random
> >>>>         4K reads topped out at 50K IOPS. We then removed the OSD
> >>>>         and repartitioned with 4 data partitions and 4 journals on
> >>>>         the same disk. The performance didn't increase
> >>>>         significantly. Also, since we run read tests, the journals
> >>>>         shouldn't cause performance issues.
> >>>>
> >>>>         We then ran 4 fio processes in parallel on the same mounted
> >>>>         RBD image and the total IOPS reached 100K. More parallel
> >>>>         fio processes didn't increase the measured IOPS.
> >>>>
> >>>>         Our ceph.conf is pretty basic (debug is set to 0/0 for
> >>>>         everything) and the crushmap just defines the different
> >>>>         buckets/rules for the disk separation (rotational, SSD,
> >>>>         NVMe) in order to create the required pools.
> >>>>
> >>>>         Is 100,000 IOPS for random 4K reads normal for a disk that,
> >>>>         on the same benchmark and the same hardware, does more than
> >>>>         300K IOPS, or are we missing something?
> >>>>
> >>>>         Best regards,
> >>>>         Kostas
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>
> >>
> >>
> >>
> >
> >
> 

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


