Re: Ceph random read IOPS


 



On Mon, 26 Jun 2017 15:06:46 +0100 Nick Fisk wrote:

> > -----Original Message-----
> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> > Willem Jan Withagen
> > Sent: 26 June 2017 14:35
> > To: Christian Wuerdig <christian.wuerdig@xxxxxxxxx>
> > Cc: Ceph Users <ceph-users@xxxxxxxxxxxxxx>
> > Subject: Re:  Ceph random read IOPS
> > 
> > On 26-6-2017 09:01, Christian Wuerdig wrote:  
> > > Well, preferring faster clock CPUs for SSD scenarios has been floated
> > > several times over the last few months on this list. And realistic or
> > > not, Nick's and Kostas' setups are similar enough (testing a single
> > > disk) that it's a distinct possibility.
> > > Anyway, as mentioned, measuring the performance counters would probably
> > > provide more insight.
> > 
> > I read the advice as:
> > 	prefer GHz over cores.
> > 
> > And especially since there is a sort of balance between either GHz or
> > cores, that can be an expensive one. Getting both means you have to pay
> > substantially more money.
> > 
> > And for an average Ceph server with plenty of OSDs, I personally just
> > don't buy that. There you'd have to look at the total throughput of the
> > system, and latency is only one of many factors.
> > 
> > Let alone in a cluster with several hosts (and/or racks). There the
> > latency is dictated by the network. So a bad choice of network card or
> > switch will outdo any extra cycles that your CPU can burn.
> > 
> > I think that just testing 1 OSD is testing artifacts, and has very little
> > to do with running an actual Ceph cluster.
> > 
> > So if one would like to test this, the test setup should be something
> > like: 3 hosts with something like 3 disks per host, min_disk=2 and a nice
> > workload. Then turn the GHz-knob and see what happens with client latency
> > and throughput.
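
(For anyone who wants to actually turn that GHz-knob between runs: a minimal
sketch in Python, assuming the standard Linux cpufreq sysfs interface and
root; the frequency is given in kHz, and the exact behaviour depends on the
scaling driver in use.)

#!/usr/bin/env python3
"""Cap the maximum CPU frequency on all cores between benchmark runs.

Minimal sketch: assumes the standard Linux cpufreq sysfs interface
(/sys/devices/system/cpu/cpu*/cpufreq/) and root privileges.
Frequencies are given in kHz, e.g. 2100000 for 2.1 GHz."""
import glob
import sys

def cap_max_freq(khz: int) -> None:
    for path in glob.glob("/sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq"):
        with open(path, "w") as f:
            f.write(str(khz))

if __name__ == "__main__":
    cap_max_freq(int(sys.argv[1]))   # e.g.: ./cap_freq.py 2100000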
> 
> I did similar tests last summer: 5 nodes with 12x 7.2k disks each, connected
> via 10G, NVMe journals, 3x replica pool.
> 
> The first test was with C-states left on auto and frequency scaling leaving
> the cores at their lowest frequency of 900MHz. The cluster will quite happily
> do a couple of thousand IOs without generating enough CPU load to bring the 4
> cores out of their deep C-states or up to max frequency.
> 
> With a small amount of background IO going on, a QD=1 sequential 4kB write
> was done, with the following results:
> 
> write: io=115268KB, bw=1670.1KB/s, iops=417, runt= 68986msec
>     slat (usec): min=2, max=414, avg= 4.41, stdev= 3.81
>     clat (usec): min=966, max=27116, avg=2386.84, stdev=571.57
>      lat (usec): min=970, max=27120, avg=2391.25, stdev=571.69
>     clat percentiles (usec):
>      |  1.00th=[ 1480],  5.00th=[ 1688], 10.00th=[ 1912], 20.00th=[ 2128],
>      | 30.00th=[ 2192], 40.00th=[ 2288], 50.00th=[ 2352], 60.00th=[ 2448],
>      | 70.00th=[ 2576], 80.00th=[ 2704], 90.00th=[ 2832], 95.00th=[ 2960],
>      | 99.00th=[ 3312], 99.50th=[ 3536], 99.90th=[ 6112], 99.95th=[ 9536],
>      | 99.99th=[22400]
> 
> So just under 2.5ms write latency.
> 
> I don't have the results from adjusting C-states and frequency scaling
> separately, but adjusting either got me a boost. Forcing C1 and a max
> frequency of 3.6GHz got me:
> 
> write: io=105900KB, bw=5715.7KB/s, iops=1428, runt= 18528msec
>     slat (usec): min=2, max=106, avg= 3.50, stdev= 1.31
>     clat (usec): min=491, max=32099, avg=694.16, stdev=491.91
>      lat (usec): min=494, max=32102, avg=697.66, stdev=492.04
>     clat percentiles (usec):
>      |  1.00th=[  540],  5.00th=[  572], 10.00th=[  588], 20.00th=[  604],
>      | 30.00th=[  620], 40.00th=[  636], 50.00th=[  652], 60.00th=[  668],
>      | 70.00th=[  692], 80.00th=[  716], 90.00th=[  764], 95.00th=[  820],
>      | 99.00th=[ 1448], 99.50th=[ 2320], 99.90th=[ 7584], 99.95th=[11712],
>      | 99.99th=[24448]
> 
> Quite a bit faster. These are best-case figures though; if any substantial
> workload is run, the average tends to hover around 1ms latency.
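
(Side note: if you want to reproduce Nick's "force C1" run without touching
the BIOS or kernel boot parameters, one user-space way is the PM QoS
interface; a sketch, assuming /dev/cpu_dma_latency exists and you have root.)

#!/usr/bin/env python3
"""Keep the CPUs out of deep C-states for the duration of a benchmark run.

Sketch of the PM QoS approach: write a 0 microsecond latency target to
/dev/cpu_dma_latency and keep the file descriptor open. Needs root; the
limit is dropped automatically when the process exits."""
import os
import signal
import struct

fd = os.open("/dev/cpu_dma_latency", os.O_WRONLY)
os.write(fd, struct.pack("i", 0))    # 0 us of tolerated wakeup latency
print("C-state wakeup latency pinned to 0us; run fio now, Ctrl-C to release")
signal.pause()                       # block until interrupted, keeping fd open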
> 

And that's that.
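
As a quick sanity check on Nick's numbers: at QD=1 the client IOPS are simply
the reciprocal of the mean latency, and that matches what fio reports above.

# QD=1 means the client IOPS are simply 1 / mean latency; the numbers in
# Nick's fio output line up nicely.
for label, lat_usec in [("900MHz, C-states on auto", 2391.25),
                        ("3.6GHz, forced C1",         697.66)]:
    print(f"{label}: {1_000_000 / lat_usec:,.0f} IOPS")
# -> ~418 and ~1,433, matching fio's reported 417 and 1428 iops.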

If you care about latency and/or your "high" IOPS load is such that it
would still fit on a single core (real CPU usage of the OSD process less
than 100%), then fewer, faster cores are definitely the way to go.
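
A quick way to see which side of that line you are on is to watch the
per-process CPU usage of the OSD daemons; a rough sketch, assuming the psutil
module is installed and the daemons are named ceph-osd.

#!/usr/bin/env python3
"""Rough check of how close each OSD daemon is to saturating one core.
Sketch: assumes the psutil module is installed and the daemons are
named 'ceph-osd'; 100% here means one fully busy core."""
import time
import psutil

osds = [p for p in psutil.process_iter(["name"]) if p.info["name"] == "ceph-osd"]
for p in osds:
    p.cpu_percent(None)              # prime the per-process counters
time.sleep(10)                       # measure over a 10 second window
for p in osds:
    print(f"ceph-osd pid {p.pid}: {p.cpu_percent(None):.0f}% of one core")

Anything consistently well below 100% per OSD means a single fast core would
serve that OSD just fine.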

Unfortunately single-socket systems with current Intel offerings tend to
limit you as far as size and PCIe connectivity are concerned, realistically
not more than 6 SSDs. So if you want more storage devices (you need more
cores) or use NVMe (you need more PCIe lanes), then you're forced into
dual-CPU systems, paying for that pleasure both in price and in getting 2
NUMA nodes as well.
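
To see what you actually ended up with, NUMA-wise, something like this
(a sketch, assuming the standard Linux sysfs layout) will do:

#!/usr/bin/env python3
"""Show how many NUMA nodes a box ended up with and which cores sit where.
Sketch, assumes the standard Linux sysfs layout."""
import glob
import os

for node in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
    with open(os.path.join(node, "cpulist")) as f:
        print(f"{os.path.basename(node)}: cpus {f.read().strip()}")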

I predict that servers based on the new AMD Epyc CPUs will make absolutely
lovely OSD hosts: loads of I/O and PCIe lanes, plenty of fast cores, and a
full-speed interconnect between the dies (if you need more than 8 real
cores), so a single-socket system gives you basically everything in one
NUMA zone.


As for the OP, if you're still reading this thread that is: your
assumption that a device that can do 300K IOPS locally (reads, which is
also not something most people care too much about) will still be able to
do so after all the latencies and contention within Ceph discussed above
is of course deeply flawed.
There's also the little detail that journal writes are more akin to a
serial write, so the sequential write speed of the device is the more
critical figure.

You want to do your tests while running atop or the like, collecting very
fine-grained data on all the bits that count; then the bottleneck should be
quite obvious. That goes for all the OSD boxes and the client.
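
If you prefer to script that collection instead of eyeballing atop, a rough
psutil-based sketch like the following (busy_time is Linux-specific) already
shows the busiest core and the busiest disks:

#!/usr/bin/env python3
"""Tiny poor man's atop: busiest core and per-disk utilisation once a second.
Sketch only; assumes psutil, and busy_time is Linux-specific."""
import time
import psutil

prev = psutil.disk_io_counters(perdisk=True)
while True:
    time.sleep(1)
    cpus = psutil.cpu_percent(percpu=True)
    cur = psutil.disk_io_counters(perdisk=True)
    line = [f"busiest core {max(cpus):3.0f}%"]
    for disk, now in cur.items():
        if disk in prev:
            # busy_time is in ms; the delta over a ~1s window approximates %util
            util = (now.busy_time - prev[disk].busy_time) / 10.0
            if util > 5:             # only show disks doing real work
                line.append(f"{disk} {util:3.0f}%")
    print("  ".join(line))
    prev = cur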

As hinted above, most people tend to run into write IOPS and latency
limitations long before read ones, but that of course depends on your use
case and on things like the OSD servers having enough RAM to hold all the
SLAB bits (dentries etc.) and the really hot objects in the pagecache.
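
A quick look at how much RAM is currently going to the pagecache and to
reclaimable SLAB (where the dentry/inode caches live) is just a matter of
reading /proc/meminfo; a sketch:

#!/usr/bin/env python3
"""Quick look at how much RAM is in pagecache and (reclaimable) SLAB,
which is where the dentry/inode caches live. Linux /proc/meminfo only."""
with open("/proc/meminfo") as f:
    meminfo = {line.split(":")[0]: int(line.split()[1]) for line in f}

for key in ("MemTotal", "Cached", "Buffers", "SReclaimable", "SUnreclaim"):
    print(f"{key:>13}: {meminfo[key] / 1024:9.0f} MiB")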

Christian
> Nick
> 
> > 
> > --WjW
> >   
> > > On Sun, Jun 25, 2017 at 4:53 AM, Willem Jan Withagen <wjw@xxxxxxxxxxx> wrote:
> > >
> > >
> > >
> > >     On 24 Jun 2017 at 14:17, Maged Mokhtar <mmokhtar@xxxxxxxxxxx> wrote:
> > >  
> > >>     My understanding was this test is targeting latency more than
> > >>     IOPS. This is probably why it was run using QD=1. It also makes
> > >>     sense that CPU frequency will be more important than cores.
> > >>  
> > >
> > >     But then it is not generic enough to be used as advice!
> > >     It is just a line in 3D-space, as there are so many other variables.
> > >
> > >     --WjW
> > >  
> > >>     On 2017-06-24 12:52, Willem Jan Withagen wrote:
> > >>  
> > >>>     On 24-6-2017 05:30, Christian Wuerdig wrote:  
> > >>>>     The general advice floating around is that you want CPUs with
> > >>>>     high clock speeds rather than more cores to reduce latency and
> > >>>>     increase IOPS for SSD setups (see also
> > >>>>     http://www.sys-pro.co.uk/ceph-storage-fast-cpus-ssd-performance/).
> > >>>>     So something like an E5-2667V4 might bring better results in that
> > >>>>     situation.
> > >>>>     Also there was some talk about disabling the processor C-states
> > >>>>     in order to bring latency down (something like this should be
> > >>>>     easy to test: https://stackoverflow.com/a/22482722)
> > >>>
> > >>>     I would be very careful calling this general advice...
> > >>>
> > >>>     Although the article is interesting, it is rather one-sided.
> > >>>
> > >>>     The only thing it shows is that there is a linear relation between
> > >>>     clock speed and write or read speeds??? The article is rather vague
> > >>>     on how and what is actually tested.
> > >>>
> > >>>     By just running a single OSD with no replication, a lot of the
> > >>>     functionality is left out of the equation.
> > >>>     Nobody runs just 1 OSD on a box in a normal cluster host.
> > >>>
> > >>>     Not using a serious SSD is another source of noise in the
> > >>>     conclusion. A higher queue depth can/will certainly have an impact
> > >>>     on concurrency.
> > >>>
> > >>>     I would call this an observation, and nothing more.
> > >>>
> > >>>     --WjW  
> > >>>>
> > >>>>     On Sat, Jun 24, 2017 at 1:28 AM, Kostas Paraskevopoulos
> > >>>>     <reverend.x3@xxxxxxxxx> wrote:
> > >>>>
> > >>>>         Hello,
> > >>>>
> > >>>>         We are in the process of evaluating the performance of a
> > >>>>         testing cluster (3 nodes) with Ceph Jewel. Our setup consists of:
> > >>>>         3 monitors (VMs)
> > >>>>         2 physical servers each connected with 1 JBOD running Ubuntu
> > >>>>     Server
> > >>>>         16.04
> > >>>>
> > >>>>         Each server has 32 threads @2.1GHz and 128GB RAM.
> > >>>>         The disk distribution per server is:
> > >>>>         38 * HUS726020ALS210 (SAS rotational)
> > >>>>         2 * HUSMH8010BSS200 (SAS SSD for journals)
> > >>>>         2 * ST1920FM0043 (SAS SSD for data)
> > >>>>         1 * INTEL SSDPEDME012T4 (NVME measured with fio ~300K iops)
> > >>>>
> > >>>>         Since we don't currently have a 10Gbit switch, we test the
> > >>>>     performance
> > >>>>         with the cluster in a degraded state, the noout flag set and
> > >>>>     we mount
> > >>>>         rbd images on the powered on osd node. We confirmed that the
> > >>>>     network
> > >>>>         is not saturated during the tests.
> > >>>>
> > >>>>         We ran tests on the NVMe disk and the pool created on this
> > >>>>         disk, where we hoped to get the most performance without
> > >>>>         getting limited by the hardware specs, since we have more
> > >>>>         disks than CPU threads.
> > >>>>
> > >>>>         The NVMe disk was at first partitioned with one partition
> > >>>>         and the journal on the same disk. The performance on random
> > >>>>         4K reads topped out at 50K IOPS. We then removed the OSD and
> > >>>>         repartitioned with 4 data partitions and 4 journals on the
> > >>>>         same disk. The performance didn't increase significantly.
> > >>>>         Also, since we run read tests, the journals shouldn't cause
> > >>>>         performance issues.
> > >>>>
> > >>>>         We then ran 4 fio processes in parallel on the same mounted
> > >>>>         RBD image and the total IOPS reached 100K. More parallel fio
> > >>>>         processes didn't increase the measured IOPS.
> > >>>>
> > >>>>         Our ceph.conf is pretty basic (debug is set to 0/0 for
> > >>>>     everything) and
> > >>>>         the crushmap just defines the different buckets/rules for
> > >>>>     the disk
> > >>>>         separation (rotational, ssd, nvme) in order to create the
> > >>>>     required
> > >>>>         pools
> > >>>>
> > >>>>         Is a performance of 100,000 IOPS for random 4K reads normal
> > >>>>         for a disk that, on the same benchmark and the same hardware,
> > >>>>         does more than 300K IOPS locally, or are we missing something?
> > >>>>
> > >>>>         Best regards,
> > >>>>         Kostas
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Rakuten Communications
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


