On 26-6-2017 09:01, Christian Wuerdig wrote:
Well, preferring faster-clocked CPUs for SSD scenarios has been floated
several times over the last few months on this list. And realistic or
not, Nick's and Kostas' setups are similar enough (both testing a single
disk) that it's a distinct possibility.
Anyway, as mentioned measuring the performance counters would probably
provide more insight.
I read the advice as:
prefer GHz over cores.
And especially since there is a trade-off between GHz and cores, that
can be an expensive preference: getting both means you have to pay
substantially more money.
And for an average Ceph server with plenty of OSDs, I personally just don't
buy that. There you'd have to look at the total throughput of the
system, and latency is only one of many factors.
Let alone in a cluster with several hosts (and/or racks), where the
latency is dictated by the network. A bad choice of network card or
switch will outweigh any extra cycles your CPU can burn.
I think that testing just 1 OSD is measuring artifacts, and has very
little to do with running an actual Ceph cluster.
So if one would like to test this, the setup should be something like
3 hosts with 3 disks per host, min_disk=2 and a realistic workload.
Then turn the GHz knob and see what happens with client latency and
throughput.
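
For what it's worth, here is a rough (untested) sketch of how such a test
could be driven: cap the maximum clock on the OSD hosts with cpupower, run
rados bench from a client against an existing pool, and record the summary
throughput/latency for each frequency step. The host names, pool name and
frequency steps are placeholders for whatever your hardware supports.

#!/usr/bin/env python3
# Sketch (untested) of the "turn the GHz knob" experiment:
# cap the CPU frequency on each OSD host, run a client benchmark,
# and print throughput/latency per frequency step.
#
# Assumptions:
#   - 'cpupower frequency-set -u <freq>' is available on the OSD hosts
#     (invoked over ssh, needs root) to cap the maximum clock.
#   - 'rados bench' is the workload and a pool named 'testpool' exists.
import subprocess

OSD_HOSTS = ["host1", "host2", "host3"]      # placeholder host names
FREQ_STEPS = ["1.6GHz", "2.2GHz", "3.0GHz"]  # steps your CPUs support
POOL = "testpool"

def cap_frequency(freq):
    """Limit the maximum CPU clock on every OSD host."""
    for host in OSD_HOSTS:
        subprocess.run(
            ["ssh", host, "sudo", "cpupower", "frequency-set", "-u", freq],
            check=True,
        )

def run_bench(seconds=60, threads=16):
    """Run a write benchmark from the client and return its raw output."""
    result = subprocess.run(
        ["rados", "bench", "-p", POOL, str(seconds), "write",
         "-t", str(threads)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

for freq in FREQ_STEPS:
    cap_frequency(freq)
    output = run_bench()
    print(f"=== max frequency {freq} ===")
    # keep only the rados bench summary lines (bandwidth / IOPS / latency)
    for line in output.splitlines():
        if "Bandwidth" in line or "IOPS" in line or "Latency" in line:
            print(line)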
--WjW
In a high-concurrency/high queue depth situation, which is probably the most common workload, there is no question that adding more cores will increase IOPS almost linearly, provided you have enough disk and network bandwidth, i.e. your disk and network utilization is low and your CPU is near 100%. Adding more cores is also more economical for increasing IOPS than raising the frequency.
But adding more cores will not lower latency below the value you get from the QD=1 test. To achieve lower latency you need a faster CPU frequency. Yes, it is expensive, and as you said you also need lower-latency switches and so on, but you simply have to pay more to achieve this.
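
To put some made-up numbers on that distinction, here is a small
back-of-the-envelope sketch using Little's law; the 0.5 ms QD=1 service
time is just an assumed figure for illustration:

# Illustration (made-up numbers) of why extra cores raise IOPS at high
# queue depth but cannot beat the QD=1 latency.
#
# Simplified model: each request occupies one core for its full service
# time, so with enough outstanding requests (QD >= cores):
#     IOPS ~= cores / service_time        (Little's law)
# while the latency seen by any single request stays ~= service_time.

service_time = 0.0005   # assumed 0.5 ms per request on one core (QD=1)

print("QD=1  IOPS:", round(1 / service_time))   # ~2000, latency-bound

for cores in (4, 8, 16):
    iops = cores / service_time
    print(f"{cores:2d} cores, high-QD IOPS: ~{round(iops)}, "
          f"per-request latency still ~{service_time * 1000:.1f} ms")

# Only a shorter service time (faster clock, lower-latency network or
# media) lowers the per-request latency floor; more cores do not.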