Re: OSD GHz vs. Cores Question


Thanks for all the responses. They have given me more to think about,
which is exactly what I was looking for.

We have MySQL running on this cluster, so we will have some VMs with
fairly low queue depths. Our Ops teams are not excited about
unplugging cables and pulling servers to replace fixed disks, so we
are looking at hot-swap options.

I'll try to do some testing in our lab, but I won't be able to get a
very good spread of data due to clock and core limitations in the
existing hardware.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Sat, Aug 22, 2015 at 2:42 PM, Luis Periquito  wrote:
> I've been meaning to write an email about the experience we had at the
> company I work for. In lieu of a more complete write-up, I'll just share
> some of the findings. Please note these are my experiences and they hold
> for my environment: the clients are running on OpenStack, all servers are
> trusty, and the tests were made with Hammer (0.94.2).
>
> TL;DR: if performance is your objective, buy single-socket boxes with
> high-frequency CPUs, good journal SSDs, and not too many SSDs per box. Also
> switch the CPU scaling governor to performance instead of the default
> ondemand. And don't forget that 10Gig networking is a must. Replicated
> pools are also a must for performance.
>
> We wanted a small cluster (30TB raw); performance was important (IOPS and
> latency), and the network was designed as 10G copper with BGP-attached
> hosts. There was complete leeway in the design and some in the budget.
>
> Starting with the network: the BGP-attached design meant we only had to
> create a single network, but both links are usable - iperf between boxes
> usually shows around 17-19 Gbit/s.
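To sanity-check that both links really are in play, a multi-stream iperf3 run
against a peer host is enough. A minimal Python sketch (not from Luis's post;
the peer hostname is a placeholder and it assumes an iperf3 server is already
listening on the other box):

    #!/usr/bin/env python3
    """Minimal sketch: aggregate throughput check between two OSD hosts.

    Assumes iperf3 is installed and 'iperf3 -s' is running on the peer;
    the hostname below is a placeholder.
    """
    import json
    import subprocess

    PEER = "osd-node-02"  # placeholder peer host running 'iperf3 -s'

    out = subprocess.run(
        ["iperf3", "-c", PEER, "-P", "8", "-t", "30", "-J"],  # 8 parallel streams, JSON output
        capture_output=True, text=True, check=True,
    ).stdout

    gbits = json.loads(out)["end"]["sum_received"]["bits_per_second"] / 1e9
    print(f"aggregate receive throughput: {gbits:.1f} Gbit/s")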
>
> We could choose the nodes, so we evaluated dual-CPU and single-CPU options.
> The dual-CPU nodes would have had 24 2.5'' drive bays in a 2U chassis,
> whereas the single-CPU nodes had 8 2.5'' drive bays in a 1U chassis. Long
> story short, we chose the single CPU (E3-1241 v3). On the CPU side, all the
> tests we did with the scaling governors showed that "performance" would
> give us a 30-50% boost in IOPS. Latency also improved, but not by much. The
> downside was that each system's power usage increased by 5W (!?).
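For anyone wanting to try the same governor change, it is just a write to
sysfs on each core. A minimal Python sketch (not from Luis's post), assuming
the standard Linux cpufreq layout and root privileges:

    #!/usr/bin/env python3
    """Minimal sketch: switch every core from 'ondemand' to the 'performance'
    scaling governor via the standard cpufreq sysfs files (run as root)."""
    import glob

    GOVERNOR = "performance"  # the default on these boxes was "ondemand"

    for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor"):
        with open(path, "w") as f:
            f.write(GOVERNOR)
        print(f"{path} -> {GOVERNOR}")

Note the change does not survive a reboot, so it also needs to be applied at
boot time on each node.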
>
> For the difference in price (£80) we bought the boxes with 32GB of RAM.
>
> As for the disks, since we wanted fast IO we had to go with SSDs. With the
> budget we had, we went with 4x Samsung 850 PRO + 1x Intel S3710 200G. We
> also tested the P3600, but one of the critical-IO clients had far worse
> performance with it. Benchmarking shows the write performance is that of
> the Intel SSD: one Intel SSD for journal + a different Intel SSD for data
> performed, within the margin of error, the same as an Intel SSD for journal
> + a Samsung SSD for data. Single-SSD performance was slightly lower with
> either one (around 10%).
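A common way to compare journal candidates like these is a small-block
O_DIRECT+O_SYNC write test, which mimics how the OSD journal writes. A
minimal Python sketch wrapping fio (not from Luis's post; the device path is
a placeholder, and the test overwrites it, so only point it at a scratch
device):

    #!/usr/bin/env python3
    """Minimal sketch: gauge journal-SSD suitability with a synchronous
    small-block write test (destructive - use a scratch device)."""
    import subprocess

    DEVICE = "/dev/sdX"  # placeholder - WILL BE OVERWRITTEN

    subprocess.run([
        "fio", "--name=journal-sync-write",
        f"--filename={DEVICE}",
        "--direct=1", "--sync=1",     # bypass page cache, sync each write like a journal
        "--rw=write", "--bs=4k",
        "--numjobs=1", "--iodepth=1",
        "--runtime=60", "--time_based",
        "--group_reporting",
    ], check=True)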
>
> From what I've seen: on very big sequential reads and writes I can get up
> to 700-800 MB/s. On random IO (8k random writes, reads, or mixed workloads)
> we still haven't finished all the tests, but so far they indicate the SSDs
> are the bottleneck on writes, and Ceph latency on reads. However, we've
> been able to extract 400 MB/s of read IO with 4 clients, each running 32
> threads. I don't have the numbers here, but that represents around 50k IOPS
> out of a smallish cluster.
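Those numbers can be approximated with the stock rados bench tool. A minimal
Python sketch (not from Luis's post; the pool name is a placeholder, and one
copy would run per client host to mirror the 4 clients x 32 threads setup):

    #!/usr/bin/env python3
    """Minimal sketch: 8 KiB write pass followed by a random-read pass with
    rados bench, 32 concurrent ops, against a placeholder pool."""
    import subprocess

    POOL = "bench"   # placeholder test pool
    SECONDS = "60"
    THREADS = "32"

    # Write 8 KiB objects and keep them around for the read pass.
    subprocess.run(["rados", "bench", "-p", POOL, SECONDS, "write",
                    "-t", THREADS, "-b", "8192", "--no-cleanup"], check=True)

    # Random reads against the objects written above.
    subprocess.run(["rados", "bench", "-p", POOL, SECONDS, "rand",
                    "-t", THREADS], check=True)

Bear in mind rados bench exercises the object layer only, so rbd-level
numbers (what the OpenStack clients actually see) will be somewhat lower.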
>
> Stuff we still have to do revolves around jemalloc vs. tcmalloc - trusty's
> tcmalloc has the bug with the thread cache bytes variable. We also still
> have to test the various tunables: threads, caches, etc.
>
> Hope this helps.
>
>
> On Sat, Aug 22, 2015 at 4:45 PM, Nick Fisk  wrote:
>>
>> Another thing that is probably worth considering is the practical side as
>> well. A lot of the Xeon E5 boards tend to have more SAS/SATA ports and
>> onboard 10GbE; this can make quite a difference to the overall cost of the
>> solution if you would otherwise need to buy extra PCI-E cards.
>>
>> Unless I've missed one, I've not spotted a Xeon D board with a large
>> number of onboard SATA/SAS ports. Please let me know if such a system
>> exists, as I would be very interested.
>>
>> We settled on the Hadoop version of the Supermicro Fat Twin: 12x 3.5"
>> disks + 2x 2.5" SSDs per U, onboard 10GBase-T, and the fact that the nodes
>> share chassis and PSUs keeps the price down. For bulk storage, one of
>> these with a single 8-core, low-clocked E5 Xeon is ideal in my mind. I did
>> a spreadsheet working out U space, power, and cost per GB for several
>> different types of server, and this solution came out ahead in nearly
>> every category.
>>
>> If there is a requirement for a high-performance SSD tier, I would
>> probably look at dedicated SSD nodes, as I doubt you could cram enough CPU
>> power into a single server to drive 12x SSDs.
>>
>> You mentioned low latency was a key requirement; is this always going to
>> be at low queue depths? If you just need very low latency but won't
>> actually be driving the SSDs very hard, you will probably find a very
>> highly clocked E3 with 2-4 SSDs per node is the best bet. However, if you
>> do drive the SSDs hard, a single one can easily max out several cores.
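A queue-depth-1 fio run against an RBD image is a quick way to see which
regime you are in, since at iodepth=1 the result is almost pure latency. A
minimal Python sketch (not from Nick's post; pool and image names are
placeholders, and it assumes fio was built with rbd support and an admin
cephx keyring is available):

    #!/usr/bin/env python3
    """Minimal sketch: queue-depth-1 random-write latency against an RBD image."""
    import subprocess

    POOL = "rbd"          # placeholder pool
    IMAGE = "bench-img"   # placeholder image, must already exist

    subprocess.run([
        "fio", "--name=qd1-randwrite",
        "--ioengine=rbd", "--clientname=admin",
        f"--pool={POOL}", f"--rbdname={IMAGE}",
        "--rw=randwrite", "--bs=16k",   # roughly the 12K-18K I/O size from the original question
        "--iodepth=1", "--numjobs=1",   # queue depth 1: measures latency, not throughput
        "--runtime=60", "--time_based",
        "--group_reporting",
    ], check=True)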
>>
>> > -----Original Message-----
>> > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
>> > Mark Nelson
>> > Sent: 22 August 2015 00:00
>> > To: ceph-users@xxxxxxxxxxxxxx
>> > Subject: Re:  OSD GHz vs. Cores Question
>> >
>> > FWIW, we were recently looking at a couple of different options for the
>> > machines in our test lab that run the nightly QA suite jobs via
>> > teuthology.
>> >
>> > From a cost/benefit perspective, I think it really comes down to
>> > something like a Xeon E3-12xx v3 or the new Xeon D-1540, each of which
>> > has advantages/disadvantages.
>> >
>> > We were very tempted by the Xeon D, but it was still just a little too
>> > new for us, so we ended up going with servers using more standard E3
>> > processors. The Xeon D setup was slightly cheaper, offers more
>> > theoretical performance, and draws far less power, but at a much slower
>> > per-core clock speed. It's likely that for our functional tests clock
>> > speed may be more important than core count (but on these machines we'll
>> > only have 4 OSDs per server).
>> >
>> > Anyway, I suspect that either setup will probably work fairly well for
>> > spinners. SSDs get trickier.
>> >
>> > Mark
>> >
>> > On 08/21/2015 05:46 PM, Robert LeBlanc wrote:
>> > >
>> > > We are looking to purchase our next round of Ceph hardware, and based
>> > > on the work by Nick Fisk [1], our previous preference for cores over
>> > > clock speed is being revisited.
>> > >
>> > > I have two camps of thought and would like to get some feedback, even
>> > > if it is only theoretical. We currently have 12 disks per node (2 SSDs
>> > > / 10 4TB spindles), but we may adjust that to 4/8. The SSDs would be
>> > > used for journals and a cache tier (when [2] and fstrim are resolved).
>> > > We also want to stay with a single processor for cost, power and NUMA
>> > > considerations.
>> > >
>> > > 1. For 12 disks with three threads each (2 client and 1 background -
>> > > see the sketch after this list for where those numbers come from),
>> > > lots of slower cores would allow I/O (Ceph code) to be scheduled as
>> > > soon as a core is available.
>> > >
>> > > 2. Faster cores would get through the Ceph code faster, but there
>> > > would be fewer of them, so some I/O may have to wait to be scheduled.
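The "2 client and 1 background" threads per disk appear to correspond to the
Hammer-era osd_op_threads and osd_disk_threads defaults. A minimal Python
sketch that reads them from a running OSD's admin socket (not from the
original mail; osd.0 is a placeholder id and the script must run on that
OSD's host):

    #!/usr/bin/env python3
    """Minimal sketch: read the per-OSD thread settings from the admin socket."""
    import json
    import subprocess

    OSD = "osd.0"  # placeholder OSD id local to this host

    for opt in ("osd_op_threads", "osd_disk_threads"):
        out = subprocess.run(["ceph", "daemon", OSD, "config", "get", opt],
                             capture_output=True, text=True, check=True).stdout
        print(opt, "=", json.loads(out)[opt])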
>> > >
>> > > I'm leaning towards #2 for these reasons; please point out anything I
>> > > may be missing:
>> > > * Faster clock speed will only really improve latency for the SSD I/O:
>> > > all writes, plus any reads from the cache tier. So 8 fast cores might
>> > > be sufficient; reads from spindles and journal flushes will have a
>> > > "substantial" amount of sleep, allowing other Ceph I/O to run on the
>> > > hyperthreads.
>> > > * Even though SSDs are much faster than spindles, they are still
>> > > orders of magnitude slower than the processor, so it is still possible
>> > > to get more lines of code executed between SSD I/Os with a faster
>> > > processor, even with fewer cores.
>> > > * As the Ceph code is optimized and less code has to be executed for
>> > > each I/O, faster clock speeds will provide even more benefit (lower
>> > > latency, less waiting for cores) as the delay shifts further from CPU
>> > > to disk.
>> > >
>> > > Since our workload is typically small I/O (12K-18K), latency matters
>> > > a lot to our performance.
>> > >
>> > > Our current processors are Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz
>> > >
>> > > [1] http://www.spinics.net/lists/ceph-users/msg19305.html
>> > > [2] http://article.gmane.org/gmane.comp.file-systems.ceph.user/22713
>> > >
>> > > Thanks,
>> > > ----------------
>> > > Robert LeBlanc
>> > > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



