Re: OSD GHz vs. Cores Question

I've been meaning to write an email about the experience we had at the company where I work. For lack of a more complete write-up, I'll just share some of the findings. Please note these are my experiences and are correct for my environment. The clients run on OpenStack, and all servers run Ubuntu Trusty. Tests were done with Hammer (0.94.2).

TLDR: if performance is your objective, buy 1S boxes with a high clock frequency, good journal SSDs, and not many SSDs per box. Also change the CPU governor to performance mode instead of the default ondemand. And don't forget 10Gig is a must. Replicated pools are also a must for performance.

We wanted a small cluster (30TB raw). Performance was important (IOPS and latency), and the network was designed to be 10G copper with BGP-attached hosts. There was complete leeway in design and some in budget.

Starting with the network: that design required us to create only a single network, but both links are usable. iperf between boxes is usually around 17-19 Gbit/s.

We could choose the nodes, and we evaluated dual-CPU and single-CPU nodes. The dual-CPU nodes would have 24 2.5" drive bays in a 2U chassis, whereas the single-CPU nodes had 8 2.5" drive bays in a 1U chassis. Long story short, we chose the single CPU (E3-1241 v3). On the CPU, all the tests we did with the scaling governors showed that "performance" would give us a 30-50% boost in IOPS. Latency also improved, but not by much. The downside was that each system's power usage increased by 5W (!?).
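For anyone who wants to try the governor change, here is a minimal sketch of how it could be scripted. It assumes the standard Linux cpufreq sysfs files; the function names are made up for illustration, and this is not the exact tooling we used.

```python
import glob

# Standard Linux cpufreq sysfs location. It is a parameter below so the
# functions can be pointed at a scratch directory for a dry run.
CPUFREQ_GLOB = "/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor"


def read_governors(pattern=CPUFREQ_GLOB):
    """Return {path: governor} for every CPU that exposes cpufreq."""
    governors = {}
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            governors[path] = f.read().strip()
    return governors


def set_governor(governor="performance", pattern=CPUFREQ_GLOB):
    """Write `governor` to every matching file; return how many were written.

    On a real host this needs root.
    """
    paths = sorted(glob.glob(pattern))
    for path in paths:
        with open(path, "w") as f:
            f.write(governor)
    return len(paths)
```

Calling read_governors() before and after set_governor() is a cheap way to confirm the change took effect on every core. Values written through sysfs do not survive a reboot, so a persistent setup needs something that reapplies them at boot.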

For the difference in price (£80) we bought the boxes with 32G of RAM.

As for the disks, since we wanted fast IO we had to go with SSDs. Given our budget we went with 4x Samsung 850 PRO + 1x Intel S3710 200G. We also tested the P3600, but one of the critical IO clients had far worse performance with it. From benchmarking, the write performance matches that of the Intel SSD. We tested an Intel SSD for the journal + a different Intel SSD for data, and performance was, within margin of error, the same as with an Intel SSD for the journal + a Samsung SSD for data. Single-SSD performance was slightly lower with either one (around 10%).

From what I've seen: on very large sequential reads and writes I can get up to 700-800 MBps. On random IO (8k; random writes, reads, or mixed workloads) we still haven't finished all the tests, but so far the results indicate that the SSDs are the bottleneck on writes and Ceph latency is the bottleneck on reads. However, we've been able to extract 400 MBps of read IO with 4 clients, each running 32 threads. I don't have the numbers here, but that represents around 50k IOPS out of a smallish cluster.
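The 50k figure is just throughput divided by IO size. A quick sketch of the arithmetic follows; the helper name is made up, and whether "8k" means 8000 or 8192 bytes is an assumption either way, so both are shown.

```python
MB = 10**6  # decimal megabytes, an assumption about how "MBps" was measured


def iops_from_throughput(bytes_per_second, io_size_bytes):
    """Approximate IOPS implied by a sustained throughput at a fixed IO size."""
    return bytes_per_second / io_size_bytes


# 400 MBps of 8k reads, with 8k taken as 8000 bytes and as 8192 bytes.
iops_decimal = iops_from_throughput(400 * MB, 8000)  # 50000.0
iops_binary = iops_from_throughput(400 * MB, 8192)   # 48828.125

# Spread evenly over 4 clients x 32 threads, that is roughly 390 IOPS per thread.
per_thread = iops_decimal / (4 * 32)                  # 390.625
```

Either reading of "8k" lands at around 50k IOPS, which is consistent with the number quoted above.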

Stuff we still have to do revolves around jemalloc vs tcmalloc; Trusty has the bug with the thread cache bytes variable. We also still have to test various tunables, like threads, caches, etc.

Hope this helps.


On Sat, Aug 22, 2015 at 4:45 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
Another thing that is probably worth considering is the practical side. A
lot of the Xeon E5 boards tend to have more SAS/SATA ports and onboard
10GbE, which can make quite a difference to the overall cost of the
solution if you would otherwise need to buy extra PCI-E cards.

Unless I've missed one, I've not spotted a Xeon-D board with a large amount
of onboard sata/sas ports. Please let me know if such a system exists as I
would be very interested.

We settled on the Hadoop version of the Supermicro Fat Twin: 12 x 3.5" disks
+ 2 x 2.5" SSDs per U, onboard 10GBase-T, and the fact that the nodes share
chassis and PSUs keeps the price down. For bulk storage one of these with a
single 8-core low-clocked E5 Xeon is ideal in my mind. I did a spreadsheet
working out U space, power and cost per GB for several different types of
server, and this solution came out ahead in nearly every category.

If there is a requirement for a high-perf SSD tier I would probably look at
dedicated SSD nodes, as I doubt you could cram enough CPU power into a single
server to drive 12 SSDs.

You mentioned low latency was a key requirement. Is this always going to be
at low queue depths? If you just need very low latency but won't actually be
driving the SSDs very hard, you will probably find a very highly clocked E3
is the best bet, with 2-4 SSDs per node. However, if you drive the SSDs
hard, a single one can easily max out several cores.

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Mark Nelson
> Sent: 22 August 2015 00:00
> To: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: OSD GHz vs. Cores Question
>
> FWIW, we recently were looking at a couple of different options for the
> machines in our test lab that run the nightly QA suite jobs via teuthology.
>
> From a cost/benefit perspective, I think it really comes down to something
> like a XEON E3-12XXv3 or the new XEON D-1540, each of which have
> advantages/disadvantages.
>
> We were very tempted by the Xeon D but it was still just a little too new for
> us so we ended up going with servers using more standard E3 processors.
> The Xeon D setup was slightly cheaper, offers more theoretical performance,
> and is way lower power, but at a much slower per-core clock speed.  It's likely
> that for our functional tests that clock speed may be more important than
> the cores (but on these machines we'll only have 4 OSDs per server).
>
> Anyway, I suspect that either setup will probably work fairly well for
> spinners.  SSDs get trickier.
>
> Mark
>
> On 08/21/2015 05:46 PM, Robert LeBlanc wrote:
> > -----BEGIN PGP SIGNED MESSAGE-----
> > Hash: SHA256
> >
> > We are looking to purchase our next round of Ceph hardware and based
> > off the work by Nick Fisk [1] our previous thought of cores over clock
> > is being revisited.
> >
> > I have two camps of thoughts and would like to get some feedback, even
> > if it is only theoretical. We currently have 12 disks per node (2
> > SSD/10 4TB spindle), but we may adjust that to 4/8. SSD would be used
> > for journals and cache tier (when [2] and fstrim are resolved). We
> > also want to stay with a single processor for cost, power and NUMA
> > considerations.
> >
> > 1. For 12 disks with three threads each (2 client and 1 background),
> > lots of slower cores would allow I/O (ceph code) to be scheduled as
> > soon as a core is available.
> >
> > 2. Faster cores would get through the Ceph code faster but there would
> > be fewer cores and so some I/O may have to wait to be scheduled.
> >
> > I'm leaning towards #2 for these reasons, please expose anything I may
> > be missing:
> > * The latency will only really be improved in the SSD I/O with faster
> > clock speed, all writes and any reads from the cache tier. So 8 fast
> > cores might be sufficient, reading from spindle and flushing the
> > journal will have a "substantial" amount of sleep to allow other Ceph
> > I/O to be hyperthreaded.
> > * Even though SSDs are much faster than spindles they are still orders
> > of magnitude slower than the processor, so it is still possible to get
> > more lines of code executed between SSD I/O with a faster processor
> > even with fewer cores.
> > * As the Ceph code is improved through optimization and less code has
> > to be executed for each I/O, faster clock speeds will only provide
> > even more benefit (lower latency, less waiting for cores) as the delay
> > shifts more from CPU to disk.
> >
> > Since our workload is typically small I/O 12K-18K, latency means a lot
> > to our performance.
> >
> > Our current processors are Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz
> >
> > [1] http://www.spinics.net/lists/ceph-users/msg19305.html
> > [2] http://article.gmane.org/gmane.comp.file-systems.ceph.user/22713
> >
> > Thanks,
> > - ----------------
> > Robert LeBlanc
> > PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> > -----BEGIN PGP SIGNATURE-----
> > Version: Mailvelope v1.0.0
> > Comment: https://www.mailvelope.com
> >
> > wsFcBAEBCAAQBQJV16pfCRDmVDuy+mK58QAA9cgP/RwsZESriIMWZHeC0PmS
> > CH8iEFCXCRCzvW+lYMwB9FOvPmBLlhayp39Z93Djv3sef02t3Z9NFPq7fUmb
> > ZwZ9SnH9oVmRElbQyNtt8MfJ2cqXRU6JtYsTHnZ5G0+sFvv+BY+mYD89nULw
> > xwbsosUCBA9Rp8geq++XLSbuEBt8AfreYaSBzY1kg51Ovtmb97R0hB7bQBWP
> > oUgi/ET24w4sUqLSo4WBNBZ0WeWsRA4w5PEzHk28ynBY0B/GAtiGadtZWOFX
> > 6bNz3KjMbLEWU9UF+7WyL+ppru6RIUZeayFp3tdIzqQdMbeBDPO54miOezwv
> > 9iFNuzxj2P6jqlp18W2SZYN2JF5qCgrG5mXlU2bOM9k4IlQAqG2V3iD/rSF8
> > LmL/FSzU6C4k8PffaNis/grZAtjN4tCLRAoWUmsXSRW1NpSNm13l6wJfg5xq
> > XGLQ4CfGMV/o3a1Oz1M7jfMLWb0b6TeYlqC8eeHUp9ipa8IaVKsGNDJYQOnM
> > LvyRuyB7yIM6dEXmJjE5ZQPwbh0se3+hUhNolQ949aKrY2u8Q2kHhKqOyzuw
> > EAAyHkeqBtAZFW+DActHYVCi9lJO8shmeWuVKxAuzKYJGYzD8yVIS+AVqZ2k
> > OH2/NNAXzBKefsL1gd8DT4QuYqDoEN2arO+PN0vZeEruQ4vg6qZvabqeB/4o
> > kUd4
> > =F5Sx
> > -----END PGP SIGNATURE-----
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >




_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
