Suggestions on new cluster

Sorry about not sending to the list.  Most of my other lists default to replying to the list, not the individual...

Thanks for the links below.  Equipment should be here next week, so we'll get to do some testing...

Carlos M. Perez
CMP Consulting Services
305-669-1515

> -----Original Message-----
> From: Christian Balzer [mailto:chibi at gol.com]
> Sent: Friday, May 9, 2014 11:27 AM
> To: ceph-users at ceph.com
> Cc: Carlos M. Perez
> Subject: Re: Suggestions on new cluster
> 
> 
> Re-added ML.
> 
> On Fri, 9 May 2014 14:50:54 +0000 Carlos M. Perez wrote:
> 
> > Christian,
> >
> > Thanks for the responses.  See below for a few
> > reposnses/comments/further questions...
> >
> > > -----Original Message-----
> > > From: Christian Balzer [mailto:chibi at gol.com]
> > > Sent: Friday, May 9, 2014 1:35 AM
> > > To: ceph-users at lists.ceph.com
> > > Cc: Carlos M. Perez
> > > Subject: Re: Suggestions on new cluster
> > >
> > >
> > > Hello,
> > >
> > > On Thu, 8 May 2014 15:14:59 +0000 Carlos M. Perez wrote:
> > >
> > > > Hi,
> > > >
> > > > We're just getting started with ceph, and we're about to place the
> > > > order for the needed hardware.  Wondering if anyone would offer any
> > > > insight/suggestions on the following setup of 3 nodes:
> > > >
> > > > Dual 6-core Xeon (L5639/L5640/X5650 depending on what we find)
> > > > 96GB RAM
> > > > LSI 2008 based controller in IT mode
> > > > 4 x 1TB Nearline SAS 7.2k rpm or
> > >                                 ^^
> > > I suppose that should be an "and" up there.
> >
> > The drives we'll be using are SAS interface (as opposed to SATA) for
> > the OSDs, such as the Seagate Constellation ES.2 (ST1000NM0023),
> > which spin at 7.2k RPM.
> >
> It doesn't matter; a single drive is pretty much a single drive when it comes
> to spinning rust, aka HDDs.
> 
> While I pretty much despise WD, their Velociraptor drives tend to
> outperform SAS drives that are far more expensive, FWIW.
> 
> > > So a total of 2TB usable space for the whole cluster is good enough
> > > for you?
> >
> > Our current setup is a pair of nodes with local SATA (2x250GB) on an
> > ICH10R using an md mirror.  Total storage (for 30 containers and 2 KVMs)
> > is under 100GB at the moment.  Why only 2TB?  There are 12TB of raw disk
> > across the 3 nodes.  Wouldn't that yield 4TB with 2 copies?  Or is there
> > an extra local copy per node?
> >
> Sorry, 4TB indeed.
> I was going through some other calculations when I replied. ^.^;
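> For the record, the rough arithmetic (assuming plain replicated pools, and if
> I remember right Firefly defaults to 3 replicas):
> 
>   raw    = 3 nodes x 4 x 1TB = 12TB
>   usable = raw / replica count
>          = 12TB / 3 replicas = 4TB
>          = 12TB / 2 replicas = 6TB
> 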
> 
> > > > 2 x SSD (1 80GB+ for OS, 1 240GB+ for journals, probably Intel
> > > > S3500/S3700 or Samsung 840 Pro, but maybe we skip the SSDs with
> > > > Firefly and use 10k disks)
> > > What makes you think that Firefly can do better than previous
> > > versions without SSD journals?
> >
> > So far that's from only very light reading.  Was planning to test
> > performance without SSDs first, and see if they make a significant
> > enough difference to justify the cost of some SSD drives.
> >
> That's fine; you will find that your HDDs are at best half as fast (throughput
> and IOPS) as they would be standalone.
> If that is acceptable, great.
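> (The reason is simply that with the journal on the same spindle every write
> hits the disk twice, once for the journal and once for the filestore, plus
> the extra seeks.  Purely as an illustration: a 7.2k drive that does roughly
> 150MB/s sequential on its own ends up closer to 75MB/s through Ceph.)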
> 
> > >
> > > A more balanced approach would be to use 2 identical SSDs and put 2
> > > journals on each. That way your journal won't be slower than your
> > > actual storage disks when it comes to sequential IO.
> > > OS could easily go on the same SSDs as a software (MD) RAID1 in a
> > > different partition.
> >
> > Interesting approach.  So create an md on a small 80GB partition, and
> > then use the rest of the disks (raw, no md) for 2 journals each.  Will
> > consider that.
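> > Something like this, roughly (device names are just placeholders, and I'd
> > double-check the ceph-deploy syntax before using it):
> >
> >   # small partition on each SSD, mirrored for the OS
> >   mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
> >   # remaining SSD partitions stay raw, one journal per OSD, no md
> >   ceph-deploy osd prepare node1:sdc:/dev/sda2 node1:sdd:/dev/sda3
> >   ceph-deploy osd prepare node1:sde:/dev/sdb2 node1:sdf:/dev/sdb3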
> > >
> > > > Infiniband QDR for Cluster/Storage traffic
> > > Dual port Infiniband HCAs and 2 switches? Otherwise you'll have a SPOF.
> >
> > We currently have two switches, with redundant power.  We'd need to
> > configure ceph to use both networks.
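> > Presumably something along these lines in ceph.conf, with the subnets made
> > up for illustration:
> >
> >   [global]
> >   public network  = 192.168.10.0/24   ; client/VM traffic
> >   cluster network = 192.168.20.0/24   ; replication/recovery over IPoIB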
> >
> > >
> > > > Dual Gig for network traffic
> > > >
> > > > This is pretty close to the recommended setup, and a lot more RAM
> > > > than the test setups we found.  We're going to be running proxmox
> > > > containers on this setup, and the machines are low disk util, so
> > > > extreme disk performance is not needed, which is why we're
> > > > using 96GB.
> > > >
> > > Proxmox... With the kernel from ages long past.
> > > If you're using KVM, that kernel lacks a LOT of improvements to KVM
> > > that were implemented in newer ones.
> > > If you're using OpenVZ (no Proxmox expert here, just toyed with it
> > > once for 2 weeks) I suppose that would need RBD devices from
> > > kernelspace, where again the old 2.6.32 kernel might be less than
> > > optimal.
> >
> > Our load is so light in comparison to what others are possibly doing,
> > that I don't think this is an issue.  Proxmox has been extremely
> > stable for us, and have not had any major show stopping issues in the
> > 4 years we've been running with it.  We only run about 30 containers
> > between the two nodes, and a pair of KVMs.  Other than that it is very
> > quiet.
> >
> I guess you have to test this for yourself.
> In a use case of mine the difference between 3.2 and 3.13 was massive when
> it came to particular KVM performance patterns.
> 
> As for kernel space RBDs for containers on a kernel that old, I will defer to the
> real experts aka developers.
> 
> > >
> > > I have no idea if Proxmox allows for CPU pinning, but if you share
> > > your storage nodes with VMs you really want that. Pin the VMs to
> > > CPUs/cores that are not busy handling IRQs; in your case leave at
> > > least about 4 cores free for the system and OSD tasks.
> > >
> >
> > CPU usage is very much on the low end for the machines we run.  Our
> > average CPU usage is at 2%, and peak over the last 30 days is at 15%
> > during a maintenance reboot.  Our current nodes have dual quad L5520s,
> > and we'll be changing/migrating to dual Hex-core, so we'll have a 50%
> > increase in cores per node, and we'll be more than doubling our cores
> > by going to 3 nodes from 2.  Our current RAM is 48GB and usage never
> > exceeds 8GB.  These machines are very over provisioned.  I could
> > probably go down to 72GB, since the boards have 18 slots.
> >
> If it is that light, fine.
> But keeping CPU cores free for the OS and friends is a good approach no
> matter how you look at it.
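> As a general sketch only (core numbers are arbitrary, and I don't know what
> Proxmox itself exposes for this): keep the first few cores for the OS, IRQs
> and OSDs, and pin the guests to the rest, e.g.
> 
>   # pin a running KVM guest (qemu PID 12345, made up) to cores 4-11
>   taskset -pc 4-11 12345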
> 
> > > > I saw the performance tests run, and the only thing that "scares"
> > > > me is btrfs using a huge chunk of CPU.  Will probably use ext4.
> > > > Wondering if this was usage on a single core, or per core per
> > > > instance of the test...
> > > >
> > > BTRFS has some great features.
> > > It also is very resource intensive and has abysmal performance in
> > > many use cases, often getting worse quickly, especially with Ceph.
> > >
> > > Ext4 or XFS will serve you better in your case, no doubt.
> >
> > I was not aware of performance degradation over time.  I will need to
> > look into this a bit more closely.
> I give you:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-February/007824.html
> http://www.spinics.net/lists/ceph-devel/msg07249.html
> 
> > From the test scenarios I saw on the ceph blog
> > (http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/),
> > it shows that btrfs had the best performance across the board in JBOD
> > mode, and xfs/ext4 were pretty close.  Our config is going to be a
> > SAS2008 JBOD setup for the controller.  Looking a bit more closely,
> > xfs does appear to be fairly close to ext4, and since it's the
> > recommended file system, I'm not seeing enough of a performance
> > difference to warrant using ext4 over the recoverability of xfs in
> > case of failure.
> >
> I give you:
> https://www.mail-archive.com/ceph-users at lists.ceph.com/msg08619.html
> 
> Note that nobody ever replied to that.
> 
> Either way, XFS or ext4 will do better than BTRFS for now.
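> If you do go with XFS, the usual knobs look something like this (illustrative
> only, check the docs for what the current defaults are):
> 
>   [osd]
>   osd mkfs type = xfs
>   osd mkfs options xfs = -f -i size=2048
>   osd mount options xfs = rw,noatime,inode64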
> 
> Christian
> 
> > >
> > > Regards,
> > >
> > > Christian
> > >
> > > > Thanks in advance
> > > >
> > > > Carlos M. Perez
> > > > CMP Consulting Services
> > > > 305-669-1515
> > > >
> > >
> > >
> > > --
> > > Christian Balzer        Network/Systems Engineer
> > > chibi at gol.com   	Global OnLine Japan/Fusion Communications
> > > http://www.gol.com/
> >
> > Thanks again,
> >
> > Carlos M. Perez
> > CMP Consulting Services
> >
> 
> 
> --
> Christian Balzer        Network/Systems Engineer
> chibi at gol.com   	Global OnLine Japan/Fusion Communications
> http://www.gol.com/

