Re-added ML.

On Fri, 9 May 2014 14:50:54 +0000 Carlos M. Perez wrote:

> Christian,
>
> Thanks for the responses. See below for a few
> responses/comments/further questions...
>
> > -----Original Message-----
> > From: Christian Balzer [mailto:chibi at gol.com]
> > Sent: Friday, May 9, 2014 1:35 AM
> > To: ceph-users at lists.ceph.com
> > Cc: Carlos M. Perez
> > Subject: Re: Suggestions on new cluster
> >
> > Hello,
> >
> > On Thu, 8 May 2014 15:14:59 +0000 Carlos M. Perez wrote:
> >
> > > Hi,
> > >
> > > We're just getting started with ceph, and we're going to pull the
> > > order to get the needed hardware ordered. Wondering if anyone would
> > > offer any insight/suggestions on the following setup of 3 nodes:
> > >
> > > Dual 6-core Xeon (L5639/L5640/X5650 depending on what we find)
> > > 96GB RAM
> > > LSI 2008 based controller in IT mode
> > > 4 x 1TB Nearline SAS 7.2k rpm or
> >                                 ^^
> > I suppose that should be an "and" up there.
>
> The drives we'll be using are SAS interface (as opposed to SATA) for
> the OSDs, such as the Seagate Constellation ES.2 (ST1000NM0023), which
> spin at 7.2k RPM.
>
It doesn't matter, a single drive is a single drive pretty much when it
comes to spinning rust, aka HDDs.
While I pretty much despise WD, their Velociraptor drives tend to
outperform SAS drives that are far more expensive, FWIW.

> > So a total of 2TB usable space for the whole cluster is good enough
> > for you?
>
> Our current setup is a pair of nodes with local SATA (2x250GB) on an
> ICH10R using md mirror. Total storage (for 30 containers and 2 KVMs)
> is under 100GB at the moment. Why only 2TB? There are 12TB of disk
> (using raw capacity across the 3 nodes). Would that not yield 4TB with
> 2 copies? Or is there an extra local copy per node?
>
Sorry, 4TB indeed. I was going through some other calculations when I
replied. ^.^;

> > > 2 x SSD (1 80GB+ for OS, 1 240GB+ for Journal, probably Intel
> > > S3500/S3700 or Samsung 840 Pro, but maybe we skip with Firefly and
> > > use 10k disks)
> >
> > What makes you think that Firefly can do better than previous
> > versions without SSD journals?
>
> So far from only very light reading. Was planning to test performance
> without them first, and see if that made a significant enough of a
> difference to justify the cost of some SSD drives.
>
That's fine, you will find that your HDDs are at best half as fast
(throughput and IOPS) as they would be standalone.
If that is acceptable, great.

> > A more balanced approach would be to use 2 identical SSDs and put 2
> > journals on each. That way your journal won't be slower than your
> > actual storage disks when it comes to sequential IO.
> > OS could easily go on the same SSDs as a software (MD) RAID1 in a
> > different partition.
>
> Interesting approach. So create an md on a small 80GB partition, and
> then use the rest of the disks (raw, no md) for 2 journals each. Will
> consider that.
>
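That's the idea. Purely as a sketch, device names and sizes made up,
adjust to your hardware:

  # SSDs assumed to be /dev/sda and /dev/sdb, OSD disks /dev/sdc-/dev/sdf
  sgdisk -n 1:0:+80G -t 1:fd00 /dev/sda   # OS partition, md RAID1 member
  sgdisk -n 2:0:+10G -t 2:8300 /dev/sda   # journal for the first OSD
  sgdisk -n 3:0:+10G -t 3:8300 /dev/sda   # journal for the second OSD
  # same layout on /dev/sdb for the other two journals
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1

Then point each OSD at its journal partition when creating it, for
example with ceph-deploy:

  ceph-deploy osd create node1:sdc:/dev/sda2

Treat that as an illustration only, not a recipe.
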
> > > Infiniband QDR for Cluster/Storage traffic
> >
> > Dual port Infiniband HCAs and 2 switches? Otherwise you'll have a
> > SPOF.
>
> We currently have two switches, with redundant power. Would need to
> config ceph to use both networks.
>
(A sketch of the relevant ceph.conf bit is further down.)

> > > Dual Gig for network traffic
> > >
> > > This is pretty close to the recommended setup, and a lot more RAM
> > > than the test setups we found. We're going to be running proxmox
> > > containers on this setup, and the machines are low disk util, so
> > > extreme disk performance is not needed, which is why we're using
> > > 96GB.
> > >
> > Proxmox...
> > With the kernel from ages long past.
> > If you're using KVM, that kernel lacks a LOT of improvements to KVM
> > that were implemented in newer ones.
> > If you're using OpenVZ (no Proxmox expert here, just toyed with it
> > once for 2 weeks) I suppose that would need RBD devices from
> > kernelspace, where again the old 2.6.32 kernel might be less than
> > optimal.
>
> Our load is so light in comparison to what others are possibly doing
> that I don't think this is an issue. Proxmox has been extremely stable
> for us, and we have not had any major show-stopping issues in the 4
> years we've been running it. We only run about 30 containers between
> the two nodes, and a pair of KVMs. Other than that it is very quiet.
>
I guess you have to test this for yourself. In a use case of mine the
difference between 3.2 and 3.13 was massive when it came to particular
KVM performance patterns.
As for kernel space RBDs for containers on a kernel that old, I will
defer to the real experts aka developers.

> > I have no idea if Proxmox allows for CPU pinning, but if you share
> > your storage nodes with VMs you really want that. Pin the VMs to
> > CPUs/cores that are not busy handling IRQs; leave, in your case, at
> > least about 4 cores free for the system and OSD tasks.
>
> CPU usage is very much on the low end for the machines we run. Our
> average CPU usage is at 2%, and peak over the last 30 days is at 15%
> during a maintenance reboot. Our current nodes have dual quad-core
> L5520s, and we'll be changing/migrating to dual hex-core, so we'll
> have a 50% increase in cores per node, and we'll be more than doubling
> our cores by going to 3 nodes from 2. Our current RAM is 48GB and
> usage never exceeds 8GB. These machines are very over-provisioned.
> I could probably go down to 72GB, since the boards have 18 slots.
>
If it is that light, fine. But keeping CPU cores free for the OS and
friends is a good approach no matter how you look at it.
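If Proxmox itself doesn't expose pinning, it can be done by hand; a
crude sketch, core numbers and the PID are placeholders:

  # keep cores 0-3 for the OS, IRQ handling and the OSDs,
  # pin a running KVM instance to the remaining cores
  taskset -cp 4-11 <PID of the kvm process>

Keep an eye on /proc/interrupts while you're at it, so the IB and disk
controller IRQs actually land on the cores you reserved.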
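
Also, coming back to your two IB networks further up, the public/cluster
split in ceph.conf looks roughly like this (subnets made up):

  [global]
      public network  = 10.0.1.0/24
      cluster network = 10.0.2.0/24

Redundancy across the two switches would then come from bonding/IPoIB on
the host side rather than from Ceph itself.
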
> > > I saw the performance tests run, and the only thing that "scares"
> > > me is btrfs using a huge chunk of CPU. Will probably use ext4.
> > > Wondering if this was usage on a single core, or per core per
> > > instance of the test...
> > >
> > BTRFS has some great features.
> > It also is very resource intensive and has abysmal performance in
> > many use cases, often getting worse quickly, especially with Ceph.
> >
> > Ext4 or XFS will serve you better in your case, no doubt.
>
> I was not aware of performance degradation over time. I will need to
> look into this a bit more closely.
>
I give you:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-February/007824.html
http://www.spinics.net/lists/ceph-devel/msg07249.html

> From the test scenarios I saw on the ceph blog
> (http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/),
> it shows that btrfs had the best performance across the board in JBOD
> mode, and xfs/ext4 were pretty close. Our config is going to be a
> SAS2008 JBOD setup for the controller. Looking a bit more closely, XFS
> does appear to be fairly close to ext4, and if it is the recommended
> file system, I'm not seeing enough of a performance gap to warrant
> using ext4 over the recoverability of XFS in case of failure.
>
I give you:
https://www.mail-archive.com/ceph-users at lists.ceph.com/msg08619.html
Note that nobody ever replied to that.

Either way, XFS or ext4 will do better than BTRFS for now.

Christian

> > Regards,
> >
> > Christian
> >
> > > Thanks in advance
> > >
> > > Carlos M. Perez
> > > CMP Consulting Services
> > > 305-669-1515
> > >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi at gol.com         Global OnLine Japan/Fusion Communications
> > http://www.gol.com/
>
> Thanks again,
>
> Carlos M. Perez
> CMP Consulting Services
>

--
Christian Balzer        Network/Systems Engineer
chibi at gol.com         Global OnLine Japan/Fusion Communications
http://www.gol.com/