Re-added ML.

On Fri, 9 May 2014 14:50:54 +0000 Carlos M. Perez wrote:

> Christian,
>
> Thanks for the responses. See below for a few
> responses/comments/further questions...
>
> > -----Original Message-----
> > From: Christian Balzer [mailto:chibi at gol.com]
> > Sent: Friday, May 9, 2014 1:35 AM
> > To: ceph-users at lists.ceph.com
> > Cc: Carlos M. Perez
> > Subject: Re: Suggestions on new cluster
> >
> > Hello,
> >
> > On Thu, 8 May 2014 15:14:59 +0000 Carlos M. Perez wrote:
> >
> > > Hi,
> > >
> > > We're just getting started with ceph, and we're going to pull the
> > > order to get the needed hardware ordered. Wondering if anyone would
> > > offer any insight/suggestions on the following setup of 3 nodes:
> > >
> > > Dual 6-core Xeon (L5639/L5640/X5650 depending on what we find)
> > > 96GB RAM
> > > LSI 2008 based controller in IT mode
> > > 4 x 1TB Nearline SAS 7.2k rpm or
> >                                 ^^
> > I suppose that should be an "and" up there.
>
> The drives we'll be using are SAS interface (as opposed to SATA) for
> the OSDs, such as the Seagate Constellation ES.2 (ST1000NM0023), which
> spin at 7.2k RPM.
>
It doesn't matter, a single drive is a single drive pretty much when it
comes to spinning rust, aka HDDs.
While I pretty much despise WD, their Velociraptor drives tend to
outperform SAS drives that are far more expensive, FWIW.

> > So a total of 2TB usable space for the whole cluster is good enough
> > for you?
>
> Our current setup is a pair of nodes with local SATA (2x250GB) on an
> ICH10R using md mirror. Total storage (for 30 containers and 2 KVMs)
> is under 100GB at the moment. Why only 2TB? There are 12TB of disk
> (using raw capacity across the 3 nodes). Would that not yield 4TB with
> 2 copies? Or is there an extra local copy per node?
>
Sorry, 4TB indeed. I was going through some other calculations when I
replied. ^.^;

> > > 2 x SSD (1 80GB+ for OS, 1 240GB+ for Journal, probably Intel
> > > S3500/S3700 or Samsung 840 Pro, but maybe we skip with Firefly and
> > > use 10k disks)
> >
> > What makes you think that Firefly can do better than previous
> > versions without SSD journals?
>
> So far from only very light reading. Was planning to test performance
> without them first, and see if that made a significant enough of a
> difference to justify the cost of some SSD drives.
>
That's fine, you will find that your HDDs are at best half as fast
(throughput and IOPS) as they would be standalone.
If that is acceptable, great.

> > A more balanced approach would be to use 2 identical SSDs and put 2
> > journals on each. That way your journal won't be slower than your
> > actual storage disks when it comes to sequential IO.
> > OS could easily go on the same SSDs as a software (MD) RAID1 in a
> > different partition.
>
> Interesting approach. So create an md on a small 80GB partition, and
> then use the rest of the disks (raw, no md) for 2 journals each. Will
> consider that.
>
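That's the idea. Purely as a sketch, device names and sizes made up,
adjust to your hardware:

  # SSDs assumed to be /dev/sda and /dev/sdb, OSD disks /dev/sdc-/dev/sdf
  sgdisk -n 1:0:+80G -t 1:fd00 /dev/sda   # OS partition, md RAID1 member
  sgdisk -n 2:0:+10G -t 2:8300 /dev/sda   # journal for the first OSD
  sgdisk -n 3:0:+10G -t 3:8300 /dev/sda   # journal for the second OSD
  # same layout on /dev/sdb for the other two journals
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1

Then point each OSD at its journal partition when creating it, for
example with ceph-deploy:

  ceph-deploy osd create node1:sdc:/dev/sda2

Treat that as an illustration only, not a recipe.
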
> > > Infiniband QDR for Cluster/Storage traffic
> >
> > Dual port Infiniband HCAs and 2 switches? Otherwise you'll have a
> > SPOF.
>
> We currently have two switches, with redundant power. Would need to
> config ceph to use both networks.
>
(A sketch of the relevant ceph.conf bit is further down.)

> > > Dual Gig for network traffic
> > >
> > > This is pretty close to the recommended setup, and a lot more RAM
> > > than the test setups we found. We're going to be running proxmox
> > > containers on this setup, and the machines are low disk util, so
> > > extreme disk performance is not needed, which is why we're using
> > > 96GB.
> > >
> > Proxmox...
> > With the kernel from ages long past.
> > If you're using KVM, that kernel lacks a LOT of improvements to KVM
> > that were implemented in newer ones.
> > If you're using OpenVZ (no Proxmox expert here, just toyed with it
> > once for 2 weeks) I suppose that would need RBD devices from
> > kernelspace, where again the old 2.6.32 kernel might be less than
> > optimal.
>
> Our load is so light in comparison to what others are possibly doing
> that I don't think this is an issue. Proxmox has been extremely stable
> for us, and we have not had any major show-stopping issues in the 4
> years we've been running it. We only run about 30 containers between
> the two nodes, and a pair of KVMs. Other than that it is very quiet.
>
I guess you have to test this for yourself. In a use case of mine the
difference between 3.2 and 3.13 was massive when it came to particular
KVM performance patterns.
As for kernel space RBDs for containers on a kernel that old, I will
defer to the real experts aka developers.

> > I have no idea if Proxmox allows for CPU pinning, but if you share
> > your storage nodes with VMs you really want that. Pin the VMs to
> > CPUs/cores that are not busy handling IRQs; leave, in your case, at
> > least about 4 cores free for the system and OSD tasks.
>
> CPU usage is very much on the low end for the machines we run. Our
> average CPU usage is at 2%, and peak over the last 30 days is at 15%
> during a maintenance reboot. Our current nodes have dual quad-core
> L5520s, and we'll be changing/migrating to dual hex-core, so we'll
> have a 50% increase in cores per node, and we'll be more than doubling
> our cores by going to 3 nodes from 2. Our current RAM is 48GB and
> usage never exceeds 8GB. These machines are very over-provisioned.
> I could probably go down to 72GB, since the boards have 18 slots.
>
If it is that light, fine. But keeping CPU cores free for the OS and
friends is a good approach no matter how you look at it.
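If Proxmox itself doesn't expose pinning, it can be done by hand; a
crude sketch, core numbers and the PID are placeholders:

  # keep cores 0-3 for the OS, IRQ handling and the OSDs,
  # pin a running KVM instance to the remaining cores
  taskset -cp 4-11 <PID of the kvm process>

Keep an eye on /proc/interrupts while you're at it, so the IB and disk
controller IRQs actually land on the cores you reserved.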
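
Also, coming back to your two IB networks further up, the public/cluster
split in ceph.conf looks roughly like this (subnets made up):

  [global]
      public network  = 10.0.1.0/24
      cluster network = 10.0.2.0/24

Redundancy across the two switches would then come from bonding/IPoIB on
the host side rather than from Ceph itself.
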
> > > I saw the performance tests run, and the only thing that "scares"
> > > me is btrfs using a huge chunk of CPU. Will probably use ext4.
> > > Wondering if this was usage on a single core, or per core per
> > > instance of the test...
> > >
> > BTRFS has some great features.
> > It also is very resource intensive and has abysmal performance in
> > many use cases, often getting worse quickly, especially with Ceph.
> >
> > Ext4 or XFS will serve you better in your case, no doubt.
>
> I was not aware of performance degradation over time. I will need to
> look into this a bit more closely.
>
I give you:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-February/007824.html
http://www.spinics.net/lists/ceph-devel/msg07249.html

> From the test scenarios I saw on the ceph blog
> (http://ceph.com/community/ceph-performance-part-1-disk-controller-write-throughput/),
> it shows that btrfs had the best performance across the board in JBOD
> mode, and xfs/ext4 were pretty close. Our config is going to be a
> SAS2008 JBOD setup for the controller. Looking a bit more closely, XFS
> does appear to be fairly close to ext4, and if it is the recommended
> file system, I'm not seeing enough of a performance gap to warrant
> using ext4 over the recoverability of XFS in case of failure.
>
I give you:
https://www.mail-archive.com/ceph-users at lists.ceph.com/msg08619.html
Note that nobody ever replied to that.

Either way, XFS or ext4 will do better than BTRFS for now.

Christian

> > Regards,
> >
> > Christian
> >
> > > Thanks in advance
> > >
> > > Carlos M. Perez
> > > CMP Consulting Services
> > > 305-669-1515
> > >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi at gol.com         Global OnLine Japan/Fusion Communications
> > http://www.gol.com/
>
> Thanks again,
>
> Carlos M. Perez
> CMP Consulting Services
>

--
Christian Balzer        Network/Systems Engineer
chibi at gol.com         Global OnLine Japan/Fusion Communications
http://www.gol.com/