Hi Greg,

I'm only talking about journal disks, not storage. :)

Regards,
Quenten

-----Original Message-----
From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Gregory Farnum
Sent: Tuesday, 22 May 2012 10:30 AM
To: Quenten Grasso
Cc: ceph-devel@xxxxxxxxxxxxxxx
Subject: Re: Designing a cluster guide

On Mon, May 21, 2012 at 4:52 PM, Quenten Grasso <QGrasso@xxxxxxxxxx> wrote:
> Hi All,
>
> I've been thinking about this issue myself for the past few days, and an idea I've come up with is running 16 x 2.5" 15K 72/146GB disks
> in RAID 10 inside a 2U server, with JBODs attached to the server for the actual storage.
>
> Can someone help clarify this one:
>
> Once the data is written to the (journal disk), then read from the (journal disk) and written to the (storage disk), and once that is complete, is it considered a successful write by the client?
> Or
> Once the data is written to the (journal disk), is it considered successful by the client?

This one — the write is considered "safe" once it is on-disk on all OSDs currently responsible for hosting the object.

Every time anybody mentions RAID10 I have to remind them of the storage amplification that entails, though. Are you sure you want that on top of (well, underneath, really) Ceph's own replication?

> Or
> Once the data is written to the (journal disk) and written to the (storage disk) at the same time, and once that is complete, is it considered a successful write by the client? (If this is the case SSDs may not be so useful.)
>
> Pros
> Quite fast write throughput to the journal disks
> No write wear-out of SSDs
> RAID 10 with a 1GB cache controller also helps improve things (if really keen you could use CacheCade as well)
>
> Cons
> Not as fast as SSDs
> More rackspace required per server
>
> Regards,
> Quenten
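To make the ack/"safe" distinction above concrete, here is a minimal sketch using the librados Python bindings (python-rados) as they looked around this time; the conffile path, pool name "data", and object name are placeholders, so treat it as an illustration rather than a recipe. aio_write() takes two callbacks: oncomplete fires once the OSDs have acknowledged the write, and onsafe fires once it is on stable storage on every OSD responsible for the object, which is the point Greg describes as a successful write.

    # Minimal sketch: watching when a librados write becomes "safe".
    # Assumes python-rados is installed, /etc/ceph/ceph.conf exists, and a
    # pool named "data" is present (the pool and object names are made up).
    import threading
    import rados

    safe = threading.Event()

    def on_complete(completion):
        # Acknowledged: the OSDs have the write, but not necessarily on disk.
        print("write acked")

    def on_safe(completion):
        # On stable storage on all OSDs responsible for the object -- the
        # point at which Ceph reports the write as safe to the client.
        print("write safe on all replicas")
        safe.set()

    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    ioctx = cluster.open_ioctx("data")
    ioctx.aio_write("journal-test", b"hello ceph", offset=0,
                    oncomplete=on_complete, onsafe=on_safe)
    safe.wait()
    ioctx.close()
    cluster.shutdown()

On the amplification point: RAID10 already halves usable capacity, and with two Ceph replicas on top of it roughly 4 TB of raw disk sits behind every 1 TB of client data.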
> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Slawomir Skowron
> Sent: Tuesday, 22 May 2012 7:22 AM
> To: ceph-devel@xxxxxxxxxxxxxxx
> Cc: Tomasz Paszkowski
> Subject: Re: Designing a cluster guide
>
> Maybe a good choice for the journal would be two cheap MLC Intel drives on SandForce
> (320/520), 120GB or 240GB, with the HPA shrunk to 20-30GB used only for
> separate journaling partitions in hardware RAID1.
>
> I'd like to test a setup like this, but maybe someone has some real-life info?
>
> On Mon, May 21, 2012 at 5:07 PM, Tomasz Paszkowski <ss7pro@xxxxxxxxx> wrote:
>> Another great thing that should be mentioned is
>> https://github.com/facebook/flashcache/. It gives really huge
>> performance improvements for reads/writes (especially on FusionIO
>> drives) even without using librbd caching :-)
>>
>> On Sat, May 19, 2012 at 6:15 PM, Alexandre DERUMIER <aderumier@xxxxxxxxx> wrote:
>>> Hi,
>>>
>>> For your journal, if you have money, you can use the
>>> STEC ZeusRAM SSD drive (around 2000€ / 8GB / 100,000 IOPS read/write with 4k blocks).
>>> I'm using them with a ZFS SAN; they rock for journals.
>>> http://www.stec-inc.com/product/zeusram.php
>>>
>>> Another interesting product is the DDRdrive:
>>> http://www.ddrdrive.com/
>>>
>>> ----- Original Message -----
>>>
>>> From: "Stefan Priebe" <s.priebe@xxxxxxxxxxxx>
>>> To: "Gregory Farnum" <greg@xxxxxxxxxxx>
>>> Cc: ceph-devel@xxxxxxxxxxxxxxx
>>> Sent: Saturday, 19 May 2012 10:37:01
>>> Subject: Re: Designing a cluster guide
>>>
>>> Hi Greg,
>>>
>>> On 17.05.2012 23:27, Gregory Farnum wrote:
>>>>> It mentions for example "Fast CPU" for the MDS system. What does fast
>>>>> mean? Just the speed of one core? Or is Ceph designed to use multiple cores?
>>>>> Which is more important, more cores or more speed per core?
>>>> Right now, it's primarily the speed of a single core. The MDS is
>>>> highly threaded but doing most things requires grabbing a big lock.
>>>> How fast is a qualitative rather than quantitative assessment at this
>>>> point, though.
>>> So would you recommend a fast (higher-GHz) Core i3 instead of a single
>>> Xeon for this system? (The price per GHz is better.)
>>>
>>>> It depends on what your nodes look like, and what sort of cluster
>>>> you're running. The monitors are pretty lightweight, but they will add
>>>> *some* load. More important is their disk access patterns — they have
>>>> to do a lot of syncs. So if they're sharing a machine with some other
>>>> daemon you want them to have an independent disk and to be running a
>>>> new kernel & glibc so that they can use syncfs rather than sync. (The
>>>> only distribution I know for sure does this is Ubuntu 12.04.)
>>> Which kernel and which glibc version support this? I have searched
>>> Google but haven't found an exact version. We're using Debian lenny/squeeze
>>> with a custom kernel.
>>>
>>>>> Regarding the OSDs, is it fine to use an SSD RAID 1 for the journal and
>>>>> perhaps 22x SATA disks in a RAID 10 for the FS, or is this quite absurd
>>>>> and you should go for 22x SSD disks in a RAID 6?
>>>> You'll need to do your own failure calculations on this one, I'm
>>>> afraid. Just take note that you'll presumably be limited to the speed
>>>> of your journaling device here.
>>> Yeah, that's why I wanted to use a RAID 1 of SSDs for the journaling. Or
>>> is this still too slow? Another idea was to use only a ramdisk for the
>>> journal, back up the files to disk while shutting down, and restore
>>> them after boot.
>>>
>>>> Given that Ceph is going to be doing its own replication, though, I
>>>> wouldn't want to add in another whole layer of replication with RAID10
>>>> — do you really want to multiply your storage requirements by another
>>>> factor of two?
>>> OK, correct, bad idea.
>>>
>>>>> Is it more useful to use a RAID 6 HW controller or btrfs RAID?
>>>> I would use the hardware controller over btrfs RAID for now; it allows
>>>> more flexibility in e.g. switching to xfs. :)
>>> OK, but overall you would recommend running one OSD per disk, right? So
>>> instead of using a RAID 6 with for example 10 disks you would run 6 OSDs
>>> on this machine?
>>>
>>>>> Use a single-socket Xeon for the OSDs or dual-socket?
>>>> Dual-socket servers will be overkill given the setup you're
>>>> describing. Our WAG rule of thumb is 1GHz of modern CPU per OSD
>>>> daemon. You might consider it if you decided you wanted to do an OSD
>>>> per disk instead (that's a more common configuration, but it requires
>>>> more CPU and RAM per disk and we don't know yet which is the better
>>>> choice).
>>> Is there also a rule of thumb for the memory?
>>>
>>> My biggest problem with Ceph right now is the awfully slow speed while
>>> doing random reads and writes.
>>>
>>> Sequential reads and writes are at 200 MB/s (that's pretty good for bonded
>>> dual Gbit/s), but random reads and writes are only at 0.8-1.5 MB/s,
>>> which is definitely too slow.
>>>
>>> Stefan
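For the kernel/glibc question above: syncfs() was added in Linux 2.6.39 and the glibc wrapper appeared in glibc 2.14, which is why Ubuntu 12.04 (3.2 kernel, glibc 2.15) picks it up. A rough way to check what a given machine offers is a ctypes probe along the lines of the sketch below; the hard-coded syscall number 306 is x86_64-only and /tmp is just an example mount point.

    # Rough probe for syncfs() support on the local machine.
    # Assumptions: Linux on x86_64 (syscall 306 is arch-specific) and that
    # /tmp lives on a mounted filesystem. Illustrative only.
    import ctypes
    import ctypes.util
    import os

    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    fd = os.open("/tmp", os.O_RDONLY)

    if hasattr(libc, "syncfs"):
        # glibc >= 2.14 exposes a syncfs() wrapper.
        ret = libc.syncfs(fd)
        print("glibc syncfs() returned %d" % ret)
    else:
        # Older glibc: try the raw syscall (present in kernels >= 2.6.39).
        SYS_syncfs = 306  # x86_64 syscall number
        ret = libc.syscall(SYS_syncfs, fd)
        print("raw syncfs syscall returned %d" % ret)

    os.close(fd)

If neither path works, the daemons have to fall back to sync(), which flushes every mounted filesystem; that is why Greg suggests giving a co-located monitor its own disk.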
>>> --
>>> Alexandre Derumier
>>> Systems Engineer
>>> Phone: 03 20 68 88 90
>>> Fax: 03 20 68 90 81
>>> 45 Bvd du Général Leclerc 59100 Roubaix - France
>>> 12 rue Marivaux 75002 Paris - France
>>
>> --
>> Tomasz Paszkowski
>> SS7, Asterisk, SAN, Datacenter, Cloud Computing
>> +48500166299
>
> --
> -----
> Regards,
>
> Sławek "sZiBis" Skowron