On Mon, 8 Jun 2015 09:44:54 +0200 Jan Schermer wrote:

> I recently did some testing of a few SSDs and found some surprising, and
> some not so surprising things:
>
> 1) performance varies wildly with firmware, especially with cheaper drives
> 2) performance varies with time - even with S3700 - slows down after
> ~40-80GB and then creeps back up
> 3) cheaper drives have almost no gain from bigger queues (at least with
> fio iodepth=XX)
> 4) all drives I tested reach higher IOPS for direct and synchronous
> writes with iodepth=1 when write cache is *DISABLED* (even S3700)
> - I suspect compression and write coalescing are disabled
> - Intel S3700 can reach almost the same IOPS with higher queue depths,
> but that's sadly not a real scenario
> - in any case, disabling write cache doesn't help real workloads on my
> cluster
> 5) write amplification is worst for synchronous direct writes, so it's
> much better to collocate journal and data on the same SSD if you worry
> about the DWPD endurance rating
>
Or in his case _not_ have them on the low-endurance EVOs. ^_-

> and bottom line
> 6) Ceph doesn't utilize the SSD capabilities at all on my Dumpling - I
> hope it will be better on Giant :)
> 7) ext4 has much higher throughput for my benchmarks when not using a
> raw device
> 8) I lose ~50% IOPS on XFS compared to block device - ext4 loses ~10%
>
I can confirm that from my old (Emperor days) comparison between EXT4 and
XFS.

Christian

> btw the real average request size on my SSDs is only about 32KiB
> (journal+data on the same device)
>
> Jan
>
> > On 08 Jun 2015, at 09:33, Christian Balzer <chibi@xxxxxxx> wrote:
> >
> > Hello,
> >
> > On Mon, 8 Jun 2015 18:01:28 +1200 Cameron.Scrace@xxxxxxxxxxxx wrote:
> >
> >> Just used the method in the link you sent me to test one of the EVO
> >> 850s, with one job it reached a speed of around 2.5MB/s but it didn't
> >> max out until around 32 jobs at 24MB/s:
> >>
> > I'm not the author of that page, nor did I verify that they used a
> > uniform methodology/environment, nor do I think that this test is a
> > particularly close approximation of what I see Ceph doing in reality
> > (it seems to write much larger chunks than 4KB).
> > I'd suggest keeping numjobs at 1 and ramping up the block size to 4MB,
> > and see where you max out with that.
> > I can reach the theoretical max speed of my SSDs (350MB/s) at 4MB
> > blocks, but it's already at 90% with 1MB.
> >
> > That test does however produce interesting numbers that seem to be
> > consistent by themselves and match what people have been reporting
> > here.
> >
> > I can get 22MB/s with just one fio job with the above settings (alas
> > on a filesystem, no spare raw partition right now) on a DC S3700 200GB
> > SSD, directly connected to an onboard Intel SATA-3 port.
> >
> > Now think what that means: a fio numjob is equivalent to an OSD
> > daemon, so in this worst-case 4KB scenario my journal and thus OSD
> > would be 10 times faster than yours.
> > Food for thought.
> >
> >> sudo fio --filename=/dev/sdh --direct=1 --sync=1 --rw=write --bs=4k
> >> --numjobs=32 --iodepth=1 --runtime=60 --time_based --group_reporting
> >> --name=journal-test
> >> write: io=1507.4MB, bw=25723KB/s, iops=6430, runt= 60007msec
> >>
> >> Also tested a Micron 550 we had sitting around and it maxed out at
> >> 2.5MB/s, both results conflict with the chart.
> >>
> > Note that they disabled the on-SSD and controller caches, the former
> > of which is of course messing things up where this isn't needed.
> >
> > I'd suggest you go and do a test install of Ceph with your HW and test
> > that, paying close attention to your SSD utilization with atop or
> > iostat, etc.
> >
> > Christian
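(Aside: a rough sketch of the numjobs=1 / 4MB block size variant Christian
suggests above, based on the same fio invocation Cameron posted - the
parameters are illustrative only and /dev/sdh is simply the example device
from that test:

  sudo fio --filename=/dev/sdh --direct=1 --sync=1 --rw=write --bs=4M \
      --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting \
      --name=journal-bw-test

The write-cache toggling Jan describes in point 4 would typically be done
with hdparm, e.g. "hdparm -W 0 /dev/sdh" to disable the drive write cache
and "hdparm -W 1 /dev/sdh" to re-enable it - again a hedged example, not a
command taken from this thread.)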
> >> Regards,
> >>
> >> Cameron Scrace
> >> Infrastructure Engineer
> >>
> >> Mobile +64 22 610 4629
> >> Phone +64 4 462 5085
> >> Email cameron.scrace@xxxxxxxxxxxx
> >> Solnet Solutions Limited
> >> Level 12, Solnet House
> >> 70 The Terrace, Wellington 6011
> >> PO Box 397, Wellington 6140
> >>
> >> www.solnet.co.nz
> >>
> >> From: Christian Balzer <chibi@xxxxxxx>
> >> To: "ceph-users@xxxxxxxx" <ceph-users@xxxxxxxx>
> >> Cc: Cameron.Scrace@xxxxxxxxxxxx
> >> Date: 08/06/2015 02:40 p.m.
> >> Subject: Re: Multiple journals and an OSD on one SSD doable?
> >>
> >> On Mon, 8 Jun 2015 14:30:17 +1200 Cameron.Scrace@xxxxxxxxxxxx wrote:
> >>
> >>> Thanks for all the feedback.
> >>>
> >>> What makes the EVOs unusable? They should have plenty of speed but
> >>> your link has them at 1.9MB/s, is it just the way they handle
> >>> O_DIRECT and D_SYNC?
> >>>
> >> Precisely.
> >> Read that ML thread for details.
> >>
> >> And once more, they are also not very endurable.
> >> So depending on your usage pattern and the write amplification of
> >> Ceph (Ceph itself and the underlying FS), their TBW/$ will be
> >> horrible, costing you more in the end than more expensive, but an
> >> order of magnitude more endurable, DC SSDs.
> >>
> >>> Not sure if we will be able to spend any more, we may just have to
> >>> take the performance hit until we can get more money for the
> >>> project.
> >>>
> >> You could cheap out with 200GB DC S3700s (half the price), but they
> >> will definitely become the bottleneck at a combined max speed of
> >> about 700MB/s, as opposed to the 400GB ones at 900MB/s combined.
> >>
> >> Christian
> >>
> >>> Thanks,
> >>>
> >>> Cameron Scrace
> >>> Infrastructure Engineer
> >>>
> >>> Mobile +64 22 610 4629
> >>> Phone +64 4 462 5085
> >>> Email cameron.scrace@xxxxxxxxxxxx
> >>> Solnet Solutions Limited
> >>> Level 12, Solnet House
> >>> 70 The Terrace, Wellington 6011
> >>> PO Box 397, Wellington 6140
> >>>
> >>> www.solnet.co.nz
> >>>
> >>> From: Christian Balzer <chibi@xxxxxxx>
> >>> To: "ceph-users@xxxxxxxx" <ceph-users@xxxxxxxx>
> >>> Cc: Cameron.Scrace@xxxxxxxxxxxx
> >>> Date: 08/06/2015 02:00 p.m.
> >>> Subject: Re: Multiple journals and an OSD on one SSD doable?
> >>>
> >>> Cameron,
> >>>
> >>> To offer at least some constructive advice here instead of just all
> >>> doom and gloom, here's what I'd do:
> >>>
> >>> Replace the OS SSDs with 2 400GB Intel DC S3700s (or S3710s).
> >>> They have enough BW to nearly saturate your network.
> >>>
> >>> Put all your journals on them (journals for 3 SSD OSDs and 3 HDD
> >>> OSDs per device).
> >>> While that's a bad move from a failure domain perspective, your
> >>> budget probably won't allow for anything better and those are VERY
> >>> reliable and, just as important, durable SSDs.
> >>>
> >>> This will give you the speed your current setup is capable of,
> >>> probably limited by the CPU when it comes to SSD pool operations.
> >>>
> >>> Christian
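(Aside: when following Christian's earlier suggestion to watch the journal
SSDs with atop or iostat, the check can be as simple as keeping an eye on
%util and write throughput while a benchmark or backfill runs, for example
with sysstat's iostat - the device names here are hypothetical:

  iostat -xm sdg sdh 2

The 700MB/s vs. 900MB/s figures also roughly line up with Intel's data-sheet
sequential-write numbers as I recall them, about 365MB/s for the 200GB DC
S3700 and 460MB/s for the 400GB model: two small drives top out around
730MB/s, two 400GB ones around 920MB/s, still short of the ~1.2GB/s a
10Gb/s link can carry.)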
> >>> On Mon, 8 Jun 2015 10:44:06 +0900 Christian Balzer wrote:
> >>>
> >>>> Hello Cameron,
> >>>>
> >>>> On Mon, 8 Jun 2015 13:13:33 +1200 Cameron.Scrace@xxxxxxxxxxxx wrote:
> >>>>
> >>>>> Hi Christian,
> >>>>>
> >>>>> Yes, we have purchased all our hardware; it was very hard to
> >>>>> convince management/finance to approve it, so some of the stuff we
> >>>>> have is a bit cheap.
> >>>>>
> >>>> Unfortunate. Both the done deal and the cheapness.
> >>>>
> >>>>> We have four storage nodes, each with 6 x 6TB Western Digital Red
> >>>>> SATA drives (WD60EFRX-68M), 6 x 1TB Samsung EVO 850 SSDs and
> >>>>> 2 x 250GB Samsung EVO 850s (for the OS RAID).
> >>>>> CPUs are Intel Atom C2750 @ 2.40GHz (8 cores) with 32 GB of RAM.
> >>>>> We have a 10Gig network.
> >>>>>
> >>>> I wish there was a nice way to say this, but it unfortunately boils
> >>>> down to a "You're fooked".
> >>>>
> >>>> There have been many discussions about which SSDs are usable with
> >>>> Ceph, very recently as well.
> >>>> Samsung EVOs (the non-DC type for sure) are basically unusable for
> >>>> journals. See the recent thread "Possible improvements for a slow
> >>>> write speed (excluding independent SSD journals)" and:
> >>>>
> >>>> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> >>>>
> >>>> for reference.
> >>>>
> >>>> I presume your intention for the 1TB SSDs is an SSD-backed pool?
> >>>> Note that the EVOs have a pretty low (guaranteed) endurance, so
> >>>> aside from needing journal SSDs that can actually do the job,
> >>>> you're looking at wearing them out rather quickly (depending on
> >>>> your use case, of course).
> >>>>
> >>>> Now with SSD based OSDs, or even HDD based OSDs with SSD journals,
> >>>> that CPU looks a bit anemic.
> >>>>
> >>>> More below:
> >>>>> The two options we are considering are:
> >>>>>
> >>>>> 1) Use two of the 1TB SSDs for the spinning disk journals (3 each)
> >>>>> and then use the remaining 900+GB of each drive as an OSD to be
> >>>>> part of the cache pool.
> >>>>>
> >>>>> 2) Put the spinning disk journals on the OS SSDs and use the 2 1TB
> >>>>> SSDs for the cache pool.
> >>>>>
> >>>> Cache pools aren't all that speedy currently (research the ML
> >>>> archives), even less so with the SSDs you have.
> >>>>
> >>>> Christian
> >>>>
> >>>>> In both cases the other 4 1TB SSDs will be part of their own tier.
> >>>>>
> >>>>> Thanks a lot!
> >>>>>
> >>>>> Cameron Scrace
> >>>>> Infrastructure Engineer
> >>>>>
> >>>>> Mobile +64 22 610 4629
> >>>>> Phone +64 4 462 5085
> >>>>> Email cameron.scrace@xxxxxxxxxxxx
> >>>>> Solnet Solutions Limited
> >>>>> Level 12, Solnet House
> >>>>> 70 The Terrace, Wellington 6011
> >>>>> PO Box 397, Wellington 6140
> >>>>>
> >>>>> www.solnet.co.nz
> >>>>>
> >>>>> From: Christian Balzer <chibi@xxxxxxx>
> >>>>> To: "ceph-users@xxxxxxxx" <ceph-users@xxxxxxxx>
> >>>>> Cc: Cameron.Scrace@xxxxxxxxxxxx
> >>>>> Date: 08/06/2015 12:18 p.m.
> >>>>> Subject: Re: Multiple journals and an OSD on one SSD doable?
> >>>>>
> >>>>> Hello,
> >>>>>
> >>>>> On Mon, 8 Jun 2015 09:55:56 +1200 Cameron.Scrace@xxxxxxxxxxxx
> >>>>> wrote:
> >>>>>
> >>>>>> The other option we were considering was putting the journals on
> >>>>>> the OS SSDs; they are only 250GB and the rest would be for the
> >>>>>> OS. Is that a decent option?
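(Aside: option 1 above - three journals plus a leftover OSD partition on
each 1TB SSD - could be laid out roughly like this with sgdisk; the device
name, the 5GB journal size and the partition labels are all hypothetical,
not something prescribed in this thread:

  sgdisk --new=1:0:+5G --change-name=1:"journal-osd-0" /dev/sdh
  sgdisk --new=2:0:+5G --change-name=2:"journal-osd-1" /dev/sdh
  sgdisk --new=3:0:+5G --change-name=3:"journal-osd-2" /dev/sdh
  sgdisk --new=4:0:0   --change-name=4:"cache-osd"     /dev/sdh

The three journal partitions would then be referenced by the HDD OSDs, e.g.
via their journal symlinks or the "osd journal" setting, and the remaining
900+GB fourth partition would become the cache-pool OSD.)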
> >>>>>>
> >>>>> You'll be getting a LOT better advice if you tell us more details.
> >>>>>
> >>>>> For starters, have you bought the hardware yet?
> >>>>> Tell us about your design: how many initial storage nodes, how
> >>>>> many HDDs/SSDs per node, what CPUs/RAM/network?
> >>>>>
> >>>>> What SSDs are we talking about? Exact models, please.
> >>>>> (Both of the sizes you mentioned do not ring a bell for any
> >>>>> DC-level SSDs I'm aware of.)
> >>>>>
> >>>>> That said, I'm using Intel DC S3700s for mixed OS and journal use
> >>>>> with good results.
> >>>>> In your average Ceph storage node, normal OS (mostly logging)
> >>>>> activity is a minute drop in the bucket for any decent SSD, so
> >>>>> nearly all of its resources are available to journals.
> >>>>>
> >>>>> You want to match the number of journals per SSD to the
> >>>>> capabilities of your SSDs, HDDs and network.
> >>>>>
> >>>>> For example, 8 HDD OSDs with 2 200GB DC S3700s and a 10Gb/s
> >>>>> network is a decent match.
> >>>>> The two SSDs at 900MB/s would appear to be the bottleneck, but in
> >>>>> reality I'd expect the HDDs to be it.
> >>>>> Never mind that you'd be more likely to be IOPS bound than
> >>>>> bandwidth bound.
> >>>>>
> >>>>> Regards,
> >>>>>
> >>>>> Christian
> >>>>>
> >>>>>> Thanks!
> >>>>>>
> >>>>>> Cameron Scrace
> >>>>>> Infrastructure Engineer
> >>>>>>
> >>>>>> Mobile +64 22 610 4629
> >>>>>> Phone +64 4 462 5085
> >>>>>> Email cameron.scrace@xxxxxxxxxxxx
> >>>>>> Solnet Solutions Limited
> >>>>>> Level 12, Solnet House
> >>>>>> 70 The Terrace, Wellington 6011
> >>>>>> PO Box 397, Wellington 6140
> >>>>>>
> >>>>>> www.solnet.co.nz
> >>>>>>
> >>>>>> From: Somnath Roy <Somnath.Roy@xxxxxxxxxxx>
> >>>>>> To: "Cameron.Scrace@xxxxxxxxxxxx" <Cameron.Scrace@xxxxxxxxxxxx>,
> >>>>>> "ceph-users@xxxxxxxx" <ceph-users@xxxxxxxx>
> >>>>>> Date: 08/06/2015 09:34 a.m.
> >>>>>> Subject: RE: Multiple journals and an OSD on one SSD doable?
> >>>>>>
> >>>>>> Cameron,
> >>>>>> Generally, it's not a good idea.
> >>>>>> You want to protect the SSDs you use as journals. If there is any
> >>>>>> problem with that disk, you will be losing all of the dependent
> >>>>>> OSDs.
> >>>>>> I don't think a bigger journal will gain you much performance, so
> >>>>>> the default 5 GB journal size should be good enough. If you want
> >>>>>> to reduce the fault domain and want to put 3 journals on an SSD,
> >>>>>> go for minimum-size, high-endurance SSDs for that.
> >>>>>> Now, if you want to use the rest of the space on the 1 TB SSDs,
> >>>>>> creating just OSDs will not gain you much (rather, you may get
> >>>>>> some burst performance). You may want to consider the following.
> >>>>>>
> >>>>>> 1. If your spindle OSD size is much bigger than 900 GB, you don't
> >>>>>> want to make all OSDs of similar sizes; a cache pool could be one
> >>>>>> of your options. But remember, a cache pool can wear out your
> >>>>>> SSDs faster, as presently I guess it is not optimizing the extra
> >>>>>> writes. Sorry, I don't have exact data as I am yet to test that
> >>>>>> out.
> >>>>>>
> >>>>>> 2. If you want to make all the OSDs of similar sizes and you will
> >>>>>> be able to create a substantial number of OSDs with your unused
> >>>>>> SSDs (depending on how big the cluster is), you may want to put
> >>>>>> all of your primary OSDs on SSD and gain a significant
> >>>>>> performance boost for reads.
> >>>>>> Also, in this case, I don't think you will be getting any burst
> >>>>>> performance.
> >>>>>>
> >>>>>> Thanks & Regards,
> >>>>>> Somnath
> >>>>>>
> >>>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
> >>>>>> Behalf Of Cameron.Scrace@xxxxxxxxxxxx
> >>>>>> Sent: Sunday, June 07, 2015 1:49 PM
> >>>>>> To: ceph-users@xxxxxxxx
> >>>>>> Subject: Multiple journals and an OSD on one SSD doable?
> >>>>>>
> >>>>>> Setting up a Ceph cluster and we want the journals for our
> >>>>>> spinning disks to be on SSDs, but all of our SSDs are 1TB. We
> >>>>>> were planning on putting 3 journals on each SSD, but that leaves
> >>>>>> 900+GB unused on the drive. Is it possible to use the leftover
> >>>>>> space as another OSD, or will it affect performance too much?
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> Cameron Scrace
> >>>>>> Infrastructure Engineer
> >>>>>>
> >>>>>> Mobile +64 22 610 4629
> >>>>>> Phone +64 4 462 5085
> >>>>>> Email cameron.scrace@xxxxxxxxxxxx
> >>>>>> Solnet Solutions Limited
> >>>>>> Level 12, Solnet House
> >>>>>> 70 The Terrace, Wellington 6011
> >>>>>> PO Box 397, Wellington 6140
> >>>>>>
> >>>>>> www.solnet.co.nz
> >>>>>>
> >>>>>> Attention: This email may contain information intended for the
> >>>>>> sole use of the original recipient. Please respect this when
> >>>>>> sharing or disclosing this email's contents with any third party.
> >>>>>> If you believe you have received this email in error, please
> >>>>>> delete it and notify the sender or
> >>>>>> postmaster@xxxxxxxxxxxxxxxxxxxxx as soon as possible. The content
> >>>>>> of this email does not necessarily reflect the views of Solnet
> >>>>>> Solutions Ltd.
> >>>>>>
> >>>>>> PLEASE NOTE: The information contained in this electronic mail
> >>>>>> message is intended only for the use of the designated
> >>>>>> recipient(s) named above. If the reader of this message is not
> >>>>>> the intended recipient, you are hereby notified that you have
> >>>>>> received this message in error and that any review,
> >>>>>> dissemination, distribution, or copying of this message is
> >>>>>> strictly prohibited. If you have received this communication in
> >>>>>> error, please notify the sender by telephone or e-mail (as shown
> >>>>>> above) immediately and destroy any and all copies of this message
> >>>>>> in your possession (whether hard copies or electronically stored
> >>>>>> copies).
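(Aside: the "default 5 GB journal size" Somnath refers to corresponds to the
"osd journal size" setting, which is given in MB in ceph.conf. A minimal,
purely illustrative sketch - the OSD id and partition label are made up, and
a raw journal partition can also simply be symlinked from the OSD's data
directory instead:

  [osd]
  osd journal size = 5120        # 5 GB, the default Somnath mentions

  [osd.0]
  osd journal = /dev/disk/by-partlabel/journal-osd-0   # hypothetical GPT label

Nothing in this thread mandates that layout; it is only meant to show where
journal size and placement are configured.)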
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
> > --
> > Christian Balzer           Network/Systems Engineer
> > chibi@xxxxxxx              Global OnLine Japan/Fusion Communications
> > http://www.gol.com/
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

--
Christian Balzer           Network/Systems Engineer
chibi@xxxxxxx              Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com