On Mon, 8 Jun 2015 10:12:02 +0200 Jan Schermer wrote:

> 
> > On 08 Jun 2015, at 10:07, Christian Balzer <chibi@xxxxxxx> wrote:
> > 
> > On Mon, 8 Jun 2015 09:44:54 +0200 Jan Schermer wrote:
> > 
> >> I recently did some testing of a few SSDs and found some surprising,
> >> and some not so surprising things:
> >> 
> >> 1) performance varies wildly with firmware, especially with cheaper
> >> drives
> >> 2) performance varies with time - even with the S3700 - it slows down
> >> after ~40-80GB and then creeps back up
> >> 3) cheaper drives gain almost nothing from bigger queues (at least
> >> with fio iodepth=XX)
> >> 4) all drives I tested reach higher IOPS for direct and
> >> synchronous writes with iodepth=1 when the write cache is *DISABLED*
> >> (even the S3700)
> >> - I suspect compression and write coalescing are disabled
> >> - the Intel S3700 can reach almost the same IOPS with higher queue
> >>   depths, but that's sadly not a real scenario
> >> - in any case, disabling the write cache doesn't help real workloads
> >>   on my cluster
> >> 5) write amplification is worst for synchronous direct writes,
> >> so it's much better to collocate journal and data on the same SSD if
> >> you worry about the DWPD endurance rating
> >> 
> > Or in his case _not_ have them on the low endurance EVOs. ^_-
> 
> Not necessarily - drives have to write the data sooner or later; writing
> the same block twice doesn't always mean draining the DWPD rating at twice
> the rate. In fact, if it writes in 32KiB blocks it might be almost the
> same, because SSDs have much larger blocks internally anyway…
> 
> If there is an async IO queued (data) and another synchronous IO comes in
> for the journal, then it might be a single atomic operation on the SSD,
> because it has to flush everything it has.
> 
While true, there are a number of might/maybes in there, and with those
SSDs in particular I'd probably opt for anything that can reduce writes.
Never mind the less than stellar expected performance with a co-located
journal.

All that of course only if he can afford DC S3700s for OS/journal.

> > 
> >> and the bottom line:
> >> 6) Ceph doesn't utilize the SSD capabilities at all on my Dumpling -
> >> I hope it will be better on Giant :)
> >> 7) ext4 has much higher throughput for my benchmarks when not using
> >> a raw device
> >> 8) I lose ~50% IOPS on XFS compared to the block device - ext4 loses
> >> ~10%
> >> 
> > I can confirm that from my old (Emperor days) comparison between EXT4
> > and XFS.
> 
> Do you have the benchmark somewhere? This wasn't really what I was
> testing, so some numbers would be very helpful… Did you test the various
> ext4 options?
> 
Unfortunately, no. At least not in any meaningful detail.
But what I have left and remember matches the ratio you're seeing.

I think the most significant change was to give EXT4 a maximum sized
journal.
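If you want to reproduce that, it's just a mke2fs option. A rough sketch
only: /dev/sdX1 is a placeholder, the size is in MiB and mke2fs caps it
depending on the filesystem block size and e2fsprogs version, so pick the
largest value yours accepts:

  # make the filesystem with an explicitly large journal, then check
  # what you actually ended up with
  mkfs.ext4 -J size=400 /dev/sdX1
  dumpe2fs -h /dev/sdX1 | grep -i journal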
Christian

> Jan
> 
> > 
> > Christian
> > 
> >> btw the real average request size on my SSDs is only about 32KiB
> >> (journal+data on the same device)
> >> 
> >> Jan
> >> 
> >>> On 08 Jun 2015, at 09:33, Christian Balzer <chibi@xxxxxxx> wrote:
> >>> 
> >>> Hello,
> >>> 
> >>> On Mon, 8 Jun 2015 18:01:28 +1200 Cameron.Scrace@xxxxxxxxxxxx wrote:
> >>> 
> >>>> Just used the method in the link you sent me to test one of the EVO
> >>>> 850s, with one job it reached a speed of around 2.5MB/s but it
> >>>> didn't max out until around 32 jobs at 24MB/s:
> >>>> 
> >>> I'm not the author of that page, nor did I verify that they used a
> >>> uniform methodology/environment, nor do I think that this test is a
> >>> particularly close approximation of what I see Ceph doing in reality
> >>> (it seems to write much larger chunks than 4KB).
> >>> I'd suggest keeping numjobs at 1 and ramping up the block size to
> >>> 4MB, and see where you max out with that.
> >>> I can reach the theoretical max speed of my SSDs (350MB/s) at 4MB
> >>> blocks, but it's already at 90% with 1MB.
> >>> 
> >>> That test does however produce interesting numbers that seem to be
> >>> consistent by themselves and match what people have been reporting
> >>> here.
> >>> 
> >>> I can get 22MB/s with just one fio job with the above settings (alas
> >>> on a filesystem, no spare raw partition right now) on a DC S3700
> >>> 200GB SSD, directly connected to an onboard Intel SATA-3 port.
> >>> 
> >>> Now think about what that means: a fio numjob is equivalent to an OSD
> >>> daemon, so in this worst case 4KB scenario my journal and thus OSD
> >>> would be 10 times faster than yours.
> >>> Food for thought.
> >>> 
> >>>> sudo fio --filename=/dev/sdh --direct=1 --sync=1 --rw=write --bs=4k
> >>>> --numjobs=32 --iodepth=1 --runtime=60 --time_based
> >>>> --group_reporting --name=journal-test
> >>>> write: io=1507.4MB, bw=25723KB/s, iops=6430, runt= 60007msec
> >>>> 
> >>>> Also tested a Micron 550 we had sitting around and it maxed out at
> >>>> 2.5MB/s, both results conflict with the chart
> >>>> 
> >>> Note that they disabled the on-SSD and controller caches; the former
> >>> of course messes things up where this isn't needed.
> >>> 
> >>> I'd suggest you go and do a test install of Ceph with your HW and
> >>> test that, paying close attention to your SSD utilization with atop
> >>> or iostat, etc.
> >>> 
> >>> Christian
> >>> 
> >>>> Regards,
> >>>> 
> >>>> Cameron Scrace
> >>>> Infrastructure Engineer
> >>>> 
> >>>> Mobile +64 22 610 4629
> >>>> Phone +64 4 462 5085
> >>>> Email cameron.scrace@xxxxxxxxxxxx
> >>>> Solnet Solutions Limited
> >>>> Level 12, Solnet House
> >>>> 70 The Terrace, Wellington 6011
> >>>> PO Box 397, Wellington 6140
> >>>> 
> >>>> www.solnet.co.nz
> >>>> 
> >>>> From: Christian Balzer <chibi@xxxxxxx>
> >>>> To: "ceph-users@xxxxxxxx" <ceph-users@xxxxxxxx>
> >>>> Cc: Cameron.Scrace@xxxxxxxxxxxx
> >>>> Date: 08/06/2015 02:40 p.m.
> >>>> Subject: Re: Multiple journals and an OSD on one SSD doable?
> >>>> 
> >>>> On Mon, 8 Jun 2015 14:30:17 +1200 Cameron.Scrace@xxxxxxxxxxxx wrote:
> >>>> 
> >>>>> Thanks for all the feedback.
> >>>>> 
> >>>>> What makes the EVOs unusable? They should have plenty of speed but
> >>>>> your link has them at 1.9MB/s, is it just the way they handle
> >>>>> O_DIRECT and D_SYNC?
> >>>>> 
> >>>> Precisely.
> >>>> Read that ML thread for details.
> >>>> 
> >>>> And once more, they also are not very endurable.
> >>>> So depending on your usage pattern and the write amplification of
> >>>> Ceph (Ceph itself and the underlying FS), their TBW/$ will be
> >>>> horrible, costing you more in the end than more expensive, but an
> >>>> order of magnitude more endurable, DC SSDs.
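A quick way to keep an eye on that wear in practice is the drives' SMART
counters. The attribute names differ by vendor (Media_Wearout_Indicator on
the Intels, Wear_Leveling_Count on the Samsungs), so adjust the pattern and
the placeholder device name accordingly:

  smartctl -A /dev/sdX | egrep -i 'wearout|wear_leveling|total_lbas_written'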
> >>>> 
> >>>>> Not sure if we will be able to spend any more, we may just have to
> >>>>> take the performance hit until we can get more money for the
> >>>>> project.
> >>>>> 
> >>>> You could cheap out with 200GB DC S3700s (half the price), but they
> >>>> will definitely become the bottleneck at a combined max speed of
> >>>> about 700MB/s, as opposed to the 400GB ones at 900MB/s combined.
> >>>> 
> >>>> Christian
> >>>> 
> >>>>> Thanks,
> >>>>> 
> >>>>> Cameron Scrace
> >>>>> Infrastructure Engineer
> >>>>> Solnet Solutions Limited
> >>>>> www.solnet.co.nz
> >>>>> 
> >>>>> From: Christian Balzer <chibi@xxxxxxx>
> >>>>> To: "ceph-users@xxxxxxxx" <ceph-users@xxxxxxxx>
> >>>>> Cc: Cameron.Scrace@xxxxxxxxxxxx
> >>>>> Date: 08/06/2015 02:00 p.m.
> >>>>> Subject: Re: Multiple journals and an OSD on one SSD doable?
> >>>>> 
> >>>>> Cameron,
> >>>>> 
> >>>>> To offer at least some constructive advice here instead of just all
> >>>>> doom and gloom, here's what I'd do:
> >>>>> 
> >>>>> Replace the OS SSDs with 2 400GB Intel DC S3700s (or S3710s).
> >>>>> They have enough BW to nearly saturate your network.
> >>>>> 
> >>>>> Put all your journals on them (3 SSD OSDs and 3 HDD OSDs per SSD).
> >>>>> While that's a bad move from a failure domain perspective, your
> >>>>> budget probably won't allow for anything better, and those are VERY
> >>>>> reliable and, just as importantly, durable SSDs.
> >>>>> 
> >>>>> This will give you the speed your current setup is capable of,
> >>>>> probably limited by the CPU when it comes to SSD pool operations.
> >>>>> 
> >>>>> Christian
> >>>>> 
> >>>>> On Mon, 8 Jun 2015 10:44:06 +0900 Christian Balzer wrote:
> >>>>> 
> >>>>>> Hello Cameron,
> >>>>>> 
> >>>>>> On Mon, 8 Jun 2015 13:13:33 +1200 Cameron.Scrace@xxxxxxxxxxxx
> >>>>>> wrote:
> >>>>>> 
> >>>>>>> Hi Christian,
> >>>>>>> 
> >>>>>>> Yes, we have purchased all our hardware; it was very hard to
> >>>>>>> convince management/finance to approve it, so some of the stuff
> >>>>>>> we have is a bit cheap.
> >>>>>>> 
> >>>>>> Unfortunate. Both the done deal and the cheapness.
> >>>>>> 
> >>>>>>> We have four storage nodes, each with 6 x 6TB Western Digital Red
> >>>>>>> SATA drives (WD60EFRX-68M), 6 x 1TB Samsung EVO 850 SSDs and
> >>>>>>> 2 x 250GB Samsung EVO 850s (for OS RAID).
> >>>>>>> CPUs are Intel Atom C2750 @ 2.40GHz (8 cores) with 32GB of RAM.
> >>>>>>> We have a 10Gig network.
> >>>>>>> 
> >>>>>> I wish there was a nice way to say this, but it unfortunately
> >>>>>> boils down to a "You're fooked".
> >>>>>> 
> >>>>>> There have been many discussions about which SSDs are usable with
> >>>>>> Ceph, very recently as well.
> >>>>>> Samsung EVOs (the non-DC type for sure) are basically unusable for
> >>>>>> journals. See the recent thread:
> >>>>>> Possible improvements for a slow write speed (excluding
> >>>>>> independent SSD journals)
> >>>>>> and:
> >>>>>> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> >>>>>> for reference.
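For the impatient, the test that page describes boils down to a single-job
synchronous direct write, roughly the following - /dev/sdX is a placeholder
and the usual warning applies, it will happily overwrite whatever is on that
device:

  fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
      --numjobs=1 --iodepth=1 --runtime=60 --time_based \
      --group_reporting --name=journal-test

A journal-worthy SSD sustains tens of MB/s here; the EVOs collapse to the
1.9-2.5MB/s being discussed in this thread.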
> >>>>>> 
> >>>>>> I presume your intention for the 1TB SSDs is an SSD backed pool?
> >>>>>> Note that the EVOs have a pretty low (guaranteed) endurance, so
> >>>>>> aside from needing journal SSDs that can actually do the job,
> >>>>>> you're looking at wearing them out rather quickly (depending on
> >>>>>> your use case, of course).
> >>>>>> 
> >>>>>> Now with SSD based OSDs, or even HDD based OSDs with SSD journals,
> >>>>>> that CPU looks a bit anemic.
> >>>>>> 
> >>>>>> More below:
> >>>>>>> The two options we are considering are:
> >>>>>>> 
> >>>>>>> 1) Use two of the 1TB SSDs for the spinning disk journals (3
> >>>>>>> each) and then use the remaining 900+GB of each drive as an OSD
> >>>>>>> to be part of the cache pool.
> >>>>>>> 
> >>>>>>> 2) Put the spinning disk journals on the OS SSDs and use the 2
> >>>>>>> 1TB SSDs for the cache pool.
> >>>>>>> 
> >>>>>> Cache pools aren't all that speedy currently (research the ML
> >>>>>> archives), even less so with the SSDs you have.
> >>>>>> 
> >>>>>> Christian
> >>>>>> 
> >>>>>>> In both cases the other 4 1TB SSDs will be part of their own
> >>>>>>> tier.
> >>>>>>> 
> >>>>>>> Thanks a lot!
> >>>>>>> 
> >>>>>>> Cameron Scrace
> >>>>>>> Infrastructure Engineer
> >>>>>>> Solnet Solutions Limited
> >>>>>>> www.solnet.co.nz
> >>>>>>> 
> >>>>>>> From: Christian Balzer <chibi@xxxxxxx>
> >>>>>>> To: "ceph-users@xxxxxxxx" <ceph-users@xxxxxxxx>
> >>>>>>> Cc: Cameron.Scrace@xxxxxxxxxxxx
> >>>>>>> Date: 08/06/2015 12:18 p.m.
> >>>>>>> Subject: Re: Multiple journals and an OSD on one SSD doable?
> >>>>>>> 
> >>>>>>> Hello,
> >>>>>>> 
> >>>>>>> On Mon, 8 Jun 2015 09:55:56 +1200 Cameron.Scrace@xxxxxxxxxxxx
> >>>>>>> wrote:
> >>>>>>> 
> >>>>>>>> The other option we were considering was putting the journals on
> >>>>>>>> the OS SSDs, they are only 250GB and the rest would be for the
> >>>>>>>> OS. Is that a decent option?
> >>>>>>>> 
> >>>>>>> You'll get a LOT better advice if you tell us more details.
> >>>>>>> 
> >>>>>>> For starters, have you bought the hardware yet?
> >>>>>>> Tell us about your design: how many initial storage nodes, how
> >>>>>>> many HDDs/SSDs per node, what CPUs/RAM/network?
> >>>>>>> 
> >>>>>>> What SSDs are we talking about? Exact models, please.
> >>>>>>> (Neither of the sizes you mentioned rings a bell for DC level
> >>>>>>> SSDs I'm aware of.)
> >>>>>>> 
> >>>>>>> That said, I'm using Intel DC S3700s for mixed OS and journal use
> >>>>>>> with good results.
> >>>>>>> In your average Ceph storage node, normal OS (mostly logging)
> >>>>>>> activity is a minute drop in the bucket for any decent SSD, so
> >>>>>>> nearly all of its resources are available to journals.
> >>>>>>> 
> >>>>>>> You want to match the number of journals per SSD to the
> >>>>>>> capabilities of your SSDs, HDDs and network.
> >>>>>>> 
> >>>>>>> For example, 8 HDD OSDs with 2 200GB DC S3700s and a 10Gb/s
> >>>>>>> network is a decent match.
> >>>>>>> The two SSDs at 900MB/s would appear to be the bottleneck, but in
> >>>>>>> reality I'd expect the HDDs to be it.
> >>>>>>> Never mind that you'd be more likely to be IOPS than bandwidth
> >>>>>>> bound.
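Once it's running, a quick way to see which of the two you are actually
hitting is to watch the journal SSDs with sysstat's iostat (exact column
names vary a bit between versions):

  iostat -x 5
  # w/s shows the IOPS side, wkB/s the bandwidth side, and await/%util
  # whether the device is anywhere near saturation

atop gives you much the same picture per disk.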
> >>>>>>> 
> >>>>>>> Regards,
> >>>>>>> 
> >>>>>>> Christian
> >>>>>>> 
> >>>>>>>> Thanks!
> >>>>>>>> 
> >>>>>>>> Cameron Scrace
> >>>>>>>> Infrastructure Engineer
> >>>>>>>> Solnet Solutions Limited
> >>>>>>>> www.solnet.co.nz
> >>>>>>>> 
> >>>>>>>> From: Somnath Roy <Somnath.Roy@xxxxxxxxxxx>
> >>>>>>>> To: "Cameron.Scrace@xxxxxxxxxxxx" <Cameron.Scrace@xxxxxxxxxxxx>,
> >>>>>>>> "ceph-users@xxxxxxxx" <ceph-users@xxxxxxxx>
> >>>>>>>> Date: 08/06/2015 09:34 a.m.
> >>>>>>>> Subject: RE: Multiple journals and an OSD on one SSD doable?
> >>>>>>>> 
> >>>>>>>> Cameron,
> >>>>>>>> Generally, it's not a good idea.
> >>>>>>>> You want to protect the SSDs used as journals: if anything goes
> >>>>>>>> wrong with that disk, you will lose all of the dependent OSDs.
> >>>>>>>> I don't think a bigger journal will gain you much performance,
> >>>>>>>> so the default 5GB journal size should be good enough. If you
> >>>>>>>> want to reduce the fault domain and put 3 journals on one SSD,
> >>>>>>>> go for minimum size and high endurance SSDs for that.
> >>>>>>>> Now, if you want to use the rest of the space on the 1TB SSDs,
> >>>>>>>> creating just OSDs there will not gain you much (rather, you may
> >>>>>>>> just get some burst performance). You may want to consider the
> >>>>>>>> following.
> >>>>>>>> 
> >>>>>>>> 1. If your spindle OSD size is much bigger than 900GB and you
> >>>>>>>> don't want to make all OSDs of similar sizes, a cache pool could
> >>>>>>>> be one of your options. But remember, a cache pool can wear out
> >>>>>>>> your SSDs faster, as presently I guess it is not optimizing the
> >>>>>>>> extra writes. Sorry, I don't have exact data as I am yet to test
> >>>>>>>> that out.
> >>>>>>>> 
> >>>>>>>> 2. If you want to make all the OSDs of similar sizes and you
> >>>>>>>> will be able to create a substantial number of OSDs with your
> >>>>>>>> unused SSDs (depends on how big the cluster is), you may want
> >>>>>>>> to put all of your primary OSDs on SSD and gain a significant
> >>>>>>>> performance boost for reads. Also, in this case, I don't think
> >>>>>>>> you will be getting any burst performance.
> >>>>>>>> Thanks & Regards
> >>>>>>>> Somnath
> >>>>>>>> 
> >>>>>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
> >>>>>>>> Behalf Of Cameron.Scrace@xxxxxxxxxxxx
> >>>>>>>> Sent: Sunday, June 07, 2015 1:49 PM
> >>>>>>>> To: ceph-users@xxxxxxxx
> >>>>>>>> Subject: Multiple journals and an OSD on one SSD doable?
> >>>>>>>> 
> >>>>>>>> Setting up a Ceph cluster and we want the journals for our
> >>>>>>>> spinning disks to be on SSDs, but all of our SSDs are 1TB. We
> >>>>>>>> were planning on putting 3 journals on each SSD, but that leaves
> >>>>>>>> 900+GB unused on the drive. Is it possible to use the leftover
> >>>>>>>> space as another OSD, or will it affect performance too much?
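For what it's worth, carving the drive up like that is straightforward with
GPT partitions. Purely illustrative - the device names and journal sizes are
placeholders and need to match your actual layout:

  # three journal partitions (one per spinner), remainder left for an OSD
  sgdisk -n 1:0:+10G -c 1:"journal-sdb" /dev/sdX
  sgdisk -n 2:0:+10G -c 2:"journal-sdc" /dev/sdX
  sgdisk -n 3:0:+10G -c 3:"journal-sdd" /dev/sdX
  sgdisk -n 4:0:0    -c 4:"osd-data"    /dev/sdX

Whether you *should* is the question the rest of this thread is about.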
> >>>>>>>> 
> >>>>>>>> Thanks,
> >>>>>>>> 
> >>>>>>>> Cameron Scrace
> >>>>>>>> Infrastructure Engineer
> >>>>>>>> Solnet Solutions Limited
> >>>>>>>> www.solnet.co.nz
> >>>>>>>> 


-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com