Re: Multiple journals and an OSD on one SSD doable?

On Mon, 8 Jun 2015 09:44:54 +0200 Jan Schermer wrote:

> I recently did some testing of a few SSDs and found some surprising, and
> some not so surprising things:
> 
> 1) performance varies wildly with firmware, especially with cheaper
> drives 
> 2) performance varies with time - even with S3700 - slows down
> after ~40-80GB and then creeps back up 
> 3) cheaper drives have almost no gain from bigger queues (at least with
> fio iodepth=XX) 
> 4) all drives I tested reach higher IOPS for direct and
> synchronous writes with iodepth=1 when write cache is *DISABLED* (even
> S3700)
>  - I suspect compression and write coalescing are disabled
>  - Intel S3700 can reach almost the same IOPS with higher queue depths,
> but that’s sadly not a real scenario
>  - in any case, disabling write cache doesn’t help real workloads on my
> cluster 

> 5) write amplification is worst for synchronous direct writes,
> so it’s much better to collocate journal and data on the same SSD if you
> worry about DWPD endurance rating
> 
Or in his case _not_ have them on the low endurance EVOs. ^_-
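
If you do keep journals (or a cache tier) on the EVOs, it's worth
watching their wear. A minimal sketch - SMART attribute names vary by
vendor, so the grep pattern and device path are only examples:

smartctl -A /dev/sdX                    # full attribute dump
smartctl -A /dev/sdX | egrep -i wear    # Intel: Media_Wearout_Indicator,
                                        # Samsung: Wear_Leveling_Count

Sampled periodically and compared against the drive's rated TBW/DWPD,
that gives a rough idea of how fast the endurance budget is being spent.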

> and bottom line
> 6) CEPH doesn’t utilize the SSD capabilities at all on my Dumpling - I
> hope it will be better on Giant :)
> 7) ext4 has much higher throughput for my benchmarks when not using a
> raw device
> 8) I lose ~50% IOPS on XFS compared to block device - ext4 loses ~10%
> 
I can confirm that from my old (emperor days) comparison between EXT4 and
XFS.
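
For anyone wanting to reproduce points 4, 7 and 8 above, roughly this
kind of comparison should do it. A sketch only - device and mount paths
are placeholders, and the raw-device run is destructive:

# baseline: raw device, direct synchronous 4KB writes, queue depth 1
fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=raw-test

# same job against a file on ext4 and then on XFS, mounted at /mnt/test
fio --filename=/mnt/test/fio.tmp --size=4G --direct=1 --sync=1 \
    --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 \
    --time_based --name=fs-test

# and for point 4, toggle the volatile write cache between runs
hdparm -W 0 /dev/sdX    # disable
hdparm -W 1 /dev/sdX    # re-enable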

Christian

> btw the real average request size on my SSDs is only about 32KiB
> (journal+data on the same device)
> 
> Jan
> 
> > On 08 Jun 2015, at 09:33, Christian Balzer <chibi@xxxxxxx> wrote:
> > 
> > 
> > Hello,
> > 
> > On Mon, 8 Jun 2015 18:01:28 +1200 Cameron.Scrace@xxxxxxxxxxxx wrote:
> > 
> >> Just used the method in the link you sent me to test one of the EVO
> >> 850s; with one job it reached a speed of around 2.5MB/s, but it didn't
> >> max out until around 32 jobs at 24MB/s:
> >> 
> > I'm not the author of that page, nor did I verify that they used a
> > uniform methodology/environment, nor do I think that this test is a
> > particularly close approximation of what I see Ceph doing in reality
> > (it seems to write much larger chunks than 4KB).
> > I'd suggest keeping numjobs at 1 and ramping up the block size to 4MB
> > to see where you max out with that.
> > I can reach the theoretical max speed of my SSDs (350MB/s) at 4MB
> > blocks, but it's already at 90% with 1MB.
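
For reference, that test would look something like this (device path is
a placeholder):

fio --filename=/dev/sdX --direct=1 --rw=write --bs=4M --numjobs=1 \
    --iodepth=1 --runtime=60 --time_based --name=bw-test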
> > 
> > That test does however produce interesting numbers that seem to be
> > consistent by themselves and match what people have reported here.
> > 
> > I can get 22MB/s with just one fio job with the above settings (alas on
> > a filesystem, no spare raw partition right now) on a DC S3700 200GB SSD,
> > directly connected to an onboard Intel SATA-3 port.
> > 
> > Now think about what that means: a fio numjob is equivalent to an OSD
> > daemon, so in this worst-case 4KB scenario my journal, and thus my OSD,
> > would be about 10 times faster than yours.
> > Food for thought.
> > 
> >> sudo fio --filename=/dev/sdh --direct=1 --sync=1 --rw=write --bs=4k 
> >> --numjobs=32 --iodepth=1 --runtime=60 --time_based --group_reporting 
> >> --name=journal-test
> >> write: io=1507.4MB, bw=25723KB/s, iops=6430, runt= 60007msec
> >> 
> >> Also tested a Micron 550 we had sitting around and it maxed out at
> >> 2.5MB/s; both results conflict with the chart.
> >> 
> > Note that they disabled the on-SSD and controller caches; the former of
> > course skews things where that isn't needed.
> > 
> > I'd suggest you go and do a test install of Ceph with your HW and
> > test that, paying close attention to your SSD utilization with atop
> > or iostat, etc.
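
(iostat -x 5 on the storage nodes shows per-device throughput, await and
%util, which is usually enough to spot a saturated journal SSD.)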
> > 
> > Christian
> > 
> >> Regards,
> >> 
> >> Cameron Scrace
> >> Infrastructure Engineer
> >> 
> >> Mobile +64 22 610 4629
> >> Phone  +64 4 462 5085 
> >> Email  cameron.scrace@xxxxxxxxxxxx
> >> Solnet Solutions Limited
> >> Level 12, Solnet House
> >> 70 The Terrace, Wellington 6011
> >> PO Box 397, Wellington 6140
> >> 
> >> www.solnet.co.nz
> >> 
> >> 
> >> 
> >> From:   Christian Balzer <chibi@xxxxxxx>
> >> To:     "ceph-users@xxxxxxxx" <ceph-users@xxxxxxxx>
> >> Cc:     Cameron.Scrace@xxxxxxxxxxxx
> >> Date:   08/06/2015 02:40 p.m.
> >> Subject:        Re:  Multiple journals and an OSD on one
> >> SSD doable?
> >> 
> >> 
> >> 
> >> On Mon, 8 Jun 2015 14:30:17 +1200 Cameron.Scrace@xxxxxxxxxxxx wrote:
> >> 
> >>> Thanks for all the feedback. 
> >>> 
> >>> What makes the EVOs unusable? They should have plenty of speed, but
> >>> your link has them at 1.9MB/s; is it just the way they handle
> >>> O_DIRECT and D_SYNC?
> >>> 
> >> Precisely. 
> >> Read that ML thread for details.
> >> 
> >> And once more, they also have very low endurance.
> >> So depending on your usage pattern and the write amplification from
> >> Ceph (Ceph itself and the underlying FS), their TBW/$ will be horrible,
> >> costing you more in the end than more expensive, but an order of
> >> magnitude more durable, DC SSDs.
> >> 
> >>> Not sure if we will be able to spend any more; we may just have to
> >>> take the performance hit until we can get more money for the project.
> >>> 
> >> You could cheap out with 200GB DC S3700s (half the price), but they
> >> will definitely become the bottleneck at a combined max speed of about
> >> 700MB/s, as opposed to the 400GB ones at 900MB/s combined.
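
(Back of the envelope: roughly 2 x 365MB/s vs 2 x 460MB/s sequential
write per Intel's spec sheets, if I remember the numbers right.)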
> >> 
> >> Christian
> >> 
> >>> Thanks,
> >>> 
> >>> Cameron Scrace
> >>> 
> >>> 
> >>> 
> >>> From:   Christian Balzer <chibi@xxxxxxx>
> >>> To:     "ceph-users@xxxxxxxx" <ceph-users@xxxxxxxx>
> >>> Cc:     Cameron.Scrace@xxxxxxxxxxxx
> >>> Date:   08/06/2015 02:00 p.m.
> >>> Subject:        Re:  Multiple journals and an OSD on one
> >>> SSD doable?
> >>> 
> >>> 
> >>> 
> >>> 
> >>> Cameron,
> >>> 
> >>> To offer at least some constructive advice here instead of just all
> >>> doom and gloom, here's what I'd do:
> >>> 
> >>> Replace the OS SSDs with 2 400GB Intel DC S3700s (or S3710s).
> >>> They have enough BW to nearly saturate your network.
> >>> 
> >>> Put all your journals on them (journals for 3 SSD OSDs and 3 HDD
> >>> OSDs on each).
> >>> While that's a bad move from a failure domain perspective, your
> >>> budget probably won't allow for anything better, and those are VERY
> >>> reliable and, just as important, durable SSDs.
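
Mechanically that just means pointing each OSD journal at its own
partition on the shared SSD. A rough sketch - device names are made up
and the ceph-disk invocation is from memory, so double-check it:

# six ~10GB journal partitions on each DC S3700
for i in $(seq 1 6); do sgdisk --new=${i}:0:+10G /dev/sdX; done

# then create each OSD with its data disk and its journal partition, e.g.
ceph-disk prepare /dev/sdc /dev/sdX1

ceph-disk can also carve the journal partitions itself if you hand it
the whole journal device instead of an existing partition.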
> >>> 
> >>> This will give you the speed your current setup is capable of,
> >>> probably limited by the CPU when it comes to SSD pool operations.
> >>> 
> >>> Christian
> >>> 
> >>> On Mon, 8 Jun 2015 10:44:06 +0900 Christian Balzer wrote:
> >>> 
> >>>> 
> >>>> Hello Cameron,
> >>>> 
> >>>> On Mon, 8 Jun 2015 13:13:33 +1200 Cameron.Scrace@xxxxxxxxxxxx wrote:
> >>>> 
> >>>>> Hi Christian,
> >>>>> 
> >>>>> Yes, we have purchased all our hardware; it was very hard to
> >>>>> convince management/finance to approve it, so some of the stuff we
> >>>>> have is a bit cheap.
> >>>>> 
> >>>> Unfortunate. Both the done deal and the cheapness. 
> >>>> 
> >>>>> We have four storage nodes, each with 6 x 6TB Western Digital Red
> >>>>> SATA drives (WD60EFRX-68M), 6 x 1TB Samsung EVO 850 SSDs and
> >>>>> 2 x 250GB Samsung EVO 850s (for OS RAID).
> >>>>> CPUs are Intel Atom C2750 @ 2.40GHz (8 cores) with 32 GB of RAM.
> >>>>> We have a 10Gig network.
> >>>>> 
> >>>> I wish there was a nice way to say this, but it unfortunately boils
> >>>> down to a "You're fooked".
> >>>> 
> >>>> There have been many discussions about which SSDs are usable with
> >>>> Ceph, very recently as well.
> >>>> Samsung EVOs (the non DC type for sure) are basically unusable for
> >>>> journals. See the recent thread:
> >>>> Possible improvements for a slow write speed (excluding independent
> >>>> SSD journals) and:
> >>>> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
> >>>> for reference.
> >>>> 
> >>>> I presume your intention for the 1TB SSDs is for an SSD-backed pool?
> >>>> Note that the EVOs have a pretty low (guaranteed) endurance, so
> >>>> aside from needing journal SSDs that can actually do the job, you're
> >>>> looking at wearing them out rather quickly (depending on your use
> >>>> case, of course).
> >>>> 
> >>>> Now, with SSD-based OSDs, or even HDD-based OSDs with SSD journals,
> >>>> that CPU looks a bit anemic.
> >>>> 
> >>>> More below:
> >>>>> The two options we are considering are:
> >>>>> 
> >>>>> 1) Use two of the 1TB SSDs for the spinning disk journals (3 each)
> >>>>> and then use the remaining 900+GB of each drive as an OSD to be
> >>>>> part of the cache pool.
> >>>>> 
> >>>>> 2) Put the spinning disk journals on the OS SSDs and use the 2 1TB
> >>>>> SSDs for the cache pool.
> >>>>> 
> >>>> Cache pools aren't all that speedy currently (research the ML
> >>>> archives), even less so with the SSDs you have.
> >>>> 
> >>>> Christian
> >>>> 
> >>>>> In both cases the other 4 1TB SSDs will be part of their own tier.
> >>>>> 
> >>>>> Thanks a lot!
> >>>>> 
> >>>>> Cameron Scrace
> >>>>> 
> >>>>> 
> >>>>> 
> >>>>> From:   Christian Balzer <chibi@xxxxxxx>
> >>>>> To:     "ceph-users@xxxxxxxx" <ceph-users@xxxxxxxx>
> >>>>> Cc:     Cameron.Scrace@xxxxxxxxxxxx
> >>>>> Date:   08/06/2015 12:18 p.m.
> >>>>> Subject:        Re:  Multiple journals and an OSD on
> >>>>> one SSD doable?
> >>>>> 
> >>>>> 
> >>>>> 
> >>>>> 
> >>>>> Hello,
> >>>>> 
> >>>>> 
> >>>>> On Mon, 8 Jun 2015 09:55:56 +1200 Cameron.Scrace@xxxxxxxxxxxx
> >>>>> wrote:
> >>>>> 
> >>>>>> The other option we were considering was putting the journals on
> >>>>>> the OS SSDs; they are only 250GB and the rest would be for the
> >>>>>> OS. Is that a decent option?
> >>>>>> 
> >>>>> You'll get a LOT better advice if you tell us more details.
> >>>>> 
> >>>>> For starters, have you bought the hardware yet?
> >>>>> Tell us about your design, how many initial storage nodes, how many
> >>>>> HDDs/SSDs per node, what CPUs/RAM/network?
> >>>>> 
> >>>>> What SSDs are we talking about? Exact models, please.
> >>>>> (Neither of the sizes you mentioned rings a bell for any DC-level
> >>>>> SSDs I'm aware of.)
> >>>>> 
> >>>>> That said, I'm using Intel DC S3700s for mixed OS and journal use
> >>>>> with good results.
> >>>>> In your average Ceph storage node, normal OS activity (mostly
> >>>>> logging) is a minute drop in the bucket for any decent SSD, so
> >>>>> nearly all of its resources are available to journals.
> >>>>> 
> >>>>> You want to match the number of journals per SSD to the
> >>>>> capabilities of your SSDs, HDDs and network.
> >>>>> 
> >>>>> For example, 8 HDD OSDs with 2 200GB DC S3700s and a 10Gb/s network
> >>>>> is a decent match.
> >>>>> The two SSDs at 900MB/s would appear to be the bottleneck, but in
> >>>>> reality I'd expect the HDDs to be it.
> >>>>> Never mind that you'd be more likely to be IOPS-bound than
> >>>>> bandwidth-bound.
> >>>>> 
> >>>>> Regards,
> >>>>> 
> >>>>> Christian
> >>>>> 
> >>>>>> Thanks!
> >>>>>> 
> >>>>>> Cameron Scrace
> >>>>>> 
> >>>>>> 
> >>>>>> 
> >>>>>> From:   Somnath Roy <Somnath.Roy@xxxxxxxxxxx>
> >>>>>> To:     "Cameron.Scrace@xxxxxxxxxxxx" <Cameron.Scrace@xxxxxxxxxxxx>,
> >>>>>>         "ceph-users@xxxxxxxx" <ceph-users@xxxxxxxx>
> >>>>>> Date:   08/06/2015 09:34 a.m.
> >>>>>> Subject:        RE:  Multiple journals and an OSD on one SSD
> >>>>>> doable?
> >>>>>> 
> >>>>>> 
> >>>>>> 
> >>>>>> Cameron,
> >>>>>> Generally, it’s not a good idea.
> >>>>>> You want to protect the SSDs used as journals. If anything goes
> >>>>>> wrong with that disk, you will lose all of the dependent OSDs.
> >>>>>> I don’t think a bigger journal will gain you much performance, so
> >>>>>> the default 5 GB journal size should be good enough. If you want
> >>>>>> to reduce the fault domain and want to put 3 journals on an SSD,
> >>>>>> go for minimum-size, high-endurance SSDs for that.
> >>>>>> Now, if you want to use the rest of the space on the 1TB SSDs,
> >>>>>> creating just OSDs there will not gain you much (you may rather
> >>>>>> get some burst performance). You may want to consider the
> >>>>>> following.
> >>>>>> 
> >>>>>> 1. If your spindle OSD size is much bigger than 900 GB and you
> >>>>>> don’t want to make all OSDs of similar sizes, a cache pool could
> >>>>>> be one of your options. But remember, a cache pool can wear out
> >>>>>> your SSDs faster, as I guess it presently does not optimize away
> >>>>>> the extra writes. Sorry, I don’t have exact data as I am yet to
> >>>>>> test that out.
> >>>>>> 
> >>>>>> 2. If you want to make all the OSDs of similar sizes and you will
> >>>>>> be able to create a substantial number of OSDs with your unused
> >>>>>> SSDs (depending on how big the cluster is), you may want to put
> >>>>>> all of your primary OSDs on SSD and gain a significant performance
> >>>>>> boost for reads. Also, in this case, I don’t think you will be
> >>>>>> getting any burst performance.
> >>>>>> Thanks & Regards
> >>>>>> Somnath
> >>>>>> 
> >>>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
> >>>>>> Behalf Of Cameron.Scrace@xxxxxxxxxxxx
> >>>>>> Sent: Sunday, June 07, 2015 1:49 PM
> >>>>>> To: ceph-users@xxxxxxxx
> >>>>>> Subject:  Multiple journals and an OSD on one SSD doable?
> >>>>>> 
> >>>>>> Setting up a Ceph cluster and we want the journals for our
> >>>>>> spinning disks to be on SSDs, but all of our SSDs are 1TB. We were
> >>>>>> planning on putting 3 journals on each SSD, but that leaves 900+GB
> >>>>>> unused on the drive. Is it possible to use the leftover space as
> >>>>>> another OSD, or will it affect performance too much?
> >>>>>> 
> >>>>>> Thanks, 
> >>>>>> 
> >>>>>> Cameron Scrace
> >>>>>> 
> >>>>>> 
> >>>>>> PLEASE NOTE: The information contained in this electronic mail
> >>>>>> message is intended only for the use of the designated 
> >> recipient(s)
> >>>>>> named above. If the reader of this message is not the intended
> >>>>>> recipient, you are hereby notified that you have received this
> >>>>>> message in error and that any review, dissemination,
> >>>>>> distribution, or copying of this message is strictly prohibited.
> >>>>>> If you have received this communication in error, please notify
> >>>>>> the sender by telephone or e-mail (as shown above) immediately
> >>>>>> and destroy any and all copies of this message in your
> >>>>>> possession (whether hard copies or electronically stored copies).
> >>>>>> 
> >>>>>> 
> >>>>>> 
> >>>>>> Attention:
> >>>>>> This email may contain information intended for the sole use of
> >>>>>> the original recipient. Please respect this when sharing or
> >>>>>> disclosing this email's contents with any third party. If you
> >>>>>> believe you have received this email in error, please delete it
> >>>>>> and notify the sender or postmaster@xxxxxxxxxxxxxxxxxxxxx as
> >>>>>> soon as possible. The content of this email does not necessarily
> >>>>>> reflect the views of Solnet Solutions Ltd.
> >>>>>> 
> >>>>> 
> >>>>> 
> >>>> 
> >>>> 
> >>> 
> >>> 
> >> 
> >> 
> > 
> > 
> > -- 
> > Christian Balzer        Network/Systems Engineer                
> > chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
> > http://www.gol.com/
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




