Re: Multiple journals and an OSD on one SSD doable?


 



> On 08 Jun 2015, at 10:07, Christian Balzer <chibi@xxxxxxx> wrote:
> 
> On Mon, 8 Jun 2015 09:44:54 +0200 Jan Schermer wrote:
> 
>> I recently did some testing of a few SSDs and found some surprising, and
>> some not so surprising things:
>> 
>> 1) performance varies wildly with firmware, especially with cheaper
>> drives 
>> 2) performance varies with time - even with S3700 - slows down
>> after ~40-80GB and then creeps back up 
>> 3) cheaper drives have almost no gain from bigger queues (at least with
>> fio iodepth=XX) 
>> 4) all drives I tested reach higher IOPS for direct and
>> synchronous writes with iodepth=1 when write cache is *DISABLED* (even
>> S3700)
>> - I suspect compression and write coalescing are disabled
>> - Intel S3700 can reach almost the same IOPS with higher queue depths,
>> but that’s sadly not a real scenario
>> - in any case, disabling write cache doesn’t help real workloads on my
>> cluster 
> 
>> 5) write amplification is worst for synchronous direct writes,
>> so it’s much better to collocate journal and data on the same SSD if you
>> worry about DWPD endurance rating
>> 
> Or in his case _not_ have them on the low endurance EVOs. ^_-

Not necessarily - drives have to write the data sooner or later; writing the same block twice doesn’t always mean draining the DWPD rating at twice the rate. In fact, with 32KiB writes it might come out almost the same, because SSDs have much larger internal blocks anyway…
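
(For reference, this is roughly how I watch the average request size on the journal device - /dev/sdX is just a placeholder here:

  iostat -x 5 /dev/sdX

On older sysstat versions the column is avgrq-sz, reported in 512-byte sectors, so ~64 means ~32KiB; newer versions report rareq-sz/wareq-sz directly in kB.)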

If there is an async IO already queued (the data) and another synchronous IO comes in for the journal, it might become a single atomic operation on the SSD, because the drive has to flush everything it has anyway.
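
(To put numbers behind point 4 above, the kind of iodepth=1 run I mean looks roughly like this - just a sketch, /dev/sdX is a placeholder and the run destroys data on that device:

  # baseline: volatile write cache enabled
  hdparm -W 1 /dev/sdX
  fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
      --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=wcache-on

  # same run with the write cache disabled
  hdparm -W 0 /dev/sdX
  fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
      --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=wcache-off

Comparing the iops numbers from the two runs is the kind of difference I'm describing above.)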


> 
>> and bottom line
>> 6) CEPH doesn’t utilize the SSD capabilities at all on my Dumpling - I
>> hope it will be better on Giant :)
>> 7) ext4 has much higher throughput for my benchmarks when not using a
>> raw device
>> 8) I lose ~50% IOPS on XFS compared to block device - ext4 loses ~10%
>> 
> I can confirm that from my old (emperor days) comparison between EXT4 and
> XFS.

Do you have the benchmark somewhere? This wasn’t really what I was testing, so some numbers would be very helpful…
Did you test the various ext4 options?
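
(The kind of ext4 variations I have in mind - a sketch only, the device and mount point are placeholders, not recommendations, and you'd remount between runs:

  mount -o noatime,data=ordered   /dev/sdX1 /mnt/test
  mount -o noatime,data=writeback /dev/sdX1 /mnt/test
  mount -o noatime,data=journal   /dev/sdX1 /mnt/test

  fio --filename=/mnt/test/journal-test --size=4G --direct=1 --sync=1 \
      --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 \
      --time_based --name=ext4-test

versus the same fio line pointed straight at the raw /dev/sdX1.)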

Jan

> 
> Christian
> 
>> btw the real average request size on my SSDs is only about 32KiB
>> (journal+data on the same device)
>> 
>> Jan
>> 
>>> On 08 Jun 2015, at 09:33, Christian Balzer <chibi@xxxxxxx> wrote:
>>> 
>>> 
>>> Hello,
>>> 
>>> On Mon, 8 Jun 2015 18:01:28 +1200 Cameron.Scrace@xxxxxxxxxxxx wrote:
>>> 
>>>> Just used the method in the link you sent me to test one of the EVO
>>>> 850s, with one job it reached a speed of around 2.5MB/s but it didn't
>>>> max out until around 32 jobs at 24MB/s: 
>>>> 
>>> I'm not the author of that page, nor did I verify that they used a
>>> uniform methodology/environment, nor do I think that this test is a
>>> particularly close approximation of what I see Ceph doing in reality
>>> (it seems to write much larger chunks than 4KB).
>>> I'd suggest keeping numjobs to 1 and ramping up the block size to 4MB,
>>> see where you max out with that. 
>>> I can reach the theoretical max speed of my SSDs (350MB/s) at 4MB
>>> blocks, but it's already at 90% with 1MB.
>>> 
>>> That test does however produce interesting numbers that seem to be
>>> consistent by themselves and match what people have been reporting here.
>>> 
>>> I can get 22MB/s with just one fio job with the above setting (alas on
>>> filesystem, no spare raw partition right now) on a DC S3700 200GB SSD,
>>> directly connected to an onboard Intel SATA-3 port.
>>> 
>>> Now think about what that means: a fio numjob is equivalent to an OSD
>>> daemon, so in this worst-case 4KB scenario my journal, and thus my OSD,
>>> would be 10 times faster than yours.
>>> Food for thought.
>>> 
>>>> sudo fio --filename=/dev/sdh --direct=1 --sync=1 --rw=write --bs=4k 
>>>> --numjobs=32 --iodepth=1 --runtime=60 --time_based --group_reporting 
>>>> --name=journal-test
>>>> write: io=1507.4MB, bw=25723KB/s, iops=6430, runt= 60007msec
>>>> 
>>>> Also tested a Micron 550 we had sitting around and it maxed out at 
>>>> 2.5MB/s; both results conflict with the chart.
>>>> 
>>> Note that they disabled the on-SSD and controller caches; the former of
>>> course messes things up where that isn't needed.
>>> 
>>> I'd suggest you go and do a test install of Ceph with your HW and test
>>> that.
>>> Paying close attention to your SSD utilization with atop or iostat,
>>> etc.
>>> 
>>> Christian
>>> 
>>>> Regards,
>>>> 
>>>> Cameron Scrace
>>>> Infrastructure Engineer
>>>> 
>>>> Mobile +64 22 610 4629
>>>> Phone  +64 4 462 5085 
>>>> Email  cameron.scrace@xxxxxxxxxxxx
>>>> Solnet Solutions Limited
>>>> Level 12, Solnet House
>>>> 70 The Terrace, Wellington 6011
>>>> PO Box 397, Wellington 6140
>>>> 
>>>> www.solnet.co.nz
>>>> 
>>>> 
>>>> 
>>>> From:   Christian Balzer <chibi@xxxxxxx>
>>>> To:     "ceph-users@xxxxxxxx" <ceph-users@xxxxxxxx>
>>>> Cc:     Cameron.Scrace@xxxxxxxxxxxx
>>>> Date:   08/06/2015 02:40 p.m.
>>>> Subject:        Re:  Multiple journals and an OSD on one
>>>> SSD doable?
>>>> 
>>>> 
>>>> 
>>>> On Mon, 8 Jun 2015 14:30:17 +1200 Cameron.Scrace@xxxxxxxxxxxx wrote:
>>>> 
>>>>> Thanks for all the feedback. 
>>>>> 
>>>>> What makes the EVOs unusable? They should have plenty of speed but
>>>>> your link has them at 1.9MB/s, is it just the way they handle
>>>>> O_DIRECT and D_SYNC? 
>>>>> 
>>>> Precisely. 
>>>> Read that ML thread for details.
>>>> 
>>>> And once more, they also are not very endurable.
>>>> So depending on your usage pattern and the write amplification from
>>>> Ceph (Ceph itself and the underlying FS), their TBW/$ will be horrible,
>>>> costing you more in the end than more expensive, but an order of
>>>> magnitude more endurable, DC SSDs. 
>>>> 
>>>>> Not sure if we will be able to spend anymore, we may just have to
>>>>> take the performance hit until we can get more money for the project.
>>>>> 
>>>> You could cheap out with 200GB DC S3700s (half the price), but they
>>>> will definitely become the bottleneck at a combined max speed of about
>>>> 700MB/s, as opposed to the 400GB ones at 900MB/s combined.
>>>> 
>>>> Christian
>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Cameron Scrace
>>>>> Infrastructure Engineer
>>>>> 
>>>>> Mobile +64 22 610 4629
>>>>> Phone  +64 4 462 5085 
>>>>> Email  cameron.scrace@xxxxxxxxxxxx
>>>>> Solnet Solutions Limited
>>>>> Level 12, Solnet House
>>>>> 70 The Terrace, Wellington 6011
>>>>> PO Box 397, Wellington 6140
>>>>> 
>>>>> www.solnet.co.nz
>>>>> 
>>>>> 
>>>>> 
>>>>> From:   Christian Balzer <chibi@xxxxxxx>
>>>>> To:     "ceph-users@xxxxxxxx" <ceph-users@xxxxxxxx>
>>>>> Cc:     Cameron.Scrace@xxxxxxxxxxxx
>>>>> Date:   08/06/2015 02:00 p.m.
>>>>> Subject:        Re:  Multiple journals and an OSD on one
>>>>> SSD 
>>>> 
>>>>> doable?
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> Cameron,
>>>>> 
>>>>> To offer at least some constructive advice here instead of just all
>>>>> doom and gloom, here's what I'd do:
>>>>> 
>>>>> Replace the OS SSDs with 2 400GB Intel DC S3700s (or S3710s).
>>>>> They have enough BW to nearly saturate your network.
>>>>> 
>>>>> Put all your journals on them (3 SSD OSD and 3 HDD OSD per). 
>>>>> While that's a bad move from a failure domain perspective, your
>>>>> budget probably won't allow for anything better, and those are VERY
>>>>> reliable and, just as important, durable SSDs. 
>>>>> 
>>>>> This will give you the speed your current setup is capable of,
>>>>> probably limited by the CPU when it comes to SSD pool operations.
>>>>> 
>>>>> Christian
>>>>> 
>>>>> On Mon, 8 Jun 2015 10:44:06 +0900 Christian Balzer wrote:
>>>>> 
>>>>>> 
>>>>>> Hello Cameron,
>>>>>> 
>>>>>> On Mon, 8 Jun 2015 13:13:33 +1200 Cameron.Scrace@xxxxxxxxxxxx wrote:
>>>>>> 
>>>>>>> Hi Christian,
>>>>>>> 
>>>>>>> Yes, we have purchased all our hardware; it was very hard to
>>>>>>> convince management/finance to approve it, so some of the stuff we
>>>>>>> have is a bit cheap.
>>>>>>> 
>>>>>> Unfortunate. Both the done deal and the cheapness. 
>>>>>> 
>>>>>>> We have four storage nodes, each with 6 x 6TB Western Digital Red
>>>>>>> SATA drives (WD60EFRX-68M), 6 x 1TB Samsung EVO 850 SSDs, and
>>>>>>> 2 x 250GB Samsung EVO 850s (for OS RAID).
>>>>>>> CPUs are Intel Atom C2750  @ 2.40GHz (8 Cores) with 32 GB of RAM. 
>>>>>>> We have a 10Gig Network.
>>>>>>> 
>>>>>> I wish there was a nice way to say this, but it unfortunately boils
>>>>>> down to a "You're fooked".
>>>>>> 
>>>>>> There have been many discussions about which SSDs are usable with
>>>>>> Ceph, very recently as well.
>>>>>> Samsung EVOs (the non DC type for sure) are basically unusable for
>>>>>> journals. See the recent thread:
>>>>>> Possible improvements for a slow write speed (excluding independent
>>>>>> SSD journals) and:
>>>>>> 
>>>>>> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>>>>>> 
>>>>>> for reference.
>>>>>> 
>>>>>> I presume your intention for the 1TB SSDs is for an SSD backed pool? 
>>>>>> Note that the EVOs have a pretty low (guaranteed) endurance, so
>>>>>> aside from needing journal SSDs that actually can do the job, you're
>>>>>> looking at wearing them out rather quickly (depending on your use
>>>>>> case of course).
>>>>>> 
>>>>>> Now with SSD based OSDs or even HDD based OSDs with SSD journals
>>>>>> that CPU looks a bit anemic.
>>>>>> 
>>>>>> More below:
>>>>>>> The two options we are considering are:
>>>>>>> 
>>>>>>> 1) Use two of the 1TB SSDs for the spinning disk journals (3 each)
>>>>>>> and then use the remaining 900+GB of each drive as an OSD to be
>>>>>>> part of the cache pool.
>>>>>>> 
>>>>>>> 2) Put the spinning disk journals on the OS SSDs and use the 2 1TB
>>>>>>> SSDs for the cache pool.
>>>>>>> 
>>>>>> Cache pools aren't all that speedy currently (research the ML
>>>>>> archives), even less so with the SSDs you have.
>>>>>> 
>>>>>> Christian
>>>>>> 
>>>>>>> In both cases the other 4 1TB SSDs will be part of their own tier.
>>>>>>> 
>>>>>>> Thanks a lot!
>>>>>>> 
>>>>>>> Cameron Scrace
>>>>>>> Infrastructure Engineer
>>>>>>> 
>>>>>>> Mobile +64 22 610 4629
>>>>>>> Phone  +64 4 462 5085 
>>>>>>> Email  cameron.scrace@xxxxxxxxxxxx
>>>>>>> Solnet Solutions Limited
>>>>>>> Level 12, Solnet House
>>>>>>> 70 The Terrace, Wellington 6011
>>>>>>> PO Box 397, Wellington 6140
>>>>>>> 
>>>>>>> www.solnet.co.nz
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> From:   Christian Balzer <chibi@xxxxxxx>
>>>>>>> To:     "ceph-users@xxxxxxxx" <ceph-users@xxxxxxxx>
>>>>>>> Cc:     Cameron.Scrace@xxxxxxxxxxxx
>>>>>>> Date:   08/06/2015 12:18 p.m.
>>>>>>> Subject:        Re:  Multiple journals and an OSD on
>>>>>>> one SSD doable?
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Hello,
>>>>>>> 
>>>>>>> 
>>>>>>> On Mon, 8 Jun 2015 09:55:56 +1200 Cameron.Scrace@xxxxxxxxxxxx
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> The other option we were considering was putting the journals on
>>>>>>>> the OS SSDs, they are only 250GB and the rest would be for the
>>>>>>>> OS. Is that a decent option?
>>>>>>>> 
>>>>>>> You'll get a LOT better advice if you tell us more details.
>>>>>>> 
>>>>>>> For starters, have you bought the hardware yet?
>>>>>>> Tell us about your design, how many initial storage nodes, how many
>>>>>>> HDDs/SSDs per node, what CPUs/RAM/network?
>>>>>>> 
>>>>>>> What SSDs are we talking about, exact models please.
>>>>>>> (Both the sizes you mentioned do not ring a bell for DC level SSDs
>>>>>>> I'm aware of)
>>>>>>> 
>>>>>>> That said, I'm using Intel DC S3700s for mixed OS and journal use
>>>>>>> with good results. 
>>>>>>> In your average Ceph storage node, normal OS (logging mostly)
>>>>>>> activity is a minute drop in the bucket for any decent SSD, so
>>>>>>> nearly all of its resources are available to journals.
>>>>>>> 
>>>>>>> You want to match the number of journals per SSD according to the
>>>>>>> capabilities of your SSD, HDDs and network.
>>>>>>> 
>>>>>>> For example 8 HDD OSDs with 2 200GB DC S3700 and a 10Gb/s network
>>>>>>> is a decent match. 
>>>>>>> The two SSDs at 900MB/s would appear to be the bottleneck, but in
>>>>>>> reality I'd expect the HDDs to be it.
>>>>>>> Never mind that you'd be more likely to be IOPS than bandwidth
>>>>>>> bound.
>>>>>>> 
>>>>>>> Regards,
>>>>>>> 
>>>>>>> Christian
>>>>>>> 
>>>>>>>> Thanks!
>>>>>>>> 
>>>>>>>> Cameron Scrace
>>>>>>>> Infrastructure Engineer
>>>>>>>> 
>>>>>>>> Mobile +64 22 610 4629
>>>>>>>> Phone  +64 4 462 5085 
>>>>>>>> Email  cameron.scrace@xxxxxxxxxxxx
>>>>>>>> Solnet Solutions Limited
>>>>>>>> Level 12, Solnet House
>>>>>>>> 70 The Terrace, Wellington 6011
>>>>>>>> PO Box 397, Wellington 6140
>>>>>>>> 
>>>>>>>> www.solnet.co.nz
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> From:   Somnath Roy <Somnath.Roy@xxxxxxxxxxx>
>>>>>>>> To:     "Cameron.Scrace@xxxxxxxxxxxx"
>>>>>>>> <Cameron.Scrace@xxxxxxxxxxxx>,
>>>>>>>> "ceph-users@xxxxxxxx" <ceph-users@xxxxxxxx>
>>>>>>>> Date:   08/06/2015 09:34 a.m.
>>>>>>>> Subject:        RE:  Multiple journals and an OSD on
>>>>>>>> one SSD 
>>>>>>> 
>>>>>>>> doable?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Cameron,
>>>>>>>> Generally, it’s not a good idea. 
>>>>>>>> You want to protect the SSDs used as journals: if there is any
>>>>>>>> problem on that disk, you will lose all of the dependent OSDs.
>>>>>>>> I don’t think a bigger journal will gain you much performance, so
>>>>>>>> the default 5 GB journal size should be good enough. If you want to
>>>>>>>> reduce the fault domain and want to put 3 journals on an SSD, go
>>>>>>>> for minimum size and high endurance SSDs for that.
>>>>>>>> Now, if you want to use the rest of the space on the 1 TB SSDs,
>>>>>>>> creating just OSDs there will not gain you much (you may rather get
>>>>>>>> some burst performance). You may want to consider the following.
>>>>>>>> 
>>>>>>>> 1. If your spindle OSD size is much bigger than 900 GB, you won't
>>>>>>>> want to make all OSDs of similar sizes; a cache pool could be one
>>>>>>>> of your options. But remember, a cache pool can wear out your SSDs
>>>>>>>> faster, as presently I guess it is not optimizing the extra
>>>>>>>> writes. Sorry, I don’t have exact data as I am yet to test that
>>>>>>>> out.
>>>>>>>> 
>>>>>>>> 2. If you want to make all the OSDs of similar sizes and you will
>>>>>>>> be able to create a substantial number of OSDs with your unused
>>>>>>>> SSDs (depends on how big the cluster is), you may want to put all
>>>>>>>> of your primary OSDs on SSD and gain a significant performance
>>>>>>>> boost for reads. Also, in this case, I don’t think you will be
>>>>>>>> getting any burst performance. 
>>>>>>>> Thanks & Regards
>>>>>>>> Somnath
>>>>>>>> 
>>>>>>>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
>>>>>>>> Behalf Of Cameron.Scrace@xxxxxxxxxxxx
>>>>>>>> Sent: Sunday, June 07, 2015 1:49 PM
>>>>>>>> To: ceph-users@xxxxxxxx
>>>>>>>> Subject:  Multiple journals and an OSD on one SSD doable?
>>>>>>>> 
>>>>>>>> Setting up a Ceph cluster and we want the journals for our spinning
>>>>>>>> disks to be on SSDs but all of our SSDs are 1TB. We were planning
>>>>>>>> on putting 3 journals on each SSD, but that leaves 900+GB unused
>>>>>>>> on the drive, is it possible to use the leftover space as another
>>>>>>>> OSD or will it affect performance too much? 
>>>>>>>> 
>>>>>>>> Thanks, 
>>>>>>>> 
>>>>>>>> Cameron Scrace
>>>>>>>> Infrastructure Engineer
>>>>>>>> 
>>>>>>>> Mobile +64 22 610 4629
>>>>>>>> Phone  +64 4 462 5085 
>>>>>>>> Email  cameron.scrace@xxxxxxxxxxxx
>>>>>>>> Solnet Solutions Limited
>>>>>>>> Level 12, Solnet House
>>>>>>>> 70 The Terrace, Wellington 6011
>>>>>>>> PO Box 397, Wellington 6140
>>>>>>>> 
>>>>>>>> www.solnet.co.nz
>>>>>>>> 
>>>>>>>> Attention: This email may contain information intended for the
>>>>>>>> sole use of the original recipient. Please respect this when
>>>>>>>> sharing or disclosing this email's contents with any third party.
>>>>>>>> If you believe you have received this email in error, please
>>>>>>>> delete it and notify the sender or
>>>>>>>> postmaster@xxxxxxxxxxxxxxxxxxxxx as soon as possible. The content
>>>>>>>> of this email does not necessarily reflect the views of Solnet
>>>>>>>> Solutions Ltd. 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> PLEASE NOTE: The information contained in this electronic mail
>>>>>>>> message is intended only for the use of the designated 
>>>>>>>> recipient(s)
>>>>>>>> named above. If the reader of this message is not the intended
>>>>>>>> recipient, you are hereby notified that you have received this
>>>>>>>> message in error and that any review, dissemination,
>>>>>>>> distribution, or copying of this message is strictly prohibited.
>>>>>>>> If you have received this communication in error, please notify
>>>>>>>> the sender by telephone or e-mail (as shown above) immediately
>>>>>>>> and destroy any and all copies of this message in your
>>>>>>>> possession (whether hard copies or electronically stored copies).
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Attention:
>>>>>>>> This email may contain information intended for the sole use of
>>>>>>>> the original recipient. Please respect this when sharing or
>>>>>>>> disclosing this email's contents with any third party. If you
>>>>>>>> believe you have received this email in error, please delete it
>>>>>>>> and notify the sender or postmaster@xxxxxxxxxxxxxxxxxxxxx as
>>>>>>>> soon as possible. The content of this email does not necessarily
>>>>>>>> reflect the views of Solnet Solutions Ltd.
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> -- 
>>> Christian Balzer        Network/Systems Engineer                
>>> chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
>>> http://www.gol.com/
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@xxxxxxxxxxxxxx
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
>> 
> 
> 
> -- 
> Christian Balzer        Network/Systems Engineer                
> chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
> http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com






