Re: SSDs for journals vs SSDs for a cache tier, which is better?

The rule of thumb is to match the journal throughput to the OSD throughput.  I'm seeing ~180MB/s sequential write on my OSDs and I'm using one of the P3700 400GB units per six OSDs.  The 400GB P3700 yields around 1200MB/s* and has around 1/10th the latency of any SATA SSD I've tested.
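
As a rough back-of-the-envelope sketch of that sizing (the per-OSD and journal
figures are the ones above; treat them as placeholders for your own
measurements):

    # Journal-to-OSD throughput matching (illustrative arithmetic only).
    osd_write_mb_s = 180        # measured sequential write per HDD-backed OSD
    journal_write_mb_s = 1200   # measured 400GB P3700 sequential write
    osds_per_journal = 6

    demand = osd_write_mb_s * osds_per_journal   # 1080 MB/s aggregate
    headroom = journal_write_mb_s - demand       # 120 MB/s to spare
    print(f"aggregate OSD demand: {demand} MB/s")
    print(f"journal headroom:     {headroom} MB/s")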

I put a pair of them in a 12-drive chassis and get excellent performance.  One could probably do the same in an 18-drive chassis without any issues, though the failure domain for a journal starts to get pretty large at that point.  I have dozens of the "Fultondale" SSDs deployed and have had zero failures.  Endurance is excellent, etc.

*The larger units yield much better write throughput, but don't make sense financially for journals.

-H



On Mar 16, 2016, at 09:37, Nick Fisk <nick@xxxxxxxxxx> wrote:

>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
>> Stephen Harker
>> Sent: 16 March 2016 16:22
>> To: ceph-users@xxxxxxxxxxxxxx
>> Subject: Re: SSDs for journals vs SSDs for a cache tier, which is better?
>> 
>>> On 2016-02-17 11:07, Christian Balzer wrote:
>>> 
>>> On Wed, 17 Feb 2016 10:04:11 +0100 Piotr Wachowicz wrote:
>>> 
>>>>>> Let's consider both cases:
>>>>>> Journals on SSDs - for writes, the write operation returns right
>>>>>> after data lands on the Journal's SSDs, but before it's written
>>>>>> to the backing HDD. So, for writes, SSD journal approach should
>>>>>> be comparable to having an SSD cache tier.
>>>>> Not quite, see below.
>>>> Could you elaborate a bit more?
>>>> 
>>>> Are you saying that with a Journal on a SSD writes from clients,
>>>> before they can return from the operation to the client, must end up
>>>> on both the SSD (Journal) *and* HDD (actual data store behind that
>>>> journal)?
>>> 
>>> No, your initial statement is correct.
>>> 
>>> However, that burst of speed doesn't last indefinitely.
>>> 
>>> Aside from the size of the journal (which is incidentally NOT the most
>>> limiting factor), there are various "filestore" parameters in Ceph, in
>>> particular the sync interval ones.
>>> There was a more in-depth explanation of this by a developer on this ML;
>>> try your google-fu.
>>> 
>>> For short bursts of activity, the journal helps a LOT.
>>> If you send a huge number of, for example, 4KB writes to your cluster,
>>> the speed will eventually (after a few seconds) drop to what your
>>> backing storage (HDDs) is capable of sustaining.
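>>> 
>>> As a toy illustration of that behaviour (this is not Ceph code, and all
>>> the numbers are invented): the journal absorbs a burst until a backlog
>>> cap, after which the observed rate collapses to HDD speed.
>>> 
>>>     # Toy model of journal burst absorption (illustrative only).
>>>     client_mb_s = 400       # incoming burst of small writes, aggregated
>>>     hdd_mb_s = 120          # sustained backing-store (HDD) drain rate
>>>     max_backlog_mb = 800    # invented cap standing in for journal size
>>>                             # and the filestore sync interval
>>> 
>>>     backlog = 0.0
>>>     for second in range(10):
>>>         rate = client_mb_s if backlog < max_backlog_mb else hdd_mb_s
>>>         backlog = max(0.0, backlog + rate - hdd_mb_s)
>>>         print(f"t={second}s observed={rate} MB/s backlog={backlog:.0f} MB")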
>>> 
>>>>> (Which SSDs do you plan to use anyway?)
>>>> 
>>>> Intel DC S3700
>>> Good choice; with the 200GB model, prefer the S3700 over the S3710
>>> (higher sequential write speed).
>> 
>> Hi All,
>> 
>> I am looking at using PCI-E SSDs as journals in our (4) Ceph OSD nodes, each
>> of which has six 4TB SATA drives. I had my eye on these:
>> 
>> 400GB Intel P3500 DC AIC SSD, HHHL PCIe 3.0
>> 
>> but reading through this thread, it might be better to go with the P3700 given
>> the improved IOPS. So, a couple of questions:
>> 
>> * Are the PCI-E versions of these drives different in any way other than
>> the interface?
> 
> Yes and no. Internally they are probably not much different, but the
> NVMe/PCIe interface is a lot faster than SATA/SAS, both in terms of minimum
> latency and bandwidth.
> 
>> 
>> * Would one of these as a journal for six 4TB OSDs be overkill (connectivity is
>> 10GbE, or will be shortly), or would the SATA S3700 be sufficient?
> 
> Again, it depends on your use case. The S3700 may suffer if you are doing
> large sequential writes: it might not have a high enough sequential write
> speed and will become the bottleneck. Six disks could potentially take around
> 500-700MB/s of writes. A P3700 will have enough headroom, and will give
> slightly lower write latency as well if that is important. You may even be
> able to run more than six OSD disks on it if needed.
> 
>> 
>> Given they're not hot-swappable, it'd be good if they didn't wear out in
>> six months, either.
> 
> They probably won't, unless you are doing some really extreme write
> workloads, and even then I would imagine they would last 1-2 years.
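> 
> As a quick wear estimate (a sketch: the endurance figure is Intel's rated
> ~7.3PB written for the 400GB P3700; the daily write volume is a made-up
> placeholder - measure your own):
> 
>     # Rough SSD-journal endurance arithmetic (illustrative only).
>     rated_endurance_tb = 7300    # ~7.3PB written, 400GB P3700 rating
>     journal_tb_per_day = 5       # hypothetical cluster write volume
>     years = rated_endurance_tb / journal_tb_per_day / 365
>     print(f"estimated journal lifetime: {years:.1f} years")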
> 
>> 
>> I realise I've not given you much to go on, and I'm Googling around as well;
>> I'm really just asking in case someone has tried this already and has some
>> feedback or advice.
> 
> That's OK. I'm currently running 100GB S3700s on the current cluster, and the
> new cluster that's in the planning stages will be using 400GB P3700s.
> 
>> 
>> Thanks! :)
>> 
>> Stephen
>> 
>> --
>> Stephen Harker
>> Chief Technology Officer
>> The Positive Internet Company.
>> 
>> --
>> All postal correspondence to:
>> The Positive Internet Company, 24 Ganton Street, London. W1F 7QY
>> 
>> *Follow us on Twitter* @posipeople
>> 
>> The Positive Internet Company Limited is registered in England and Wales.
>> Registered company number: 3673639. VAT no: 726 7072 28.
>> Registered office: Northside House, Mount Pleasant, Barnet, Herts, EN4 9EE.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


