Re: SSDs for journals vs SSDs for a cache tier, which is better?

Thanks all for your suggestions and advice. I'll let you know how it goes :)

Stephen

On 2016-03-16 16:58, Heath Albritton wrote:
The rule of thumb is to match the journal throughput to the OSD
throughput.  I'm seeing ~180MB/s sequential write on my OSDs and I'm
using one of the P3700 400GB units per six OSDs.  The 400GB P3700
yields around 1200MB/s* and has around 1/10th the latency of any SATA
SSD I've tested.
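
As a back-of-envelope sketch of that ratio (the figures are just the
ones above, so adjust for your own hardware):

    # a journal device can serve roughly journal_bw / per_osd_bw OSDs
    # before it becomes the bottleneck
    journal_write_bw = 1200.0  # MB/s, approx. sequential write of a 400GB P3700
    per_osd_write_bw = 180.0   # MB/s, sequential write of one spinning OSD
    print(round(journal_write_bw / per_osd_write_bw, 1))  # ~6.7, so six OSDs fit comfortably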

I put a pair of them in a 12-drive chassis and get excellent
performance.  One could probably do the same in an 18-drive chassis
without any issues.  Failure domain for a journal starts to get pretty
large at that point.  I have dozens of the "Fultondale" SSDs deployed
and have had zero failures.  Endurance is excellent, etc.

*the larger units yield much better write throughput but don't make
sense financially for journals.

-H

On Mar 16, 2016, at 09:37, Nick Fisk <nick@xxxxxxxxxx> wrote:

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
Stephen Harker
Sent: 16 March 2016 16:22
To: ceph-users@xxxxxxxxxxxxxx
Subject: Re:  SSDs for journals vs SSDs for a cache tier, which is better?

On 2016-02-17 11:07, Christian Balzer wrote:

On Wed, 17 Feb 2016 10:04:11 +0100 Piotr Wachowicz wrote:

Let's consider both cases:
Journals on SSDs - for writes, the write operation returns right
after data lands on the journal's SSD, but before it's written
to the backing HDD. So, for writes, the SSD journal approach should
be comparable to having an SSD cache tier.

Not quite, see below.

Could you elaborate a bit more?

Are you saying that with a Journal on a SSD writes from clients,
before they can return from the operation to the client, must end up
on both the SSD (Journal) *and* HDD (actual data store behind that
journal)?

No, your initial statement is correct.

However that burst of speed doesn't last indefinitely.

Aside from the size of the journal (which is incidentally NOT the most
limiting factor) there are various "filestore" parameters in Ceph, in
particular the sync interval ones.
There was a more in-depth explanation by a developer about this on
this ML, try your google-foo.
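
For reference, the settings meant here are along these lines (the sync
interval values below are the stock defaults of this era, and the journal
size is just an example, so double-check against your release):

    [osd]
    # how often filestore flushes journaled writes out to the backing HDD
    filestore min sync interval = 0.01   # seconds (default)
    filestore max sync interval = 5      # seconds (default)
    # journal partition size when not using a whole device (example value)
    osd journal size = 10240             # MB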

For short bursts of activity, the journal helps a LOT.
If you send a sustained stream of small (e.g. 4KB) writes to your
cluster, the speed will eventually (after a few seconds) drop to what
your backing storage (HDDs) can sustain.
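
To put rough numbers on how long such a burst can last (a back-of-envelope
sketch, all figures made up for illustration; in practice the max sync
interval above usually cuts the burst off sooner than the journal size does):

    # how long a journal absorbs writes arriving faster than the HDD drains them
    journal_size_mb = 10240.0  # 10GB journal partition (example)
    client_rate_mbs = 400.0    # incoming write rate the SSD journal absorbs
    hdd_drain_mbs   = 80.0     # what the backing HDD sustains for this workload
    burst_seconds = journal_size_mb / (client_rate_mbs - hdd_drain_mbs)
    print(round(burst_seconds, 1))  # 32.0 -> then throughput falls back to HDD speed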

(Which SSDs do you plan to use anyway?)

Intel DC S3700

Good choice; with the 200GB model, prefer the 3700 over the 3710
(higher sequential write speed).

Hi All,

I am looking at using PCI-E SSDs as journals in our (4) Ceph OSD nodes,
each of which has six 4TB SATA drives. I had my eye on these:

400GB Intel P3500 DC AIC SSD, HHHL PCIe 3.0

but reading through this thread, it might be better to go with the P3700
given the improved IOPS. So, a couple of questions.

* Are the PCI-E versions of these drives different in any other way than
the
interface?

Yes and no. Internally they are probably not much different, but the
NVMe/PCIe interface is a lot faster than SATA/SAS, both in terms of
minimum latency and bandwidth.


* Would one of these as a journal for six 4TB OSDs be overkill
(connectivity is 10GE, or will be shortly anyway), or would the SATA
S3700 be sufficient?

Again, it depends on your use case. The S3700 may suffer if you are doing
large sequential writes; it might not have a high enough sequential write
speed and will become the bottleneck. Six disks could potentially take
around 500-700MB/s of writes. A P3700 will have enough headroom and will
give slightly lower write latency as well, if this is important. You may
even be able to run more than six OSDs on it if needed.
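
If in doubt, it's worth measuring the candidate drive's O_DSYNC write
speed directly before buying more of them, for example (destructive, so
point it at a scratch device; the device name here is just a placeholder):

    # sequential direct+dsync writes, roughly the pattern an OSD journal produces
    dd if=/dev/zero of=/dev/nvme0n1 bs=4M count=1000 oflag=direct,dsync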


Given they're not hot-swappable, it'd be good if they didn't wear out in
6 months too.

They probably won't, unless you are doing some really extreme write
workloads, and even then I would imagine they would last 1-2 years.
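
If you want to put a number on it, a rough wear estimate looks like this
(the endurance figure is my recollection of the 400GB P3700 datasheet, so
verify it for the exact model you buy):

    # rough journal lifetime estimate from rated endurance
    rated_endurance_tb = 7300.0  # ~7.3 PB written rating for a 400GB P3700 (verify)
    daily_writes_tb    = 5.0     # TB/day of journal writes (example workload)
    print(round(rated_endurance_tb / daily_writes_tb / 365.0, 1))  # ~4.0 years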


I realise I've not given you much to go on and I'm Googling around as
well; I'm really just asking in case someone has tried this already and
has some feedback or advice.

That's ok. I'm currently running 100GB S3700s on the current cluster, and
the new cluster that's in the planning stages will be using the 400GB
P3700s.


Thanks! :)

Stephen

--
Stephen Harker
Chief Technology Officer
The Positive Internet Company.


--
All postal correspondence to:
The Positive Internet Company, 24 Ganton Street, London. W1F 7QY

*Follow us on Twitter* @posipeople

The Positive Internet Company Limited is registered in England and Wales.
Registered company number: 3673639. VAT no: 726 7072 28.
Registered office: Northside House, Mount Pleasant, Barnet, Herts, EN4 9EE.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


