Re: SSDs for journals vs SSDs for a cache tier, which is better?

Hello,

On Wed, 16 Mar 2016 16:22:06 +0000 Stephen Harker wrote:

> On 2016-02-17 11:07, Christian Balzer wrote:
> > 
> > On Wed, 17 Feb 2016 10:04:11 +0100 Piotr Wachowicz wrote:
> > 
> >> > > Let's consider both cases:
> >> > > Journals on SSDs - for writes, the write operation returns right
> >> > > after data lands on the Journal's SSDs, but before it's written to
> >> > > the backing HDD. So, for writes, the SSD journal approach should be
> >> > > comparable to having an SSD cache tier.
> >> > Not quite, see below.
> >> >
> >> >
> >> Could you elaborate a bit more?
> >> 
> >> Are you saying that with a Journal on an SSD, writes from clients,
> >> before they can return from the operation to the client, must end up
> >> on both the SSD (Journal) *and* HDD (actual data store behind that
> >> journal)?
> > 
> > No, your initial statement is correct.
> > 
> > However that burst of speed doesn't last indefinitely.
> > 
> > Aside from the size of the journal (which is incidentally NOT the most
> > limiting factor) there are various "filestore" parameters in Ceph, in
> > particular the sync interval ones.
> > There was a more in-depth explanation by a developer about this on this
> > ML; try your google-fu.
> > 
> > For short bursts of activity, the journal helps a LOT.
> > If you send a huge number of, for example, 4KB writes to your cluster,
> > the speed will eventually (after a few seconds) drop to what your
> > backing storage (HDDs) is capable of sustaining.
> > 
> >> > (Which SSDs do you plan to use anyway?)
> >> >
> >> 
> >> Intel DC S3700
> >> 
> > Good choice; with the 200GB model, prefer the S3700 over the S3710
> > (higher sequential write speed).
> 
> Hi All,
> 
> I am looking at using PCI-E SSDs as journals in our (4) Ceph OSD nodes, 
> each of which has 6 4TB SATA drives within. I had my eye on these:
> 
> 400GB Intel P3500 DC AIC SSD, HHHL PCIe 3.0
> 
> but reading through this thread, it might be better to go with the P3700 
> given the improved IOPS. So, a couple of questions.
> 
The 3700s will also last significantly longer than the 3500s.
IOPS (of the device) are mostly irrelevant; sequential write speed is
where it's at.
In the same vein, remember that journals are never, ever read from unless
there was a crash.
 
> * Are the PCI-E versions of these drives different in any other way than 
> the interface?
> 
> * Would one of these as a journal for 6 4TB OSDs be overkill 
> (connectivity is 10GE, or will be shortly anyway), or would the SATA 
> S3700 be sufficient?
> 
Overkill, but not insanely so.

From my (not insignificant) experience you want to match your journal(s)
first to your network speed and then to the devices behind them.

A SATA HDD can indeed write about 180MB/s sequentially, but that's firmly
in the land of theory when it comes to Ceph.

Ceph/RBD writes are 4MB objects at the largest; they are spread out all
over the cluster and of course most likely interspersed with competing
(seeking) reads and other writes to the same OSD.
That is before all the IO, and thus seeks, needed for file system
operations, LevelDB updates, etc.
I thus spec my journals to 100MB/s of write speed per SATA-based HDD, and
that's already generous.
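
To put rough numbers on that rule of thumb (purely illustrative, assuming
the 100MB/s per-HDD figure above and the 10GbE network from the original
question):
---
6 HDDs x 100 MB/s =  600 MB/s  (sustained rate the backing disks can absorb)
10GbE             ~ 1250 MB/s  (theoretical ceiling for incoming client writes)
---
Anything at or above the HDD figure keeps the disks fed for sustained
writes; matching the network figure is what lets you absorb short bursts
at full speed.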

Concrete case in point: a 4-node cluster, 4 DC S3700 100GB SSDs with 2
journals each, 8 7.2k 3TB SATA HDDs, Infiniband network.
That cluster is very lightly loaded.

Doing this fio from a client VM:
---
fio --size=6G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4M --iodepth=32
---
and watching all 4 nodes simultaneously with atop shows us that the HDDs
are pushed up to around 80% utilization while writing only about 50MB/s.
The journal SSDs (which can handle 200MB/s writes) are consequently
semi-bored at about 45% utilization writing around 95MB/s.
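
If you don't have atop on the OSD nodes, something like this (assuming
sysstat is installed) gives a similar per-device picture; watch the %util
and wMB/s columns for the journal SSDs versus the OSD HDDs:
---
iostat -x -m 2
---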

As others mentioned, the P series will give you significantly lower
latencies if that's important for your use case (small writes whose sum
does not exceed the abilities of your backing storage and CPUs).

Also, a lot of this depends on your actual HW (cases): how many hot-swap
bays you have, how many free PCIe slots, etc.
With entirely new HW you could go for something that has 1-2 NVMe hot-swap
bays and get the best of both worlds.

Summing things up, the 400GB P3700 matches your network speed and thus can
deal with short bursts at full speed.
However, it is overkill for your 6 HDDs, especially once they get busy
(backfilling, or tests like the one above).
I'd be surprised to see them handle more than 400MB/s of writes combined.

If you're trying to economize, a single 200GB DC S3700 or 2 100GB ones
(smaller failure domains) should do the trick, too.
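
On the size side (which, as noted above, is not the limiting factor), the
rule of thumb from the Ceph docs is journal size = 2 x expected throughput
x filestore max sync interval. A minimal ceph.conf sketch, assuming
~200MB/s per journal and the default 5s sync interval:
---
[osd]
# 2 * 200 MB/s * 5 s = 2000 MB (the value is in MB)
osd journal size = 2000
filestore max sync interval = 5
---
So even the 100GB S3700 has far more space than the journal will ever
use; what the larger models buy you is sequential write speed and
endurance.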

> Given they're not hot-swappable, it'd be good if they didn't wear out in 
> 6 months too.
> 
See above.
I haven't been able to make more than a 1% dent in the media wearout of
200GB DC S3700s that receive a constant write stream of 3MB/s over 500
days of operation.
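
If you want to keep an eye on the wear yourself, it is exposed via SMART
(assuming smartmontools is installed; on the Intel DC drives the attribute
is called Media_Wearout_Indicator):
---
smartctl -A /dev/sdX | grep -i wearout
---
The normalized value starts at 100 and slowly counts down as the NAND
wears.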
 
Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/


