Re: Thoughts about SSD journal size

Hello,

On Sun, 27 Mar 2016 18:44:41 +0200 (CEST) Daniel Delin wrote:

> Hi,
> 
> I have ordered three 240GB Samsung SM863 SSDs for my 3 OSD hosts, each
> with 4 OSDs, to improve write performance. 

Did you test these SSDs in advance?
While I'm pretty sure they are suitable for Ceph journals, I haven't seen
any sync write results for them, so if you did test them, or will when you
get them, by all means share those results with us.
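
The usual quick check is small O_DSYNC/O_SYNC writes straight against the
raw device, along these lines (just a sketch, the device path is a
placeholder and this overwrites whatever is on it):

    # dd based, crude but telling
    dd if=/dev/zero of=/dev/sdX bs=4k count=100000 oflag=direct,dsync

    # fio based, also gives you IOPS and latency numbers
    fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 \
        --rw=write --bs=4k --iodepth=1 --numjobs=1 --runtime=60 --time_based

A journal SSD that can't sustain these sync writes at a decent rate will
drag the whole OSD down, no matter what the datasheet says.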

> When looking at the docs,
> there is a formula for journal size (osd journal size = {2 * (expected
> throughput * filestore max sync interval)}) that I intend to use. If I
> understand this correctly it would in my case be (2*(4*100MB/s)*5
> seconds)=4GB journal size if I keep the default filestore max sync
> interval of 5 seconds. Since the SSDs are 240GB, I plan to use
> significantly larger journals of maybe 40GB, and with the above logic I
> would increase filestore max sync interval to 50 seconds. Is this the
> correct way of calculating?

In essence, yes.
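
In ceph.conf terms that would look roughly like the following (just a
sketch with your numbers; note that osd journal size is given in MB and
only matters when the journal is created):

    [osd]
    # 40GB journal per OSD
    osd journal size = 40960
    # allow up to 50 seconds between filestore syncs
    filestore max sync interval = 50

But read on before you actually do that.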

> Are there any downsides to having a long filestore max sync interval?
> 
In and of itself, not so much.

However, your goal here seems to be to avoid "wasting" lots of empty SSD
space by putting it to use for journaling, and a long max sync interval
won't give you that.

For starters, remember that Ceph journals are write-only in normal
operation; they only ever get read from after a crash.

Writes go to the journal(s) of all OSDs involved and are then ACK'ed to
the client; after that, filestore_min_sync_interval and the various
filestore_queue parameters determine when the data gets written (from RAM)
to the filestore.
Which is pretty damn instantly: the reasoning by the Ceph developers here
is to not let the OSD fall behind too much and then have it overwhelmed by
many competing operations.
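
You can see what your OSDs are actually running with via the admin socket,
for example (osd.0 being just an example ID, run this on the respective
OSD host):

    ceph daemon osd.0 config show | grep -E 'filestore_(min|max)_sync_interval|filestore_queue'

and change values at runtime for testing with something like:

    ceph tell osd.* injectargs '--filestore_min_sync_interval 0.5'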

On a cluster with filestore_min_sync_interval set to 0.5 (up from its
default of 0.01), I still don't see more than 40MB of journal utilization
at peak times (sequential writes at full cluster speed), though I didn't
modify the queue parameters.

The largest utilization I've ever seen (collectd/graphite are your
friends) is 100MB, in another cluster while doing backfills.
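
If you don't want to set up full graphing right away, the admin socket
gives you the raw numbers too; something along these lines (osd.0 again
just an example, and the exact counter names can differ a bit between
releases):

    ceph daemon osd.0 perf dump | grep -i journal

Feed those into collectd/graphite and you'll quickly see how little of a
40GB journal ever gets touched.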

I size my journals at 10-20GB, but that's basically because I have the space.

Since SSD write speeds are pretty much tied to their size (due to
internal parallelism), only moderately large ones give you the speed
needed to journal for several HDDs, which results in "wasted" capacity.
That's one of the reasons I tend to put the OS on the same SSDs as well,
in RAID1 or RAID10 form depending on the number of SSDs.
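
A layout sketch for one such SSD (sizes and the device name are purely
made up, adjust to your setup; the journal partitions would then be handed
to ceph-disk/ceph-deploy or referenced directly in ceph.conf):

    # OS partition, to become part of an md RAID1 across the SSDs
    sgdisk --new=1:0:+30G --typecode=1:fd00 --change-name=1:"OS raid member" /dev/sdX
    # two example journal partitions for two of the OSDs on this host
    sgdisk --new=2:0:+20G --change-name=2:"ceph journal osd.0" /dev/sdX
    sgdisk --new=3:0:+20G --change-name=3:"ceph journal osd.1" /dev/sdX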

Christian

> //Daniel
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


