Re: SSDs for journals vs SSDs for a cache tier, which is better?

On 02/17/2016 06:36 AM, Christian Balzer wrote:

Hello,

On Wed, 17 Feb 2016 09:23:11 -0000 Nick Fisk wrote:

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
Of Christian Balzer
Sent: 17 February 2016 04:22
To: ceph-users@xxxxxxxxxxxxxx
Cc: Piotr Wachowicz <piotr.wachowicz@xxxxxxxxxxxxxxxxxxx>
Subject: Re:  SSDs for journals vs SSDs for a cache tier,
which is
better?

[snip]
I'm sure both approaches have their own merits, and might be better
for some specific tasks, but with all other things being equal, I
would expect that using SSDs as the "Writeback" cache tier should, on
average, provide better performance than using the same SSDs for
Journals.
Specifically in the area of read throughput/latency.

Cache tiers (currently) only work well if all your hot data fits into
them.
In which case you'd be even better off with a dedicated SSD pool for that
data.

Because (currently) Ceph has to promote a full object (4MB by default)
to the cache for each operation, be it read or write.
That means the first time you want to read a 2KB file in your RBD backed
VM, Ceph has to copy 4MB from the HDD pool to the SSD cache tier.
This of course has a significant impact on read performance; in my
crappy test cluster, reading cold data is half as fast as using the
actual non-cached HDD pool.
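
(For readers who haven't set one up: the promotion behaviour described above
applies to a writeback tier attached roughly as below. This is only a minimal
sketch; the pool names hdd-pool/ssd-cache are placeholders and the hit set
numbers are arbitrary.)

# attach an SSD pool as a writeback cache tier in front of the HDD pool
ceph osd tier add hdd-pool ssd-cache
ceph osd tier cache-mode ssd-cache writeback
ceph osd tier set-overlay hdd-pool ssd-cache
# hit sets are what let the OSDs track object "hotness" at all
ceph osd pool set ssd-cache hit_set_type bloom
ceph osd pool set ssd-cache hit_set_count 4
ceph osd pool set ssd-cache hit_set_period 1200
# cap the tier so flushing/eviction starts before the SSDs fill up
ceph osd pool set ssd-cache target_max_bytes 200000000000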


Just an FYI, there will most likely be several fixes/improvements going
into Jewel which will address most of these problems with caching.
Objects will now only be promoted if they are hit several times
(configurable) and, if it makes it in time, there will be a promotion
throttle to stop excessive promotions from hindering cluster performance.
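
(If that lands as described, the "hit several times" part should be tunable
per pool via the recency settings, something along these lines; the write-side
option is the Jewel-era addition, so treat the names as tentative until the
release notes confirm them.)

# require an object to show up in N recent hit sets before it is promoted
ceph osd pool set ssd-cache min_read_recency_for_promote 2
ceph osd pool set ssd-cache min_write_recency_for_promote 2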

Ah, both of these would be very nice indeed, especially since the first
one is something that's supposedly already present (but broken).

The second one, if done right, will probably be a game changer.
Robert LeBlanc and I will be most pleased.

The branch is wip-promote-throttle and we need testing from more people besides me to make sure it's the right path forward <hint hint>. :)

I'm including a link to the results we've gotten so far. There's still a degenerate case in small random mixed workloads, but initial testing seems to indicate that the promotion throttling is helping in many other cases, especially at *very* low promotion rates. Small random read and write performance, for example, improves dramatically. Highly skewed zipf distribution writes are also much improved (except for large writes).

https://drive.google.com/open?id=0B2gTBZrkrnpZUFV4OC1UaGVlTm8

Note: You will likely need to download the document and open it in OpenOffice to see the graphs.

In the graphs I have different series labeled as VH, H, M, L, VL, 0, etc. The throttle rates that correspond to those are:

# VH (i.e., let everything through)
#        osd tier promote max objects sec = 20000
#        osd tier promote max bytes sec = 1610612736

# H (Almost allow the cache tier to be saturated with writes)
#        osd tier promote max objects sec = 2000
#        osd tier promote max bytes sec = 268435456

# M (Allow about 20% writes into the cache tier)
#        osd tier promote max objects sec = 500
#        osd tier promote max bytes sec = 67108864

# L (Allow about 5% writes into the cache tier)
#        osd tier promote max objects sec = 125
#        osd tier promote max bytes sec = 16777216

# VL (Only allow 4MB/sec to be promoted into the cache tier)
#        osd tier promote max objects sec = 25
#        osd tier promote max bytes sec = 4194304

# 0 (Technically not zero, something like 1/1000 still allowed through)
#        osd tier promote max objects sec = 0
#        osd tier promote max bytes sec = 0
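
These are plain OSD options, so assuming the names above survive into the
final merge unchanged, they can go into the [osd] section of ceph.conf or be
injected into a running cluster for testing. For example, the "VL" profile:

# ceph.conf
[osd]
        osd tier promote max objects sec = 25
        osd tier promote max bytes sec = 4194304

# or at runtime, without restarting the OSDs
ceph tell osd.* injectargs '--osd_tier_promote_max_objects_sec 25 --osd_tier_promote_max_bytes_sec 4194304'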

Mark


However, in the context of this thread, Christian is correct: SSD journals
first, and then caching if needed.

Yeah, thus my overuse of "currently". ^o^

Christian

And once your cache pool has to evict objects because it is getting full,
it has to write out 4MB for each such object to the HDD pool.
Then read it back in later, etc.

The main difference, I suspect, between the two approaches is that in
the case of multiple HDDs (multiple ceph-osd processes), all of those
processes share access to the same shared SSD storing their journals.
Whereas that's likely not the case with cache tiering, right? Though I
must say I failed to find any detailed info on this. Any
clarification would be appreciated.

In your specific case writes to the OSDs (HDDs) will be (at least) 50%
slower if your journals are on disk instead of the SSD.
(Which SSDs do you plan to use anyway?)
I don't think you'll be happy with the resulting performance.

Christian.
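
(Back-of-the-envelope on the journal point above: with filestore every client
write hits the journal and then the data partition, so an HDD doing both sees
at best half its sequential bandwidth, before seek overhead. Assuming a disk
that can stream ~150 MB/s:

journal + data on the same HDD:  150 MB/s / 2  =  ~75 MB/s of client writes, minus seeks
journal on a separate SSD:       data writes get the full ~150 MB/s of the HDD

hence the "at least 50% slower" figure.)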

So, is the above correct, or am I missing some pieces here? Any other
major differences between the two approaches?

Thanks.
P.


--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



