Re: SSDs for journals vs SSDs for a cache tier, which is better?

Hello,

On Wed, 17 Feb 2016 07:00:38 -0600 Mark Nelson wrote:

> On 02/17/2016 06:36 AM, Christian Balzer wrote:
> >
> > Hello,
> >
> > On Wed, 17 Feb 2016 09:23:11 -0000 Nick Fisk wrote:
> >
> >>> -----Original Message-----
> >>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
> >>> Of Christian Balzer
> >>> Sent: 17 February 2016 04:22
> >>> To: ceph-users@xxxxxxxxxxxxxx
> >>> Cc: Piotr Wachowicz <piotr.wachowicz@xxxxxxxxxxxxxxxxxxx>
> >>> Subject: Re:  SSDs for journals vs SSDs for a cache tier, which is
> >>> better?
> >>>
> > [snip]
> >>>> I'm sure both approaches have their own merits, and might be better
> >>>> for some specific tasks, but with all other things being equal, I
> >>>> would expect that using SSDs as the "Writeback" cache tier should,
> >>>> on average, provide better performance than using the same SSDs for
> >>>> Journals.
> >>>> Specifically in the area of read throughput/latency.
> >>>>
> >>> Cache tiers (currently) only work well if all your hot data fits
> >>> into them.
> >>> In which case you'd be even better off with a dedicated SSD pool for
> >>> that data.
> >>>
> >>> Because (currently) Ceph has to promote a full object (4MB by
> >>> default) to the cache for each operation, be it read or write.
> >>> That means the first time you want to read a 2KB file in your
> >>> RBD-backed VM, Ceph has to copy 4MB from the HDD pool to the SSD
> >>> cache tier.
> >>> This of course has a significant impact on read performance; in my
> >>> crappy test cluster, reading cold data is half as fast as using the
> >>> actual non-cached HDD pool.
> >>>
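
For context, since this trips people up: a writeback cache tier is
attached roughly as follows ("cold-hdd" and "hot-ssd" are just
placeholder pool names):

    # put an SSD pool in front of an HDD pool as a writeback cache
    ceph osd tier add cold-hdd hot-ssd
    ceph osd tier cache-mode hot-ssd writeback
    ceph osd tier set-overlay cold-hdd hot-ssd

Once the overlay is set, every client op that misses the cache triggers
the full-object promotion described above.
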
> >>
> >> Just an FYI, there will most likely be several fixes/improvements
> >> going into Jewel which will address most of these problems with
> >> caching. Objects will now only be promoted if they are hit several
> >> times (configurable) and, if it makes it in time, there will also be
> >> a promotion throttle to stop excessive promotions from hindering
> >> cluster performance.
> >>
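
If the "several hits before promotion" logic lands as hoped, I'd expect
it to be steered by the existing hit set / recency knobs, something
like the following (pool name and values purely illustrative):

    ceph osd pool set hot-ssd hit_set_type bloom
    ceph osd pool set hot-ssd hit_set_count 4
    ceph osd pool set hot-ssd hit_set_period 1200
    # only promote objects seen in enough recent hit sets
    ceph osd pool set hot-ssd min_read_recency_for_promote 2

But don't hold me to the exact semantics until it is merged.
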
> > Ah, both of these would be very nice indeed, especially since the first
> > one is something that's supposedly already present (but broken).
> >
> > The 2nd one, if done right, will probably be a game changer.
> > Robert LeBlanc and I will be most pleased.
> 
> The branch is wip-promote-throttle and we need testing from more people 
> besides me to make sure it's the right path forward <hint hint>. :)
>
Well, supposedly I'll be getting some real testing/staging HW in the
future, which means I could use the current test cluster for really
experimental stuff. 
Until then I unfortunately need to keep it as the place to test scary
procedures and to be the first to test upgrades for the production
clusters.
 
> I'm including a link to the results we've gotten so far here. 
> There's still a degenerate case in small random mixed workloads, but 
> initial testing seems to indicate that the promotion throttling is 
> helping in many other cases, especially at *very* low promotion rates. 
> Small random read and write performance, for example, improves 
> dramatically.  Highly skewed zipf distribution writes are also much 
> improved (except for large writes).
> 
> https://drive.google.com/open?id=0B2gTBZrkrnpZUFV4OC1UaGVlTm8
> 

That looks very interesting and promising indeed. 
Thanks for that link and the ongoing effort. 

Christian

> Note: You will likely need to download the document and open it in 
> OpenOffice to see the graphs.
> 
> In the graphs I have different series labeled as VH, H, M, L, VL, 0, 
> etc.  The throttle rates that correspond to those are:
> 
> # VH (i.e., let everything through)
> #        osd tier promote max objects sec = 20000
> #        osd tier promote max bytes sec = 1610612736
> 
> # H (Almost allow the cache tier to be saturated with writes)
> #        osd tier promote max objects sec = 2000
> #        osd tier promote max bytes sec = 268435456
> 
> # M (Allow about 20% of writes into the cache tier)
> #        osd tier promote max objects sec = 500
> #        osd tier promote max bytes sec = 67108864
> 
> # L (Allow about 5% of writes into the cache tier)
> #        osd tier promote max objects sec = 125
> #        osd tier promote max bytes sec = 16777216
> 
> # VL (Only allow 4MB/sec to be promoted into the cache tier)
> #        osd tier promote max objects sec = 25
> #        osd tier promote max bytes sec = 4194304
> 
> # 0 (Technically not zero, something like 1/1000 still allowed through)
> #        osd tier promote max objects sec = 0
> #        osd tier promote max bytes sec = 0
> 
> Mark
> 
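
Assuming the branch exposes these as ordinary OSD options, the throttle
should be tunable at runtime, e.g. for the "L" profile above:

    ceph tell osd.* injectargs '--osd_tier_promote_max_objects_sec 125'
    ceph tell osd.* injectargs '--osd_tier_promote_max_bytes_sec 16777216'

or set permanently in the [osd] section of ceph.conf.
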
> >
> >> However, in the context of this thread, Christian is correct: SSD
> >> journals first, and then caching if needed.
> >>
> > Yeah, thus my overuse of "currently". ^o^
> >
> > Christian
> >>
> >>> And once your cache pool has to evict objects because it is getting
> >>> full, it has to write out 4MB for each such object to the HDD pool.
> >>> Then read it back in later, etc.
> >>>
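
How aggressively that flushing and eviction kicks in is governed by the
usual cache tier thresholds, for example on a placeholder "hot-ssd"
pool (values purely illustrative):

    # cap the cache pool and start flushing/evicting before it is full
    ceph osd pool set hot-ssd target_max_bytes 1099511627776
    ceph osd pool set hot-ssd cache_target_dirty_ratio 0.4
    ceph osd pool set hot-ssd cache_target_full_ratio 0.8

Every object flushed or evicted pays that full 4MB round trip again.
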
> >>>> The main difference, I suspect, between the two approaches is that
> >>>> in the case of multiple HDDs (multiple ceph-osd processes), all of
> >>>> those processes share access to the same SSD storing their
> >>>> journals, whereas that's likely not the case with cache tiering,
> >>>> right? Though I must say I failed to find any detailed info on
> >>>> this. Any clarification will be appreciated.
> >>>>
> >>> In your specific case, writes to the OSDs (HDDs) will be (at least)
> >>> 50% slower if your journals are on disk instead of the SSD.
> >>> (Which SSDs do you plan to use anyway?)
> >>> I don't think you'll be happy with the resulting performance.
> >>>
> >>> Christian.
> >>>
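
To make the journal-on-SSD variant concrete: with the current ceph-disk
tooling a common layout is one journal partition per HDD OSD, all
carved out of a shared SSD (device names below are placeholders):

    # data on the HDDs, journals on partitions of the shared SSD
    ceph-disk prepare /dev/sdb /dev/sdf
    ceph-disk prepare /dev/sdc /dev/sdf
    ceph-disk prepare /dev/sdd /dev/sdf

The journal partition size comes from "osd journal size" (in MB) in the
[osd] section of ceph.conf. Just make sure the SSD can sustain the
combined write stream of all the HDDs behind it.
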
> >>>> So, is the above correct, or am I missing some pieces here? Any
> >>>> other major differences between the two approaches?
> >>>>
> >>>> Thanks.
> >>>> P.
> >>>
> >>>
> >>> --
> >>> Christian Balzer        Network/Systems Engineer
> >>> chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
> >>> http://www.gol.com/
> >>
> >>
> >
> >
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


