Hello,

On Wed, 17 Feb 2016 07:00:38 -0600 Mark Nelson wrote:

> On 02/17/2016 06:36 AM, Christian Balzer wrote:
> >
> > Hello,
> >
> > On Wed, 17 Feb 2016 09:23:11 -0000 Nick Fisk wrote:
> >
> >>> -----Original Message-----
> >>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
> >>> Of Christian Balzer
> >>> Sent: 17 February 2016 04:22
> >>> To: ceph-users@xxxxxxxxxxxxxx
> >>> Cc: Piotr Wachowicz <piotr.wachowicz@xxxxxxxxxxxxxxxxxxx>
> >>> Subject: Re: SSDs for journals vs SSDs for a cache tier, which is
> >>> better?
> >>>
> > [snip]
> >>>> I'm sure both approaches have their own merits, and might be better
> >>>> for some specific tasks, but with all other things being equal, I
> >>>> would expect that using SSDs as the "Writeback" cache tier should,
> >>>> on average, provide better performance than using the same SSDs for
> >>>> journals.
> >>>> Specifically in the area of read throughput/latency.
> >>>>
> >>> Cache tiers (currently) only work well if all your hot data fits into
> >>> them. In which case you'd be even better off with a dedicated SSD
> >>> pool for that data.
> >>>
> >>> Because (currently) Ceph has to promote a full object (4MB by
> >>> default) to the cache for each operation, be it read or write.
> >>> That means the first time you want to read a 2KB file in your
> >>> RBD-backed VM, Ceph has to copy 4MB from the HDD pool to the SSD
> >>> cache tier.
> >>> This of course has a significant impact on read performance; in my
> >>> crappy test cluster, reading cold data is half as fast as using the
> >>> actual non-cached HDD pool.
> >>>
> >>
> >> Just an FYI, there will most likely be several fixes/improvements
> >> going into Jewel which will address most of these problems with
> >> caching. Objects will now only be promoted if they are hit several
> >> times (configurable) and, if it makes it in time, there will be a
> >> promotion throttle to stop too many promotions from hindering
> >> cluster performance.
> >>
> > Ah, both of these would be very nice indeed, especially since the first
> > one is something that's supposedly already present (but broken).
> >
> > The 2nd one, if done right, will probably be a game changer.
> > Robert LeBlanc and I will be most pleased.
>
> The branch is wip-promote-throttle and we need testing from more people
> besides me to make sure it's the right path forward <hint hint>. :)
>
Well, supposedly I'll be getting some real testing/staging HW in the
future, which means I could use the current test cluster for really
experimental stuff.
Until then I need to keep it as the place to test scary procedures and to
be the first to test upgrades for the production clusters, unfortunately.
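
When I do get around to it, I assume switching between the throttle
levels you list further down is just the usual runtime injection on a
test cluster. A rough, untested sketch, assuming the new options take the
usual underscore form for injectargs and using your "VL" values:

    # Apply the "VL" throttle profile to all OSDs at runtime:
    ceph tell osd.* injectargs \
        '--osd_tier_promote_max_objects_sec 25 --osd_tier_promote_max_bytes_sec 4194304'

    # Or persist it in ceph.conf under [osd] and restart the OSDs:
    # [osd]
    # osd tier promote max objects sec = 25
    # osd tier promote max bytes sec = 4194304

Please correct me if the option names or the way to set them differ in
the branch.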

> I'm including a link to the results we've gotten so far here.
> There's still a degenerate case in small random mixed workloads, but
> initial testing seems to indicate that the promotion throttling is
> helping in many other cases, especially at *very* low promotion rates.
> Small random read and write performance, for example, improves
> dramatically. Highly skewed zipf distribution writes are also much
> improved (except for large writes).
>
> https://drive.google.com/open?id=0B2gTBZrkrnpZUFV4OC1UaGVlTm8
>
That looks very interesting and promising indeed.
Thanks for that link and the ongoing effort.

Christian

> Note: You will likely need to download the document and open it in
> OpenOffice to see the graphs.
>
> In the graphs I have different series labeled as VH, H, M, L, VL, 0,
> etc. The throttle rates that correspond to those are:
>
> # VH (i.e., let everything through)
> # osd tier promote max objects sec = 20000
> # osd tier promote max bytes sec = 1610612736
>
> # H (Almost allow the cache tier to be saturated with writes)
> # osd tier promote max objects sec = 2000
> # osd tier promote max bytes sec = 268435456
>
> # M (Allow about 20% writes into the cache tier)
> # osd tier promote max objects sec = 500
> # osd tier promote max bytes sec = 67108864
>
> # L (Allow about 5% writes into the cache tier)
> # osd tier promote max objects sec = 125
> # osd tier promote max bytes sec = 16777216
>
> # VL (Only allow 4MB/sec to be promoted into the cache tier)
> # osd tier promote max objects sec = 25
> # osd tier promote max bytes sec = 4194304
>
> # 0 (Technically not zero, something like 1/1000 still allowed through)
> # osd tier promote max objects sec = 0
> # osd tier promote max bytes sec = 0
>
> Mark
>
> >> However in the context of this thread, Christian is correct, SSD
> >> journals first and then caching if needed.
> >>
> > Yeah, thus my overuse of "currently". ^o^
> >
> > Christian
> >
> >>> And once your cache pool has to evict objects because it is getting
> >>> full, it has to write out 4MB for each such object to the HDD pool.
> >>> Then read it back in later, etc.
> >>>
> >>>> The main difference, I suspect, between the two approaches is that
> >>>> in the case of multiple HDDs (multiple ceph-osd processes), all of
> >>>> those processes share access to the same shared SSD storing their
> >>>> journals, whereas it's likely not the case with cache tiering,
> >>>> right? Though I must say I failed to find any detailed info on
> >>>> this. Any clarification will be appreciated.
> >>>>
> >>> In your specific case writes to the OSDs (HDDs) will be (at least)
> >>> 50% slower if your journals are on disk instead of the SSD.
> >>> (Which SSDs do you plan to use anyway?)
> >>> I don't think you'll be happy with the resulting performance.
> >>>
> >>> Christian.
> >>>
> >>>> So, is the above correct, or am I missing some pieces here? Any
> >>>> other major differences between the two approaches?
> >>>>
> >>>> Thanks.
> >>>> P.
> >>>
> >>> --
> >>> Christian Balzer        Network/Systems Engineer
> >>> chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> >>> http://www.gol.com/
> >>
> >
>
-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com