Re: Cache Tier or any other possibility to accelerate RBD with SSD?

Hello,

On Mon, 3 Jul 2017 14:18:27 +0200 Mateusz Skała wrote:

> @Christian, thanks for the quick answer, please look below.
> 
> > -----Original Message-----
> > From: Christian Balzer [mailto:chibi@xxxxxxx]
> > Sent: Monday, July 3, 2017 1:39 PM
> > To: ceph-users@xxxxxxxxxxxxxx
> > Cc: Mateusz Skała <mateusz.skala@xxxxxxxxxxx>
> > Subject: Re:  Cache Tier or any other possibility to accelerate
> > RBD with SSD?
> > 
> > 
> > Hello,
> > 
> > On Mon, 3 Jul 2017 13:01:06 +0200 Mateusz Skała wrote:
> >   
> > > Hello,
> > >
> > > We are using cache-tier in Read-forward mode (replica 3) to
> > > accelerate reads, and journals on SSD to accelerate writes.
> > 
> > OK, lots of things wrong with this statement, but firstly, Ceph version (it is
> > relevant) and more details about your setup and SSDs used would be
> > interesting and helpful.
> >   
> 
> Sorry about this. Ceph version 0.92.1, and we plan to upgrade to 10.2.0 shortly.

I'd never run (in production) one of the short-term support versions like
Kraken; you're not getting ANY bug fixes there at all.

But I guess this means that the dire warning when creating readforward
cache pools was only added in Jewel.
The problem with that mode is of course present in all other versions
that have it.

> About the configuration:
> 4 nodes, each node with:
> -  4x HDD WD Re 2TB WD2004FBYZ, 
> -  2x SSD Intel S3610 200GB (one for journal and system with mon, second for cache-tier).
> 
> It gives 32TB RAW HDD space and only 600GB RAW SSD space, and I think the problem is the small size of the cache.
> 
Don't "think" if you can quantify. Between iostat and the Ceph perf
counters you can determine how much data goes in and out of your cluster
and OSDs and how much you'd need to get through a typical day with I/O
mostly on your cache tier.
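
Something along these lines gives you the raw numbers to work with (a
rough sketch only; it assumes the OSD admin sockets are in their default
location and that osd.0 is local to the node you run it on):

  # per-device utilization, read/write throughput and latency
  iostat -x 5

  # per-OSD counters (op_r, op_w, op_in_bytes, op_out_bytes, ...)
  ceph daemon osd.0 perf dump

  # per-pool client I/O rates, compare the 'ssd' and 'sata' pools
  ceph osd pool stats

Graph those over a typical day or two and you'll know whether your hot
working set actually fits into the cache-tier.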

That said, 200GB effective cache space is likely to be a bottleneck, yes.

> > If you had searched the ML archives for readforward you'd come across a
> > very recent thread by me, in which the powers that be state that this mode is
> > dangerous and not recommended.
> > During quite some testing with this mode I never encountered any problems,
> > but consider yourself warned.
> > 
> > Now readforward will FORWARD reads to the backing storage, so it will
> > NEVER accelerate reads (promote them to the cache-tier).
> > The only speedup you will see is for objects that have been previously
> > written and are still in the cache-tier.
> >   
> 
> Ceph osd pool ls detail
> pool 4 'ssd' replicated size 3 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 128 pgp_num 128 last_change 88643 flags hashpspool,incomplete_clones tier_of 5 cache_mode readforward target_bytes 176093659136 hit_set bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 120s x6 min_read_recency_for_promote 1 min_write_recency_for_promote 1 stripe_width 0
>         removed_snaps [1~14d,150~27,178~8,183~8,18c~12,1a0~22,1c4~4,1c9~1b]
> pool 5 'sata' replicated size 3 min_size 1 crush_ruleset 2 object_hash rjenkins pg_num 512 pgp_num 512 last_change 88643 lfor 66807 flags hashpspool tiers 4 read_tier 4 write_tier 4 stripe_width 0
>         removed_snaps [1~14d,150~27,178~8,183~8,18c~12,1a0~22,1c4~4,1c9~1b]
> 
> The setup is over 1 year old. In ceph status I see flushing, promote and evict operations. Maybe it depends on my old version? 
>  
Nothing to do with your version as far as vulnerability to the problem is
concerned.

And you see all the flushing etc. because WRITES are going through your
cache-tier of course, as I stated above. 
However, if your goal is to cache reads, this is the wrong mode and in
general probably a bad fit for a small cache-tier.
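
If, after measuring, you do end up going down the writeback road, the
knobs involved look roughly like this (pool name taken from your pool
listing above; the values are placeholders and need to come from your
own numbers, so treat this as a sketch, not a recommendation):

  # switch the cache mode from readforward to writeback
  ceph osd tier cache-mode ssd writeback

  # promotion: only promote objects seen in recent hit sets
  ceph osd pool set ssd hit_set_count 6
  ceph osd pool set ssd hit_set_period 120
  ceph osd pool set ssd min_read_recency_for_promote 2

  # retention: cap the tier and flush/evict before it fills up
  ceph osd pool set ssd target_max_bytes 176093659136
  ceph osd pool set ssd cache_target_dirty_ratio 0.4
  ceph osd pool set ssd cache_target_full_ratio 0.8

Whether writeback makes sense at all with ~200GB of usable cache space
is exactly what the measuring above should tell you first.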


Christian

> > Using cache-tiers can work beautifully if you understand the I/O patterns
> > involved (tricky on a cloud storage with very mixed clients), can make your
> > cache-tier large enough to cover the hot objects (working set) or at least (as
> > you are attempting) to segregate the read and write paths as much as
> > possible.
> >   
> Have you got any good method to analyze the workload? 
> I found this script https://github.com/cernceph/ceph-scripts and tried to see reads and writes per length, but how do I know whether the I/O is random or sequential?
> 
> > > We are using only RBD. Based
> > > on the ceph-docs, RBD has a bad I/O pattern for cache tier.  I'm
> > > looking for information about other possibilities to accelerate reads on
> > > RBD with SSD drives.
> > >  
> > The documentation rightly warns about things, so people don't have
> > unrealistic expectations. However YOU need to look at YOUR loads, patterns
> > and usage and then decide if it is beneficial or not.
> > 
> > As I hinted above, analyze your systems, are the reads actually slow or are
> > they slowed down by competing writes to the same storage?
> > 
> > Cold reads (OSD server just rebooted, no cache has that object in it) will
> > obviously not benefit from any scheme.
> > 
> > Reads from the HDD OSDs can very much benefit by having enough RAM to
> > hold all the SLAB objects (direntry etc) in memory, so you can avoid disk
> > access to actually find the object.
> > 
> > For speeding up the actual data reads you have the option of the cache-tier (in
> > writeback mode, with proper promotion and retention configuration).
> > 
> > Or something like bcache on the OSD servers, discussed here several times
> > as well.
> >   
> > > The second question: is there any cache tier mode where the replica can be
> > > set to 1, for best use of SSD space?
> > >  
> > A cache-tier (the same is true for any other real cache method) will at any
> > given time have objects in it that are NOT on the actual backing storage when
> > it is used to cache writes.
> > So it needs to be just as redundant as the rest of the system, at least a replica
> > of 2 with sufficiently small/fast SSDs.
> >   
> 
> OK, I understand.
> 
> > With bcache etc. just caching reads, you can get away with a single replica
> > of course; however, failing SSDs may then cause your cluster to melt down.
> >   
> 
> I will search ML for this.
> 
> > Christian
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx   	Rakuten Communications  
> 
> 
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Rakuten Communications
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



