Re: Cache Tier or any other possibility to accelerate RBD with SSD?

@Christian, thanks for the quick answer, please look below.

> -----Original Message-----
> From: Christian Balzer [mailto:chibi@xxxxxxx]
> Sent: Monday, July 3, 2017 1:39 PM
> To: ceph-users@xxxxxxxxxxxxxx
> Cc: Mateusz Skała <mateusz.skala@xxxxxxxxxxx>
> Subject: Re:  Cache Tier or any other possibility to accelerate
> RBD with SSD?
> 
> 
> Hello,
> 
> On Mon, 3 Jul 2017 13:01:06 +0200 Mateusz Skała wrote:
> 
> > Hello,
> >
> > We are using a cache tier in readforward mode (replica 3) to
> > accelerate reads, and journals on SSD to accelerate writes.
> 
> OK, lots of things wrong with this statement, but firstly, Ceph version (it is
> relevant) and more details about your setup and SSDs used would be
> interesting and helpful.
> 

Sorry about that. The Ceph version is 0.92.1, and we plan to upgrade to 10.2.0 shortly.
About the configuration:
4 nodes, each node with:
-  4x HDD WD Re 2TB (WD2004FBYZ),
-  2x SSD Intel S3610 200GB (one for journals and the system with a mon, the second for the cache tier).

That gives 32TB of raw HDD space and only 600GB of raw SSD space, and I think the small size of the cache is a problem.

> If you had searched the ML archives for readforward you'd come across a
> very recent thread by me, in which the powers that be state that this mode is
> dangerous and not recommended.
> During quite some testing with this mode I never encountered any problems,
> but consider yourself warned.
> 
> Now readforward will FORWARD reads to the backing storage, so it will
> NEVER accelerate reads (promote them to the cache-tier).
> The only speedup you will see is for objects that have been previously
> written and are still in the cache-tier.
> 

ceph osd pool ls detail
pool 4 'ssd' replicated size 3 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 128 pgp_num 128 last_change 88643 flags hashpspool,incomplete_clones tier_of 5 cache_mode readforward target_bytes 176093659136 hit_set bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 120s x6 min_read_recency_for_promote 1 min_write_recency_for_promote 1 stripe_width 0
        removed_snaps [1~14d,150~27,178~8,183~8,18c~12,1a0~22,1c4~4,1c9~1b]
pool 5 'sata' replicated size 3 min_size 1 crush_ruleset 2 object_hash rjenkins pg_num 512 pgp_num 512 last_change 88643 lfor 66807 flags hashpspool tiers 4 read_tier 4 write_tier 4 stripe_width 0
        removed_snaps [1~14d,150~27,178~8,183~8,18c~12,1a0~22,1c4~4,1c9~1b]

The setup is over a year old. In ceph status I can see flushing, promoting and evicting operations. Maybe that depends on my old version?
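For now I only watch it roughly like this (just a sketch; I am not sure how much of the cache-tier activity ceph osd pool stats reports in this old version):

# flush / evict / promote rates for the cache pool (output differs between releases)
watch -n 5 'ceph osd pool stats ssd'

# overall pool usage, to see how full the cache pool is against target_bytes
ceph df detail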
 
> Using cache-tiers can work beautifully if you understand the I/O patterns
> involved (tricky on a cloud storage with very mixed clients), can make your
> cache-tier large enough to cover the hot objects (working set) or at least (as
> you are attempting) to segregate the read and write paths as much as
> possible.
> 
Have you got any good method to analyze the workload?
I found these scripts https://github.com/cernceph/ceph-scripts and tried to look at reads and writes per request length, but how can I tell whether the I/O is random or sequential?
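The best I have so far is only a rough sketch (the device names are just examples from one OSD node, and dump_historic_ops only samples the slowest recent ops):

# pull the read/write extents (offset~length) out of the recent ops recorded by
# one OSD, so consecutive offsets on the same object can be compared
ceph daemon osd.0 dump_historic_ops | grep -oE '(read|write) [0-9]+~[0-9]+'

# rough heuristic on the HDD OSD disks: a large average request size (avgrq-sz)
# suggests sequential I/O, many small requests suggest random I/O
iostat -x 5 sda sdb sdc sdd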

> > We are using only RBD. Based
> > on the ceph docs, RBD has a bad I/O pattern for cache tiering.  I'm
> > looking for information about other possibilities to accelerate reads on
> > RBD with SSD drives.
> >
> The documentation rightly warns about things, so people don't have
> unrealistic expectations. However YOU need to look at YOUR loads, patterns
> and usage and then decide if it is beneficial or not.
> 
> As I hinted above, analyze your systems, are the reads actually slow or are
> they slowed down by competing writes to the same storage?
> 
> Cold reads (OSD server just rebooted, no cache has that object in it) will
> obviously not benefit from any scheme.
> 
> Reads from the HDD OSDs can very much benefit by having enough RAM to
> hold all the SLAB objects (direntry etc) in memory, so you can avoid disk
> access to actually find the object.
> 
> Speeding up the actual data read you have the option of the cache-tier (in
> writeback mode, with proper promotion and retention configuration).
> 
> Or something like bcache on the OSD servers, discussed here several times
> as well.
> 
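If we go the writeback route, I assume the switch for our pools would look roughly like this (the values are only placeholders, not tuned for our working set yet):

# switch the existing cache pool from readforward to writeback
ceph osd tier cache-mode ssd writeback

# promotion / retention knobs -- placeholder values only
ceph osd pool set ssd target_max_bytes 176093659136
ceph osd pool set ssd cache_target_dirty_ratio 0.4
ceph osd pool set ssd cache_target_full_ratio 0.8
ceph osd pool set ssd min_read_recency_for_promote 2
ceph osd pool set ssd min_write_recency_for_promote 1
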
> > The second question: is there any cache tier mode where the replica can be
> > set to 1, for the best use of the SSD space?
> >
> A cache-tier (the same true for any other real cache methods) will at any
> given time have objects in it that are NOT on the actual backing storage when
> it is used to cache writes.
> So it needs to be just as redundant as the rest of the system, at least a replica
> of 2 with sufficiently small/fast SSDs.
> 

OK, I understand.
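If we keep the cache tier I would rather go down to two copies than one, i.e. something like:

# reduce the cache pool from three replicas to two (never one while it holds dirty data)
ceph osd pool set ssd size 2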

> With bcache etc just caching reads, you can get away with a single replication
> of course, however failing SSDs may then cause your cluster to melt down.
> 

I will search the ML for this.
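Mostly as a note to myself, a read-oriented bcache setup on one OSD node could look roughly like this (device names and the cache-set UUID are only placeholders):

# format an SSD partition as the cache device and an HDD as a backing device
make-bcache -C /dev/sdf1
make-bcache -B /dev/sdb

# attach the backing device to the cache set (UUID from 'bcache-super-show /dev/sdf1')
echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach

# keep the cache in writethrough mode so a failing SSD cannot lose writes;
# the OSD would then be created on /dev/bcache0 instead of the raw HDD
echo writethrough > /sys/block/bcache0/bcache/cache_mode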

> Christian
> --
> Christian Balzer        Network/Systems Engineer
> chibi@xxxxxxx   	Rakuten Communications


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



