Re: Cache Tier configuration

Thank You for the quick response.

> -----Original Message-----
> From: Christian Balzer [mailto:chibi@xxxxxxx]
> Sent: Tuesday, July 19, 2016 3:39 PM
> To: ceph-users@xxxxxxxxxxxxxx
> Cc: Mateusz Skała <mateusz.skala@xxxxxxxxxxx>
> Subject: Re:  Cache Tier configuration
> 
> 
> Hello,
> 
> On Tue, 19 Jul 2016 15:15:55 +0200 Mateusz Skała wrote:
> 
> > Hello,
> >
> > > -----Original Message-----
> > > From: Christian Balzer [mailto:chibi@xxxxxxx]
> > > Sent: Wednesday, July 13, 2016 4:03 AM
> > > To: ceph-users@xxxxxxxxxxxxxx
> > > Cc: Mateusz Skała <mateusz.skala@xxxxxxxxxxx>
> > > Subject: Re:  Cache Tier configuration
> > >
> > >
> > > Hello,
> > >
> > > On Tue, 12 Jul 2016 11:01:30 +0200 Mateusz Skała wrote:
> > >
> > > > Thank You for the reply. Answers below.
> > > >
> > > > > -----Original Message-----
> > > > > From: Christian Balzer [mailto:chibi@xxxxxxx]
> > > > > Sent: Tuesday, July 12, 2016 3:37 AM
> > > > > To: ceph-users@xxxxxxxxxxxxxx
> > > > > Cc: Mateusz Skała <mateusz.skala@xxxxxxxxxxx>
> > > > > Subject: Re:  Cache Tier configuration
> > > > >
> > > > >
> > > > > Hello,
> > > > >
> > > > > On Mon, 11 Jul 2016 16:19:58 +0200 Mateusz Skała wrote:
> > > > >
> > > > > > Hello Cephers.
> > > > > >
> > > > > > Can someone help me with my cache tier configuration? I have 4
> > > > > > identical 176 GB SSD drives (184196208K) in the SSD pool; how do I
> > > > > > determine target_max_bytes?
> > > > >
> > > > > What exact SSD models are these?
> > > > > What version of Ceph?
> > > >
> > > > Intel DC S3610 (SSDSC2BX200G401), ceph version 9.2.1
> > > > (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd)
> > > >
> > >
> > > Good, these are decent SSDs and at 3 DWPD probably durable enough, too.
> > > You will want to monitor their wear-out level anyway, though.
> > >
> > > Remember, a dead cache pool means inaccessible and/or lost data.
> > >
> > > Jewel has improved cache controls and a different, less aggressive
> > > default behavior, you may want to consider upgrading to it,
> > > especially if you don't want to become a cache tiering specialist.
> > > ^o^
> > >
> > > Also Infernalis is no longer receiving updates.
> >
> > We are planning the upgrade for the first week of August.
> >
> You might want to wait until the next version of Jewel is out, unless you have
> a test/staging cluster to verify your upgrade procedure on.
> 
> Jewel is a better choice than Infernalis, but it still has a number of bugs and
> a LOT of massive changes that are poorly documented or not documented at all, which doesn't
> make me all that eager to upgrade right here, right now.
> 

We have a test cluster, but without a cache tier. We will wait for a stable Jewel release.
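
On the wear-out monitoring point from earlier: I plan to keep an eye on the S3610s with smartmontools. As far as I know the Intel drives expose a Media_Wearout_Indicator attribute, so something like this (the device path is just an example for one of our OSD disks) should be enough to graph over time:

    smartctl -A /dev/sdb | grep -i wear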

> > > > > > I assume that should be (4 drives * 188616916992 bytes) / 3 replicas =
> > > > > > 251489222656 bytes * 85% (because of the full-disk warning).
> > > > >
> > > > > In theory correct, but you might want to consider (like with all
> > > > > pools) the impact of losing a single SSD.
> > > > > In short, backfilling and then the remaining 3 getting full anyway.
> > > > >
> > > >
> > > > OK, so it is better to set target_max_bytes lower than the space I have? For
> > > > example 170 GB? Then I will have one OSD in reserve.
> > > >
> > > Something like this, though failures with these SSDs are very unlikely.
> > >
> > > > > > It will be 213765839257 bytes, ~200 GB. I made it a little bit lower
> > > > > > (160 GB) and after some time the whole cluster stopped on a full-disk error.
> > > > > > One of the SSD drives was full. I see that the space usage on the OSDs is not
> > > > > > equal:
> > > > > >
> > > > > > 32 0.17099  1.00000   175G   127G 49514M 72.47 1.77  95
> > > > > >
> > > > > > 42 0.17099  1.00000   175G   120G 56154M 68.78 1.68  90
> > > > > >
> > > > > > 37 0.17099  1.00000   175G   136G 39670M 77.95 1.90 102
> > > > > >
> > > > > > 47 0.17099  1.00000   175G   130G 46599M 74.09 1.80  97
> > > > > >
> > > > >
> > > > > What's the exact error message?
> > > > >
> > > > > None of these are over 85 or 95%, how are they full?
> > > >
> > > > osd.37 was 96% full after the error (health ERR, 1 full osd). Then I
> > > > set target_max_bytes to 100 GB. Flushing reduced the used space; now the cluster
> > > > is working OK, but I want to clarify my configuration.
> > > >
> > > Don't confuse flushing (copying dirty objects to the backing pool) with
> > > eviction (deleting, really zeroing, clean objects).
> > > Eviction is what frees up space, but it needs flushed (clean)
> > > objects to work with.
> > >
> >
> > OK, so if I understand correctly, it is eviction that frees the space?
> >
> Yes, re-read the relevant documentation.
> 
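
I did re-read it. To write down what I understand for our cache pool 'ssd' (these values are only my own plan based on the sizing discussion above, so please correct me if they are wrong):

    ceph osd pool set ssd target_max_bytes 170000000000
    ceph osd pool set ssd cache_target_dirty_ratio 0.4
    ceph osd pool set ssd cache_target_full_ratio 0.8

so flushing should start well before eviction has to, and eviction should kick in before the OSDs themselves get anywhere near full.
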
> > > >
> > > > >
> > > > > If the above is a snapshot of when Ceph thinks something is
> > > > > "full", it may be an indication that you've reached
> > > > > target_max_bytes and Ceph simply has no clean (flushed) objects
> > > > > ready to evict.
> > > > > Which means a configuration problem (all ratios, not the
> > > > > defaults, for this pool please) or your cache filling up faster than it can
> > > > > flush.
> > > > >
> > > > The snapshot above is from right now, when the cluster is working OK.
> > > > Filling faster than flushing is very possible; when the error
> > > > occurred I had the min 'promote' settings at 1 in the config, like this:
> > > >
> > > >     "osd_tier_default_cache_min_read_recency_for_promote": "1",
> > > >     "osd_tier_default_cache_min_write_recency_for_promote": "1",
> > > >
> > > > Now I changed this to 3, and it looks like it is working, 3 days without
> > > > a near-full osd.
> > > >
> > > There are a number of other options to control things, especially with Jewel.
> > > Also setting your cache mode to readforward might be a good idea
> > > depending on your use case.
> > >
> > I'm considering this move, especially since we are also using SSD journals.
> Journals are for writes, they don't affect reads, which would come from the
> HDD base backing pool.

I know that journals are for writes, but if I understand correctly, a cache tier in writeback mode is also used for writes, so each write goes journal SSD -> cache tier -> (after some time) cold storage.
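
By the way, besides the osd_tier_default_* values I changed, I assume the per-pool equivalents can also be set directly on the cache pool, something like the following (untested on this version, so treat it as a sketch), so the setting is tied to the pool itself rather than to the OSD defaults:

    ceph osd pool set ssd min_read_recency_for_promote 3
    ceph osd pool set ssd min_write_recency_for_promote 3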

> 
> > Please confirm, can I use a readforward cache tier with pool size 1? Is it
> > safe? Then I will have 3 times more space for the cache tier.
> >
> Definitely not.
> Even with the best, most trusted SSDs you want a replication size of 2, so you
> can survive an OSD or node failure, etc.
I thought that in readforward mode a failure of an SSD in the cache tier does not affect the backing storage, and that Ceph should then re-read objects from that storage. What is the workflow for Ceph if an OSD from the cache-tier pool fails in readforward mode?
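
If we do try readforward, I assume the change itself would be something along these lines (pool name as in my 'ceph osd pool ls detail' output below; the size values only reflect your advice, so please treat this as a sketch):

    ceph osd tier cache-mode ssd readforward
    ceph osd pool set ssd size 2
    ceph osd pool set ssd min_size 1

Is that the right order, or should the replication change be done first?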


> 
> 
> > > > > Space is never equal with Ceph, you need a high enough number of
> > > > > PGs for starters and then some fine-tuning.
> > > > >
> > > > > After fiddling with the weights my cache-tier SSD OSDs are all
> > > > > very close to each other:
> > > > > ---
> > > > > ID WEIGHT  REWEIGHT SIZE  USE    AVAIL  %USE  VAR
> > > > > 18 0.64999  1.00000  679G   543G   136G 79.96 4.35
> > > > > 19 0.67000  1.00000  679G   540G   138G 79.61 4.33
> > > > > 20 0.64999  1.00000  679G   534G   144G 78.70 4.28
> > > > > 21 0.64999  1.00000  679G   536G   142G 79.03 4.30
> > > > > 26 0.62999  1.00000  679G   540G   138G 79.57 4.33
> > > > > 27 0.62000  1.00000  679G   538G   140G 79.30 4.32
> > > > > 28 0.67000  1.00000  679G   539G   140G 79.35 4.32
> > > > > 29 0.69499  1.00000  679G   536G   142G 78.96 4.30
> > > > > ---
> > > > In your snapshot the used space is nearly equal, only 1% difference; I
> > > > have nearly 10% difference in used space. Does it depend on the number of
> > > > PGs, or maybe on the weights?
> > > >
> > > As I wrote, both.
> > > 10% suggests that you probably already have enough PGs, time to
> > > fine-tune the weights, see the differences in my list above.
> > >
> > I will check this.
> >
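
For the fine-tuning, am I right that this is done with small crush weight adjustments, e.g. (the osd id and value here are only an example):

    ceph osd df
    ceph osd crush reweight osd.37 0.16599

i.e. slightly lowering the weight of the fullest OSD and re-checking the distribution after backfill settles?
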
> > > > >
> > > > > >
> > > > > >
> > > > > > My setup:
> > > > > >
> > > > > > ceph --admin-daemon /var/run/ceph/ceph-osd.32.asok config show | grep cache
> > > > > >
> > > > > >
> > > > > Nearly all of these are irrelevant, output of "ceph osd pool ls detail"
> > > > > please, at least for the cache pool.
> > > >
> > > >
> > > > ceph osd pool ls detail
> > > > pool 2 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 2048 pgp_num 2048 last_change 68565 flags hashpspool min_read_recency_for_promote 1 min_write_recency_for_promote 1 stripe_width 0
> > > >         removed_snaps [1~2,4~12,17~2e,46~ad,f9~2,fd~2,101~2]
> > > > pool 4 'ssd' replicated size 3 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 128 pgp_num 128 last_change 68913 flags hashpspool,incomplete_clones tier_of 5 cache_mode writeback target_bytes 182536110080 hit_set bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 600s x6 stripe_width 0
> > > >         removed_snaps [1~3,6~2,9~2,d~8,17~6,1f~10,33~8,3f~a,4d~2,55~22,79~2]
> > > > pool 5 'sata' replicated size 3 min_size 1 crush_ruleset 2 object_hash rjenkins pg_num 128 pgp_num 128 last_change 68910 lfor 66807 flags hashpspool tiers 4 read_tier 4 write_tier 4 stripe_width 0
> > > >         removed_snaps [1~3,6~2,9~2,d~8,17~6,1f~10,33~8,3f~a,4d~2,55~22,79~2]
> > > >
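
One thing I notice in the listing above: the 'ssd' pool has hit_set 600s x6, i.e. 6 hit sets of 10 minutes each (about 1 hour of history), so if I understand the recency logic correctly, the value of 3 I set for the min_*_recency_for_promote settings only considers the last ~30 minutes of access history. Please correct me if that reading is wrong.
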
> > > I'd go for 256 PGs. How big (how many OSDs) is your "sata" pool?
> > >
> >
> > "sata" pool has 16OSDs, 1024PGs
> >
> According to your output above the 'sata' pool has 128 PGs...

Yes, that was a mistake. In the meantime I changed it to 512 PGs (I only considered setting it to 1024 PGs but did not do it).

> 
> And if you meant the RBD pool, that's still way too low; at 160 OSDs with a
> replication of 3 you should have at least 4096 PGs.
> 

There are 16 OSDs, not 160 OSDs :)
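
If the pool really spans 16 OSDs, then the usual rule of thumb (OSDs * 100 / replica count, rounded to a power of two) gives (16 * 100) / 3 ≈ 533, so the 512 PGs I set seems to be in the right range, and 1024 would only make sense if we plan to grow that pool. Does that reasoning look correct?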

> Christian
> 
> > > Christian
> > >
> > > > Cache tier on 'ssd' pool for 'sata' pool.
> > > >
> > > > >
> > > > > Have you read the documentation and my thread in this ML labeled
> > > > > "Cache tier operation clarifications"?
> > > >
> > > > I have read the documentation and an Intel blog post
> > > > (https://software.intel.com/en-us/blogs/2015/03/03/ceph-cache-tiering-introduction);
> > > > I will now search for your posts and read them.
> > > >
> > > > >
> > > > > >
> > > > > > Can someone help? Any ideas? Is it normal that the whole cluster
> > > > > > stops on a disk-full error on the cache tier? I was thinking that
> > > > > > only one of the pools would stop, and the others without a cache tier
> > > > > > should still work.
> > > > > >
> > > > > Once you activate a cache tier it becomes, for all intents and
> > > > > purposes, the pool it's caching for.
> > > > > So any problem with it will be fatal.
> > > >
> > > > OK.
> > > >
> > > > >
> > > > > Christian
> > > > > --
> > > > > Christian Balzer        Network/Systems Engineer
> > > > > chibi@xxxxxxx   	Global OnLine Japan/Rakuten
> Communications
> > > > > http://www.gol.com/
> > > >
> > > > Thank You for Your help.
> > > > Mateusz
> > > >
> > > >
> > >
> > >
> > > --
> > > Christian Balzer        Network/Systems Engineer
> > > chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
> > > http://www.gol.com/
> >
> > Regards
> > Mateusz
> >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@xxxxxxxxxxxxxx
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> --
> Christian Balzer        Network/Systems Engineer
> chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
> http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



