Re: strange cache tier behaviour with cephfs

Hello,

On Mon, 13 Jun 2016 16:52:19 -0700 Samuel Just wrote:

> I'd have to look more closely, but these days promotion is
> probabilistic and throttled.  
Unconfigurable and exclusively so?

>During each read of those objects, it
> will tend to promote a few more of them depending on how many
> promotions are in progress and how hot it thinks a particular object
> is.  

Again, are there any knobs to control this behavior?
In my use case I, for one, would want writes to ALWAYS result in a full
promotion to the cache pool when needed.
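
(The closest things I can find are the global OSD-level throttles, which I
assume are what "throttled" refers to; bumping them cluster-wide would
presumably look something like the below, values made up, option names as I
remember them from the Jewel config reference:

ceph tell osd.* injectargs '--osd_tier_promote_max_bytes_sec 104857600'
ceph tell osd.* injectargs '--osd_tier_promote_max_objects_sec 1000'

But nothing per-pool as far as I can tell.)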

As for reads, once I've determined that Jewel and this mode won't eat my
data, I'll switch to read-forward.
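
(Presumably just something along the lines of

ceph osd tier cache-mode <cachepool> readforward --yes-i-really-mean-it

with <cachepool> as a placeholder; I haven't double-checked whether Jewel
still requires the override flag for readforward.)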

Christian

>The lack of a speed up is a bummer, but I guess you aren't
> limited by the disk throughput here for some reason.  Writes can also
> be passed directly to the backing tier depending on similar factors.
> 
> It's usually helpful to include the version you are running.
> -Sam
> 
> On Mon, Jun 13, 2016 at 3:37 PM, Oliver Dzombic <info@xxxxxxxxxxxxxxxxx>
> wrote:
> > Hi,
> >
> > I am admittedly not very experienced yet with Ceph or with cache tiering,
> > but to me it seems to behave strangely.
> >
> > Setup:
> >
> > pool 3 'ssd_cache' replicated size 2 min_size 1 crush_ruleset 1
> > object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 190 flags
> > hashpspool,incomplete_clones tier_of 4 cache_mode writeback
> > target_bytes 800000000000 hit_set bloom{false_positive_probability:
> > 0.05, target_size: 0, seed: 0} 3600s x1 decay_rate 0 search_last_n 0
> > stripe_width 0
> >
> > pool 4 'cephfs_data' replicated size 2 min_size 1 crush_ruleset 2
> > object_hash rjenkins pg_num 1024 pgp_num 1024 last_change 169 lfor 144
> > flags hashpspool crash_replay_interval 45 tiers 3 read_tier 3
> > write_tier 3 stripe_width 0
> >
> > pool 5 'cephfs_metadata' replicated size 2 min_size 1 crush_ruleset 1
> > object_hash rjenkins pg_num 128 pgp_num 128 last_change 191 flags
> > hashpspool stripe_width 0
> >
> > hit_set_count: 1
> > hit_set_period: 120
> > target_max_bytes: 800000000000
> > min_read_recency_for_promote: 0
> > min_write_recency_for_promote: 0
> > target_max_objects: 0
> > cache_target_dirty_ratio: 0.5
> > cache_target_dirty_high_ratio: 0.8
> > cache_target_full_ratio: 0.9
> > cache_min_flush_age: 1800
> > cache_min_evict_age: 3600
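> >
> > For completeness, roughly how the tier and the tunables above would have
> > been set up (reconstructed from the docs, not a verbatim shell history, so
> > the exact order may differ):
> >
> > ceph osd tier add cephfs_data ssd_cache
> > ceph osd tier cache-mode ssd_cache writeback
> > ceph osd tier set-overlay cephfs_data ssd_cache
> > ceph osd pool set ssd_cache hit_set_type bloom
> > ceph osd pool set ssd_cache hit_set_count 1
> > ceph osd pool set ssd_cache hit_set_period 120
> > ceph osd pool set ssd_cache target_max_bytes 800000000000
> > ceph osd pool set ssd_cache cache_target_dirty_ratio 0.5
> > ceph osd pool set ssd_cache cache_target_dirty_high_ratio 0.8
> > ceph osd pool set ssd_cache cache_target_full_ratio 0.9
> > ceph osd pool set ssd_cache cache_min_flush_age 1800
> > ceph osd pool set ssd_cache cache_min_evict_age 3600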
> >
> > rule ssd-cache-rule {
> >         ruleset 1
> >         type replicated
> >         min_size 2
> >         max_size 10
> >         step take ssd-cache
> >         step chooseleaf firstn 0 type host
> >         step emit
> > }
> >
> >
> > rule cold-storage-rule {
> >         ruleset 2
> >         type replicated
> >         min_size 2
> >         max_size 10
> >         step take cold-storage
> >         step chooseleaf firstn 0 type host
> >         step emit
> > }
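> >
> > The rules were then assigned to the pools with something like (again
> > reconstructed, matching the crush_ruleset values in the pool dumps above):
> >
> > ceph osd pool set ssd_cache crush_ruleset 1
> > ceph osd pool set cephfs_metadata crush_ruleset 1
> > ceph osd pool set cephfs_data crush_ruleset 2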
> >
> >
> >
> > [root@cephmon1 ceph-cluster-gen2]# rados -p ssd_cache ls
> > [root@cephmon1 ceph-cluster-gen2]#
> > -> empty
> >
> > Now, on a client with CephFS mounted, I have some files.
> >
> > Read operation:
> >
> > dd if=testfile of=/dev/zero
> >
> > 1494286336 bytes (1.5 GB) copied, 11.047 s, 135 MB/s
> >
> >
> > [root@cephosd1 ~]# rados -p ssd_cache ls
> > 1000000001e.00000010
> > 1000000001e.00000004
> > 1000000001e.00000001
> > 1000000001e.0000000c
> > 1000000001e.00000008
> > 1000000001e.00000003
> > 1000000001e.00000000
> > 1000000001e.00000002
> >
> > Running this multiple times in a row does not change the content; it is
> > always the same objects.
> >
> > -------------
> >
> > OK, so according to the documentation for writeback mode, the data moved
> > from cold storage to hot storage (cephfs_data to ssd_cache in my case).
> >
> >
> > Now I repeat it:
> >
> > dd if=testfile of=/dev/zero
> >
> > 1494286336 bytes (1.5 GB) copied, 11.311 s, 132 MB/s
> >
> >
> > [root@cephosd1 ~]# rados -p ssd_cache ls
> > 1000000001e.00000010
> > 1000000001e.00000004
> > 1000000001e.00000001
> > 1000000001e.0000000c
> > 1000000001e.0000000d
> > 1000000001e.00000005
> > 1000000001e.00000008
> > 1000000001e.00000015
> > 1000000001e.00000011
> > 1000000001e.00000006
> > 1000000001e.00000003
> > 1000000001e.00000009
> > 1000000001e.00000000
> > 1000000001e.0000000a
> > 1000000001e.0000001b
> > 1000000001e.00000002
> >
> >
> > So why does the cache pool now contain the old 8 objects plus another 8?
> >
> > Repeating this grows the number of objects in ssd_cache endlessly,
> > without ever speeding up the dd.
> >
> > So on every new dd read of exactly the same file (which to me means the
> > same PGs/objects), the same data is copied from the cold pool to the cache
> > pool and from there pushed to the client, without any speed gain.
> >
> > That is not supposed to happen, according to the documentation for the
> > writeback cache mode.
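> >
> > If it helps: a quick way to confirm whether full promotions really happen
> > on every read would presumably be the OSD perf counters, assuming the
> > tier_promote counter exists in this version (run on the host carrying the
> > OSD in question):
> >
> > ceph daemon osd.0 perf dump | grep -i promote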
> >
> > Something similar happens when I write.
> >
> > If I write, the data is stored in the cold pool and the cache pool equally.
> >
> > My understanding is that, with my configuration, at least 1800 seconds
> > (cache_min_flush_age) should pass before the agent starts to flush from
> > the cache pool to the cold pool.
> >
> > But it does not.
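> >
> > One way to watch this would presumably be the DIRTY column of
> >
> > ceph df detail
> >
> > which, as far as I understand, shows the cache pool's dirty object count
> > and should only start shrinking once the agent flushes.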
> >
> > So, is there something specific to CephFS here, or is my config just
> > crappy and I have no idea what I am doing?
> >
> > Any input is highly welcome!
> >
> > Thank you !
> >
> >
> > --
> > Mit freundlichen Gruessen / Best regards
> >
> > Oliver Dzombic
> > IP-Interactive
> >
> > mailto:info@xxxxxxxxxxxxxxxxx
> >
> > Address:
> >
> > IP Interactive UG ( haftungsbeschraenkt )
> > Zum Sonnenberg 1-3
> > 63571 Gelnhausen
> >
> > HRB 93402, Hanau District Court
> > Managing director: Oliver Dzombic
> >
> > Tax no.: 35 236 3622 1
> > VAT ID: DE274086107
> >


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



