Interesting... see below

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Christian Balzer
> Sent: 01 March 2016 08:20
> To: ceph-users@xxxxxxxxxxxxxx
> Cc: Nick Fisk <nick@xxxxxxxxxx>
> Subject: Re: Cache tier weirdness
>
> Talking to myself again ^o^, see below:
>
> On Sat, 27 Feb 2016 01:48:49 +0900 Christian Balzer wrote:
>
> > Hello Nick,
> >
> > On Fri, 26 Feb 2016 09:46:03 -0000 Nick Fisk wrote:
> >
> > > Hi Christian,
> > >
> > > > -----Original Message-----
> > > > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
> > > > Behalf Of Christian Balzer
> > > > Sent: 26 February 2016 09:07
> > > > To: ceph-users@xxxxxxxxxxxxxx
> > > > Subject: Cache tier weirdness
> > > >
> > > > Hello,
> > > >
> > > > still my test cluster with 0.94.6.
> > > > It's a bit fuzzy, but I don't think I saw this with Firefly, but
> > > > then again that is totally broken when it comes to cache tiers
> > > > (switching between writeback and forward mode).
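For anyone trying to reproduce this: a writeback cache tier like the 'goat' pool shown in the quoted `ceph osd pool ls detail` output below is typically attached with commands along these lines. This is a sketch using the values from the quoted output, not the poster's exact commands.

```shell
# Sketch: attach a small writeback cache pool ("goat") to the "rbd"
# base pool, using Hammer-era cache tiering commands. Pool names and
# the 512MB target are taken from the quoted output; ratios from the
# settings mentioned further down the thread.
ceph osd tier add rbd goat
ceph osd tier cache-mode goat writeback
ceph osd tier set-overlay rbd goat
ceph osd pool set goat target_max_bytes 524288000        # 512MB
ceph osd pool set goat cache_target_dirty_ratio 0.5
ceph osd pool set goat cache_target_full_ratio 0.9
```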
> > > >
> > > > goat is a cache pool for rbd:
> > > > ---
> > > > # ceph osd pool ls detail
> > > > pool 2 'rbd' replicated size 3 min_size 1 crush_ruleset 2
> > > > object_hash rjenkins pg_num 512 pgp_num 512 last_change 11729
> > > > lfor 11662 flags hashpspool tiers 9 read_tier 9 write_tier 9
> > > > stripe_width 0
> > > >
> > > > pool 9 'goat' replicated size 1 min_size 1 crush_ruleset 3
> > > > object_hash rjenkins pg_num 128 pgp_num 128 last_change 11730
> > > > flags hashpspool,incomplete_clones tier_of 2 cache_mode writeback
> > > > target_bytes 524288000 hit_set bloom{false_positive_probability:
> > > > 0.05, target_size: 0, seed: 0} 3600s x1 stripe_width 0
> > > > ---
> > > >
> > > > Initial state is this:
> > > > ---
> > > > # rados df
> > > > pool name          KB   objects  clones  degraded  unfound      rd     rd KB       wr      wr KB
> > > > goat               34       429       0         0        0    1051   4182046   145803   10617422
> > > > rbd         164080702     40747       0         0        0  419664  71142697  4430922  531299267
> > > >   total used    599461060  41176
> > > >   total avail  5301740284
> > > >   total space  5940328912
> > > > ---
> > > >
> > > > First we put some data in there with
> > > > "rados -p rbd bench 20 write -t 32 --no-cleanup"
> > > > which easily exceeds the target bytes of 512MB and gives us:
> > > > ---
> > > > pool name       KB   objects
> > > > goat        356386       372
> > > > ---
> > > >
> > > > For starters, that's not the number I would have expected given
> > > > how this is configured:
> > > > cache_target_dirty_ratio: 0.5
> > > > cache_target_full_ratio: 0.9
> > > >
> > > > Let's ignore (but not forget) that discrepancy for now.
> > >
> > > One of the things I have noticed is that whilst target_max_bytes
> > > is set per pool, it's actually acted on per PG. So each PG will
> > > flush/evict based on its share of the pool capacity.
> > > Depending on where data resides, PGs will normally have differing
> > > amounts of data stored, which leads to inconsistent cache pool
> > > flush/eviction limits. I believe there is also a "slop" factor in
> > > the cache code so that the caching agents are not always working
> > > on hard limits. I think with artificially small cache sizes, both
> > > of these cause adverse effects.
> >
> > Interesting, that goes a long way to explain this mismatch.
> > It is probably a spawn of the same logic that warns about too many
> > or too few PGs per OSD in Hammer by averaging the numbers, totally
> > ignoring the actual usage per OSD.
> >
> I did that test with target_max_bytes increased to 50GB and had about
> the same odd ratio.
> Then I thought, hmm, slightly low number of PGs for this pool, and
> increased it from 128 to 512.
> That dropped things even further, from 36GB to about 25GB (or half
> the configured target) for the eviction threshold.

I wonder if there is a way to use du to walk the PG directory tree and
confirm whether any of the PGs are sitting at the 80% mark. If my rough
maths is correct, each PG should evict when it has around 80-90MB in it.

Actually, I wonder if that is the problem: there is not a lot of
capacity per PG if you have 512 PGs over 50GB. When you are dealing with
4MB objects, you only need a few unbalanced ones and that can easily
shift the percentages by quite a bit. That might explain why it got
worse when you increased the number of PGs.
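The rough maths above can be checked with a quick back-of-the-envelope calculation. All the values come from the thread; the per-PG split of target_max_bytes is Nick's observation about the agent's behaviour, not documented API:

```shell
# Hedged sketch: if target_max_bytes is effectively divided per PG,
# each PG gets ~100MB of the 50GB target and evicts at ~90MB.
target_max_bytes=$((50 * 1024 * 1024 * 1024))   # 50GB, the raised target
pg_num=512                                       # after the pg_num increase
dirty_pct=50                                     # cache_target_dirty_ratio 0.5
full_pct=90                                      # cache_target_full_ratio 0.9

per_pg=$((target_max_bytes / pg_num))            # each PG's share of the target
flush_at=$((per_pg * dirty_pct / 100))           # per-PG flush threshold
evict_at=$((per_pg * full_pct / 100))            # per-PG eviction threshold

echo "per-PG share: $per_pg bytes"               # 104857600 (100MB)
echo "flush at:     $flush_at bytes"             # 52428800  (50MB)
echo "evict at:     $evict_at bytes"             # 94371840  (90MB)
```

A single 4MB object is then ~4% of a PG's share, so a handful of unbalanced objects really can swing a PG's fill percentage. On FileStore OSDs the per-PG usage could be eyeballed with something like `du -sh /var/lib/ceph/osd/ceph-*/current/9.*_head` on the cache-tier nodes (path assumed from FileStore defaults; pool 9 is 'goat' in the quoted output).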
Agree, although in my case I found it was the flushing rather than the
eviction that was the bigger problem. I was always seeing random high
flushing or, even worse, flushing-full alerts.

I did come up with the idea that in the cache tier, non-dirty objects
could be stored with only replica x1 and then, if dirtied, stored with
replica x3, to let you get more usable cache for your money. However, I
think it's a lot more involved than I suspected, so don't hold your
breath.

Once I'm done with the promotion throttling stuff I might take a look
into this and see if I can work out exactly what's going on, and
confirm whether it is just unbalanced PGs or something more sinister.

> > Christian
> >
> > My test cluster has 4 OSDs in the cache pool, the production one
> > will have 8 (and 2TB of raw data), and while not exactly
> > artificially small I foresee lots of parameter fondling.
> >
> > > > After doing a read with "rados -p rbd bench 20 rand -t 32", to
> > > > my utter bafflement I get:
> > > > ---
> > > > pool name       KB   objects
> > > > goat          8226       199
> > > > ---
> > > >
> > > > And after a second read it's all gone; looking at the network
> > > > traffic, it all originated from the base pool nodes and got
> > > > relayed through the node hosting the cache pool:
> > > > ---
> > > > pool name       KB   objects
> > > > goat            34       191
> > > > ---
> > > >
> > > > I verified that the actual objects are on the base pool with
> > > > 4MB each, while their "copies" are on the cache pool OSDs with
> > > > zero length.
> > > >
> > > > Can anybody unbaffle me? ^o^
> > >
> > > Afraid not, but I will try my best. In Hammer you still don't have
> > > proxy writes, so that's why the write test fills up your cache
> > > tier. Proxy reads mean that if the object is not in cache it will
> > > be retrieved from the base tier and only promoted if it gets
> > > sequential hits across the hitsets defined by min_recency.
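For reference, the hitset and recency behaviour being discussed is driven by pool settings along these lines. This is a hedged sketch for a Hammer-era cluster; the values mirror the quoted ls detail output, and the recency knob only appeared around Hammer:

```shell
# Sketch: hitset and promotion-recency knobs on the "goat" cache pool.
# Values match the quoted "hit_set bloom{...} 3600s x1" output.
ceph osd pool set goat hit_set_type bloom
ceph osd pool set goat hit_set_count 1         # the "x1" in ls detail
ceph osd pool set goat hit_set_period 3600     # the "3600s" in ls detail
# Require hits in N recent hitsets before a read promotes an object
# (unset/low values mean objects are promoted on first access).
ceph osd pool set goat min_read_recency_for_promote 1
```

With hit_set_count at 1 there is only ever a single hitset, so requiring hits across multiple hitsets cannot apply, which fits Christian's observation below that recency should not play a role here.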
> >
> > Note that per the ls detail output, no recency is set, thus it
> > SHOULD not play any role here.
> >
> > > I believe that the hitsets are stored as hidden objects in the
> > > pool. Hitsets also only get created when you do IO; the agent just
> > > sleeps otherwise. I'm wondering if the hitset creation is causing
> > > the cached bench_data objects to be evicted? I.e. the eviction
> > > code is assuming a hitset object is bigger than it actually is?
> > >
> > You're talking about hitset object sizes, and in the ls detail
> > output we have "target_size: 0", alas I have not the faintest
> > inkling what this pertains to, nor does there seem to be a way to
> > configure it.
> >
> > Regards,
> >
> > Christian
> >
> > > > Christian
> > > > --
> > > > Christian Balzer        Network/Systems Engineer
> > > > chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> > > > http://www.gol.com/
> > > > _______________________________________________
> > > > ceph-users mailing list
> > > > ceph-users@xxxxxxxxxxxxxx
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com