Hello,

A bit of back story first; it may prove educational for others and future generations.

As some may recall, I have a firefly production cluster with a storage node design that was both optimized for the use case at the time and estimated to have the capacity to support 140 VMs (all running the same OS/application, thus the predictable usage pattern). Alas, people started running different VMs, and my request for new HW was delayed as well. So now there are 280 VMs doing nearly exclusively writes (8MB/s, 1000 ceph ops), and the ceph cluster handles this steady state w/o breaking a sweat (avio is less than 0.01ms and "disks" are less than 5% busy), basically nicely validating my design for this use case. ^o^

It becomes slightly irritated when asked to do reads (like VM reboots). Those will drive utilization up to 100% at times, but avio is still reasonable at less than 5ms. This is also why I disabled scrubbing 9 months ago, when the cluster did hit my expected capacity limit (and I asked for more HW).

However, when trying to add the new node (a much faster design in several ways) to the cluster, the resulting backfilling (triggered by merely adding the first OSD to the CRUSH map, not even starting it or anything) totally kills things, with avio frequently over 100ms and thus VMs croaking left and right. This was of course with all the recently discussed backfill and recovery parameters tuned all the way down. There simply is no maintenance window long enough to phase in that 3rd node.

This finally got the attention of the people who approve HW orders, and now the tack seems to be "fix it, whatever it takes". ^o^

So the least invasive plan I've come up with so far is to create an SSD-backed cache tier pool, wait until most (hot) objects have made it in there and the old (now backing) pool has gone mostly quiescent, and then add that additional node and re-build the older ones as planned. (A rough sketch of the commands I have in mind is at the very end of this mail.)

The size of that SSD cache pool would be at least 80% of the total current data (which of course isn't all hot). So, do people who have actual experience with cache tiers under firefly that aren't under constant pressure to evict things think this is feasible?

Again, I think that based on the cache size I can tune things to avoid evictions and flushes, but if it should start flushing things, for example, is that an asynchronous operation or will it impede performance of the cache tier? As in, does the flushing of an object have to be finished before it can be written to again?

Obviously I can't do anything about slow reads from the backing pool for objects that somehow didn't make it into the cache yet. But while slow reads are not nice, it is slow WRITES that really upset the VMs and the application they run. Clearly what I'm worried about here is that the old pool, while backfilling/recovering, will be quite comatose (as mentioned above) during that time.

Regards,

Christian
--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/
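
P.S. For concreteness, here is roughly the command sequence I have in mind (firefly syntax). This is only a sketch: the pool names ("rbd" as the backing pool, "cache-pool"), the pg count, the ruleset number and all the thresholds are placeholders I'd still have to adapt to the actual SSD capacity, so treat it as illustration rather than a recipe.

    # Backfill/recovery throttles already in place (the usual suspects):
    ceph tell osd.\* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_op_priority 1'

    # Create the cache pool on the SSD CRUSH ruleset (name, pg count and
    # ruleset number are made up):
    ceph osd pool create cache-pool 512 512
    ceph osd pool set cache-pool crush_ruleset 1

    # Put it in front of the existing pool as a writeback tier:
    ceph osd tier add rbd cache-pool
    ceph osd tier cache-mode cache-pool writeback
    ceph osd tier set-overlay rbd cache-pool

    # The tiering agent needs a hit set to make flush/evict decisions:
    ceph osd pool set cache-pool hit_set_type bloom
    ceph osd pool set cache-pool hit_set_count 1
    ceph osd pool set cache-pool hit_set_period 3600

    # Sizing and flush/evict knobs: target_max_bytes a bit below the raw SSD
    # capacity, ratios and min ages set high so the agent stays idle:
    ceph osd pool set cache-pool target_max_bytes 4398046511104   # ~4TB, placeholder
    ceph osd pool set cache-pool cache_target_dirty_ratio 0.9
    ceph osd pool set cache-pool cache_target_full_ratio 0.95
    ceph osd pool set cache-pool cache_min_flush_age 86400
    ceph osd pool set cache-pool cache_min_evict_age 86400

The idea behind the high dirty/full ratios and the long min flush/evict ages is to keep the tiering agent from touching anything while the backing pool is busy backfilling, so the only traffic hitting the old pool would be reads for objects that haven't been promoted yet.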