Hi Christian,

Interesting use case :-)

How many OSDs / hosts do you have? And how are they connected together?

Cheers

On 07/10/2015 04:58, Christian Balzer wrote:
>
> Hello,
>
> A bit of back story first; it may prove educational for others and
> future generations.
>
> As some may recall, I have a Firefly production cluster with a storage
> node design that was both optimized for the use case at the time and
> estimated to have the capacity to support 140 VMs (all running the same
> OS/application, hence the predictable usage pattern).
> Alas, people started running different VMs, and my request for new HW
> was also delayed.
>
> So now there are 280 VMs doing nearly exclusively writes (8 MB/s, 1000
> Ceph ops), and the cluster handles this steady state without breaking a
> sweat (avio is less than 0.01 ms and the "disks" are less than 5% busy),
> nicely validating my design for this use case. ^o^
>
> It becomes slightly irritated when asked to do reads (like VM reboots).
> Those will drive utilization up to 100% at times, though avio is still
> reasonable at less than 5 ms.
> This is also why I disabled scrubbing 9 months ago, when the cluster hit
> my expected capacity limit (and I asked for more HW).
>
> However, when trying to add the new node (a much faster design in
> several ways) to the cluster, the resulting backfilling (when merely
> adding the first OSD to the CRUSH map, not even starting it) totally
> kills things, with avio frequently over 100 ms and VMs consequently
> croaking left and right.
> This was of course with all the recently discussed backfill and recovery
> parameters tuned all the way down.
>
> There simply is no maintenance window long enough to phase in that 3rd
> node.
> This finally got the attention of the people who approve HW orders, and
> now the tack seems to be "fix it, whatever it takes". ^o^
>
> So the least invasive plan I've come up with so far is to create an
> SSD-backed cache tier pool, wait until most (hot) objects have made it
> in there and the old (now backing) pool has gone mostly quiescent, and
> then add that additional node and rebuild the older ones as planned.
>
> The size of that SSD cache pool would be at least 80% of the total
> current data (which of course isn't all hot). So, do people with actual
> experience of cache tiers under Firefly that aren't under constant
> pressure to evict things think this is feasible?
>
> Again, I think that based on the cache size I can tune things to avoid
> evictions and flushes, but if it should start flushing things, for
> example, is that an asynchronous operation or will it impede the
> performance of the cache tier?
> That is, does the flushing of an object have to finish before the object
> can be written to again?
>
> Obviously I can't do anything about slow reads from the backing pool for
> objects that somehow didn't make it into the cache yet. But while slow
> reads are not nice, it is slow WRITES that really upset the VMs and the
> application they run.
>
> Clearly what I'm worried about here is that the old pool will be quite
> comatose (as mentioned above) while backfilling/recovering during that
> time.
>
> Regards,
>
> Christian

--
Loïc Dachary, Artisan Logiciel Libre
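[For context: the "backfill and recovery parameters tuned all the way down" that Christian mentions are standard Ceph OSD options. A sketch of what that throttling typically looks like on a Firefly-era cluster; the option names are real, the values are illustrative, and ceph.conf persistence is assumed to be handled separately:]

```shell
# Throttle backfill/recovery so client I/O keeps priority (values illustrative).
# injectargs applies at runtime; mirror these under [osd] in ceph.conf to persist.
ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph tell osd.* injectargs '--osd-recovery-max-active 1'
ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
ceph tell osd.* injectargs '--osd-client-op-priority 63'
```

Even with these at minimum, a large backfill can still saturate slow spindles, which matches the symptoms described above.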
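[The cache tier setup being proposed would look roughly like the following under Firefly. The pool names "rbd" (existing backing pool) and "cache" (new SSD pool) are hypothetical, and the size/ratio values are illustrative; the commands themselves are the standard `ceph osd tier` / `ceph osd pool set` interface:]

```shell
# Hypothetical pool names: "rbd" = existing backing pool, "cache" = new SSD pool.
ceph osd tier add rbd cache
ceph osd tier cache-mode cache writeback
ceph osd tier set-overlay rbd cache

# Firefly requires a hit set for the tiering agent's promotion decisions.
ceph osd pool set cache hit_set_type bloom
ceph osd pool set cache hit_set_count 1
ceph osd pool set cache hit_set_period 3600

# Size the tier generously and push the flush/evict thresholds high,
# so the agent rarely flushes or evicts (per the plan above).
ceph osd pool set cache target_max_bytes 1099511627776   # 1 TiB, illustrative
ceph osd pool set cache cache_target_dirty_ratio 0.9
ceph osd pool set cache cache_target_full_ratio 0.95
```

With the thresholds that high, flushing should only start once the tier genuinely fills, which is the scenario Christian hopes to avoid by sizing the cache at ~80% of the current data.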
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com