Hi Christian,

Interesting use case :-)

How many OSDs / hosts do you have? And how are they connected together?

Cheers

On 07/10/2015 04:58, Christian Balzer wrote:
>
> Hello,
>
> A bit of back story first; it may prove educational for others and
> future generations.
>
> As some may recall, I have a Firefly production cluster with a storage
> node design that was both optimized for the use case at the time and
> estimated to have the capacity to support 140 VMs (all running the same
> OS/application, hence the predictable usage pattern).
> Alas, people started running different VMs, and my request for new HW
> was also delayed.
>
> So now there are 280 VMs doing nearly exclusively writes (8 MB/s, 1000
> Ceph ops), and the cluster handles this steady state without breaking a
> sweat (avio is less than 0.01 ms and the "disks" are less than 5% busy),
> nicely validating my design for this use case. ^o^
>
> It becomes slightly irritated when asked to do reads (like VM reboots).
> Those will drive utilization up to 100% at times, though avio is still
> reasonable at less than 5 ms.
> This is also why I disabled scrubbing 9 months ago, when the cluster hit
> my expected capacity limit (and I asked for more HW).
>
> However, when trying to add the new node (a much faster design in
> several ways) to the cluster, the resulting backfilling (when merely
> adding the first OSD to the CRUSH map, not even starting it) totally
> kills things, with avio frequently over 100 ms and VMs consequently
> croaking left and right.
> This was of course with all the recently discussed backfill and recovery
> parameters tuned all the way down.
>
> There simply is no maintenance window long enough to phase in that 3rd
> node.
> This finally got the attention of the people who approve HW orders, and
> now the tack seems to be "fix it, whatever it takes". ^o^
>
> So the least invasive plan I've come up with so far is to create an
> SSD-backed cache tier pool, wait until most (hot) objects have made it
> in there and the old (now backing) pool has gone mostly quiescent, and
> then add that additional node and rebuild the older ones as planned.
>
> The size of that SSD cache pool would be at least 80% of the total
> current data (which of course isn't all hot). So, do people with actual
> experience of cache tiers under Firefly that aren't under constant
> pressure to evict things think this is feasible?
>
> Again, I think that based on the cache size I can tune things to avoid
> evictions and flushes, but if it should start flushing things, for
> example, is that an asynchronous operation or will it impede the
> performance of the cache tier?
> That is, does the flushing of an object have to finish before the object
> can be written to again?
>
> Obviously I can't do anything about slow reads from the backing pool for
> objects that somehow didn't make it into the cache yet. But while slow
> reads are not nice, it is slow WRITES that really upset the VMs and the
> application they run.
>
> Clearly what I'm worried about here is that the old pool will be quite
> comatose (as mentioned above) while backfilling/recovering during that
> time.
>
> Regards,
>
> Christian

--
Loïc Dachary, Artisan Logiciel Libre
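[For context: the "backfill and recovery parameters tuned all the way down" that Christian mentions are standard Ceph OSD options. A sketch of what that throttling typically looks like on a Firefly-era cluster; the option names are real, the values are illustrative, and ceph.conf persistence is assumed to be handled separately:]

```shell
# Throttle backfill/recovery so client I/O keeps priority (values illustrative).
# injectargs applies at runtime; mirror these under [osd] in ceph.conf to persist.
ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph tell osd.* injectargs '--osd-recovery-max-active 1'
ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
ceph tell osd.* injectargs '--osd-client-op-priority 63'
```

Even with these at minimum, a large backfill can still saturate slow spindles, which matches the symptoms described above.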
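[The cache tier setup being proposed would look roughly like the following under Firefly. The pool names "rbd" (existing backing pool) and "cache" (new SSD pool) are hypothetical, and the size/ratio values are illustrative; the commands themselves are the standard `ceph osd tier` / `ceph osd pool set` interface:]

```shell
# Hypothetical pool names: "rbd" = existing backing pool, "cache" = new SSD pool.
ceph osd tier add rbd cache
ceph osd tier cache-mode cache writeback
ceph osd tier set-overlay rbd cache

# Firefly requires a hit set for the tiering agent's promotion decisions.
ceph osd pool set cache hit_set_type bloom
ceph osd pool set cache hit_set_count 1
ceph osd pool set cache hit_set_period 3600

# Size the tier generously and push the flush/evict thresholds high,
# so the agent rarely flushes or evicts (per the plan above).
ceph osd pool set cache target_max_bytes 1099511627776   # 1 TiB, illustrative
ceph osd pool set cache cache_target_dirty_ratio 0.9
ceph osd pool set cache cache_target_full_ratio 0.95
```

With the thresholds that high, flushing should only start once the tier genuinely fills, which is the scenario Christian hopes to avoid by sizing the cache at ~80% of the current data.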
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com